Introduction
Large Language Models (LLMs) such as GPT‑4, LLaMA, and BLOOM have transformed natural‑language processing, but training and serving them at scale demands massive GPU resources, high‑speed networking, and sophisticated orchestration. Amazon Elastic Kubernetes Service (EKS) provides a managed, production‑grade Kubernetes platform that can run distributed GPU workloads, while integrating tightly with AWS services for security, observability, and cost management.
This article walks you through end‑to‑end optimization of distributed GPU workloads for LLMs on Amazon EKS. We’ll cover:
- The architectural foundations of distributed training on GPUs.
- How to provision an EKS cluster with GPU‑enabled node groups.
- Deploying NVIDIA’s GPU Operator and configuring Kubernetes for high‑performance inter‑node communication.
- Practical examples using PyTorch Distributed Data Parallel (DDP) and Hugging Face Accelerate.
- Performance‑tuning techniques—network, storage, pod placement, mixed‑precision, and gradient checkpointing.
- Monitoring, autoscaling, and cost‑optimization strategies.
- A real‑world case study and a checklist of best practices.
By the end of this guide, you’ll have a reproducible, production‑ready setup that can train or serve multi‑billion‑parameter LLMs on Amazon EKS with optimal throughput and cost efficiency.
1. Background: Why Distributed GPU for LLMs?
1.1 Scale of Modern LLMs
| Model | Parameters | Typical GPU Memory Requirement (VRAM) |
|---|---|---|
| GPT‑3 175B | 175 B | > 80 GB (needs model parallelism) |
| LLaMA 13B | 13 B | ~ 24 GB (fits on a single A100) |
| BLOOM 176B | 176 B | > 80 GB (requires tensor‑parallelism) |
Training a model beyond 10 B parameters quickly exceeds the memory of a single GPU, necessitating model parallelism (splitting a single model across multiple GPUs) and data parallelism (replicating the model on multiple GPUs and aggregating gradients). Distributed training frameworks—PyTorch DDP, DeepSpeed, and Hugging Face Accelerate—rely on low‑latency inter‑connects (NVLink, InfiniBand, or high‑throughput Ethernet) to keep synchronization overhead minimal.
1.2 Challenges in the Cloud
- Network Latency & Bandwidth – GPU‑to‑GPU communication can become the bottleneck if the underlying network is sub‑optimal.
- Pod Scheduling – Ensuring that GPUs that need to talk to each other are placed on the same EC2 instance or within the same placement group.
- Resource Fragmentation – Over‑provisioning leads to wasted money; under‑provisioning throttles training speed.
- Observability – Distributed jobs generate large volumes of metrics and logs; centralizing them is essential for troubleshooting.
Amazon EKS, when combined with the NVIDIA GPU Operator, provides the necessary primitives to address these challenges while letting you focus on the model logic.
2. Amazon EKS Overview for GPU Workloads
2.1 Managed Control Plane
EKS abstracts away the Kubernetes control plane (API server, etcd, controller manager). AWS automatically patches and scales it, ensuring high availability. For GPU workloads you only need to manage worker nodes.
2.2 Integration Points
| Service | Role in GPU Workloads |
|---|---|
| Amazon EC2 | Provides GPU‑enabled instance families (p4, p5, g5, g6). |
| AWS VPC & Subnets | Enables placement groups for low‑latency networking. |
| IAM Roles for Service Accounts (IRSA) | Securely grants pods access to S3, EFS, or Secrets Manager. |
| Amazon CloudWatch & Container Insights | Collects logs and metrics from the cluster. |
| AWS Autoscaling | Scales node groups based on GPU utilization. |
2.3 GPU‑Ready Instance Types
| Instance | GPU | vCPU | Memory | Network |
|---|---|---|---|---|
| p4d.24xlarge | 8 × NVIDIA A100 40 GB | 96 | 1.1 TB | 400 Gbps (Elastic Fabric Adapter) |
| p5.48xlarge | 8 × NVIDIA H100 80 GB | 192 | 3.8 TB | 400 Gbps (EFA) |
| g5.16xlarge | 4 × NVIDIA A10G 24 GB | 64 | 256 GB | 100 Gbps |
| g6.12xlarge | 4 × NVIDIA RTX A6000 48 GB | 48 | 384 GB | 100 Gbps |
For multi‑node training, Elastic Fabric Adapter (EFA) provides the low‑latency, high‑throughput networking required for efficient collective operations (e.g., NCCL All‑Reduce).
3. Architecture for Distributed GPU on EKS
Below is a high‑level diagram of the recommended architecture:
+-------------------+ +-------------------+ +-------------------+
| EKS Control Plane| <----> | Managed Node | <----> | Managed Node |
| (API Server) | | Group (GPU) | | Group (GPU) |
+-------------------+ +-------------------+ +-------------------+
^ ^ ^ ^ ^ ^ ^
| | | | | | |
| +-------------------+ | | | +----------------+ | | |
+->| NVIDIA GPU |<---+ | +--->| NVIDIA GPU |<---+ |
Operator (Daemon) | Operator (Daemon) |
| |
+----------------------+ +----------------------+ +----------------------+
| Persistent Storage | | Monitoring (Prom) | | Model Artifact S3 |
| (EFS/EBS) | | & CloudWatch | | (Versioned) |
+----------------------+ +----------------------+ +----------------------+
- GPU Operator installs NVIDIA drivers, the NVIDIA Container Toolkit, and the device plugin automatically.
- EFA-enabled ENIs are attached to each node for high‑performance NCCL communication.
- Placement groups (cluster or spread) keep nodes physically close to reduce network hops.
- Shared storage (EFS or FSx for Lustre) holds model checkpoints and dataset shards.
- Prometheus + Grafana scrapes GPU metrics (
nvidia.com/gpu),container_cpu_usage_seconds_total, and NCCL counters.
4. Provisioning an EKS Cluster with GPU Nodes
4.1 Prerequisites
- AWS CLI v2
eksctl(>=0.140.0)kubectl- IAM permissions to create EKS clusters, EC2 instances, and IAM roles
4.2 Cluster Creation Script
#!/usr/bin/env bash
set -euo pipefail
CLUSTER_NAME="llm-gpu-eks"
REGION="us-east-1"
VERSION="1.30" # latest stable at time of writing
# Create the EKS control plane
eksctl create cluster \
--name $CLUSTER_NAME \
--region $REGION \
--version $VERSION \
--with-oidc \
--ssh-access \
--ssh-public-key ~/.ssh/id_rsa.pub \
--nodegroup-name cpu-ng \
--node-type m5.large \
--nodes 3 \
--managed
# Create a GPU node group with EFA enabled
eksctl create nodegroup \
--cluster $CLUSTER_NAME \
--region $REGION \
--name gpu-ng \
--node-type p4d.24xlarge \
--nodes 4 \
--managed \
--ssh-access \
--ssh-public-key ~/.ssh/id_rsa.pub \
--asg-access \
--instance-prefix gpu \
--efa-enabled \
--max-pods-per-node 110
Key flags explained
--with-oidc– Enables IAM roles for service accounts (IRSA).--efa-enabled– Adds an Elastic Fabric Adapter to each EC2 instance.--max-pods-per-node– Increases the pod limit to accommodate many training workers.
4.3 Verify GPU Nodes
kubectl get nodes -L karpenter.sh/capacity-type,karpenter.sh/nodepool
kubectl describe node <gpu-node-name> | grep -i nvidia
You should see nvidia.com/gpu: 8 under Allocatable.
5. Installing the NVIDIA GPU Operator
The GPU Operator automates driver installation, the device plugin, and the NVIDIA Container Toolkit.
# Add the Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install the operator in the kube-system namespace
helm install gpu-operator nvidia/gpu-operator \
--namespace kube-system \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set devicePlugin.enabled=true \
--set migManager.enabled=false \
--set validator.enabled=true
Post‑install validation
kubectl get pods -n kube-system -l app=nvidia-driver
kubectl get pods -n kube-system -l app=nvidia-device-plugin
kubectl describe ds nvidia-device-plugin-daemonset -n kube-system | grep "GPU"
All GPU nodes should now report nvidia.com/gpu resources.
6. Deploying a Distributed LLM Training Job
6.1 Choosing a Framework
- PyTorch DDP – Classic, low‑level API.
- DeepSpeed – Optimized for large models (ZeRO, offloading).
- Hugging Face Accelerate – Simplifies multi‑GPU/TPU setup.
For this guide we’ll use Hugging Face Accelerate with DeepSpeed integration, because it provides a concise YAML config and works out‑of‑the‑box on EKS.
6.2 Sample Training Script
Save the following as train_llm.py:
# train_llm.py
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from accelerate import Accelerator
def main():
accelerator = Accelerator()
model_name = os.getenv("MODEL_NAME", "EleutherAI/gpt-neox-20b")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load model in fp16
with accelerator.main_process_first():
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
# Dummy dataset – replace with real data loader
def dummy_data():
for _ in range(1000):
yield {"input_ids": tokenizer.encode("Hello world!", return_tensors="pt")[0]}
train_dataset = list(dummy_data())
training_args = TrainingArguments(
output_dir="/opt/ml/checkpoints",
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
learning_rate=2e-5,
fp16=True,
deepspeed="ds_config.json",
report_to="none",
dataloader_num_workers=2,
logging_steps=10,
save_steps=100,
max_steps=500,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
tokenizer=tokenizer,
)
trainer.train()
if __name__ == "__main__":
main()
Create a ds_config.json for DeepSpeed ZeRO‑3 offloading:
{
"zero_optimization": {
"stage": 3,
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
}
},
"gradient_clipping": 1.0,
"train_batch_size": 8,
"gradient_accumulation_steps": 4,
"fp16": {
"enabled": true
}
}
6.3 Containerizing the Training Code
Dockerfile:
FROM nvcr.io/nvidia/pytorch:24.01-py3
# Install accelerator & deepspeed
RUN pip install --no-cache-dir \
accelerate==0.34.0 \
deepspeed==0.14.0 \
transformers==4.44.0 \
datasets==3.0.0
WORKDIR /opt/ml
COPY train_llm.py ds_config.json ./
ENTRYPOINT ["python", "train_llm.py"]
Build and push to Amazon ECR:
aws ecr create-repository --repository-name llm-trainer
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=us-east-1
REPO_URI=${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/llm-trainer
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $REPO_URI
docker build -t $REPO_URI:latest .
docker push $REPO_URI:latest
6.4 Kubernetes Job Manifest
apiVersion: batch/v1
kind: Job
metadata:
name: llm-train-job
spec:
completions: 1
parallelism: 1
template:
spec:
restartPolicy: Never
serviceAccountName: llm-trainer-sa # created with IRSA for S3 access
containers:
- name: trainer
image: <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/llm-trainer:latest
resources:
limits:
nvidia.com/gpu: 8 # one p4d node = 8 A100 GPUs
requests:
nvidia.com/gpu: 8
env:
- name: MODEL_NAME
value: "EleutherAI/gpt-neox-20b"
- name: AWS_DEFAULT_REGION
value: "us-east-1"
volumeMounts:
- name: checkpoint
mountPath: /opt/ml/checkpoints
volumes:
- name: checkpoint
persistentVolumeClaim:
claimName: llm-checkpoint-pvc
Notes
parallelism: 1runs a single pod that utilizes all 8 GPUs on the node (single‑node multi‑GPU). For multi‑node training, setparallelismto the number of nodes and usetorchrunwith--nnodesand--node_rank.- The PVC can be backed by FSx for Lustre for high‑throughput checkpoint writes.
6.5 Multi‑Node Example (torchrun)
If you need to span 2 p4d nodes (16 GPUs), adjust the manifest:
apiVersion: batch/v1
kind: Job
metadata:
name: llm-train-multi
spec:
completions: 2
parallelism: 2
template:
spec:
restartPolicy: Never
containers:
- name: trainer
image: <ECR_URI>:latest
command: ["torchrun"]
args:
- "--nnodes=2"
- "--nproc_per_node=8"
- "--rdzv_id=llm-rdzv"
- "--rdzv_backend=c10d"
- "--rdzv_endpoint=$(MY_POD_IP):29500"
- "train_llm.py"
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
resources:
limits:
nvidia.com/gpu: 8
requests:
nvidia.com/gpu: 8
volumeMounts:
- name: checkpoint
mountPath: /opt/ml/checkpoints
volumes:
- name: checkpoint
persistentVolumeClaim:
claimName: llm-checkpoint-pvc
Important: Ensure the EFA security group allows traffic on the chosen rendezvous port (default 29500). Use a headless Service to expose the rendezvous address if you need stable DNS.
7. Performance Optimization Techniques
7.1 Network Tuning
| Setting | Recommendation |
|---|---|
| EFA | Use EFA-enabled instances (p4d, p5) and enable --rdzv_backend=c10d. |
| Placement Groups | Deploy GPU nodes in a cluster placement group to keep them on the same rack, reducing latency. |
| MTU | Set network MTU to 9000 (jumbo frames) on the ENIs for NCCL. |
| NCCL_DEBUG | Set NCCL_DEBUG=INFO and NCCL_SOCKET_IFNAME=eth0 to troubleshoot. |
7.2 Pod Placement & Affinity
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- {key: "app", operator: In, values: ["llm-trainer"]}
topologyKey: "kubernetes.io/hostname"
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- {key: "beta.kubernetes.io/instance-type", operator: In, values: ["p4d.24xlarge"]}
- Pod anti‑affinity forces each trainer pod onto a different node, preventing GPU contention.
- Node affinity guarantees placement on GPU‑enabled instances.
7.3 Mixed Precision & Gradient Checkpointing
- Automatic Mixed Precision (AMP) – Already enabled via
fp16=TrueinTrainingArguments. Reduces memory by ~2×. - Gradient Checkpointing – Add
model.gradient_checkpointing_enable()before training to trade compute for memory, allowing larger batch sizes.
model.gradient_checkpointing_enable()
7.4 Resource Requests & Limits
| Resource | Recommended Value |
|---|---|
nvidia.com/gpu | Exact count per node (e.g., 8) |
cpu | 2 * GPU count (e.g., 16) for data loading |
memory | 64Gi per node for large batch buffers |
ephemeral-storage | 100Gi for temporary checkpoint files |
Over‑requesting CPU can starve other system components; keep it tight.
7.5 Checkpoint Offloading
DeepSpeed’s ZeRO‑3 offloads parameters and optimizer states to host memory (or even NVMe). For p4d nodes with abundant DRAM, host offload works well. If you need more capacity, mount an NVMe local SSD as a hostPath volume and point offload_param.device to it.
"offload_param": {
"device": "nvme",
"pin_memory": true
}
7.6 Autoscaling GPU Nodes
Use Cluster Autoscaler with GPU support:
# configmap for cluster-autoscaler
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-autoscaler-aws-cluster-autoscaler
namespace: kube-system
data:
aws-use-static-instance-list: "false"
balance-similar-node-groups: "true"
expander: "least-waste"
Add node selector label karpenter.sh/capacity-type=spot to a spot GPU node group for cost savings. Combine with node taints (gpu:NoSchedule) so only training pods can consume spot GPUs.
8. Monitoring, Logging, and Debugging
8.1 Prometheus + Grafana
Deploy the kube‑prometheus‑stack Helm chart:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
Add a ServiceMonitor for the NVIDIA device plugin:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: nvidia-device-plugin
labels:
release: kube-prometheus
spec:
selector:
matchLabels:
app: nvidia-device-plugin
endpoints:
- port: metrics
interval: 30s
Create dashboards for:
nvidia_gpu_duty_cycle– GPU utilization.nvidia_gpu_memory_total– VRAM usage.nccl_all_reduce_time_seconds_total– NCCL collective latency.
8.2 CloudWatch Container Insights
Enable Container Insights on the EKS cluster:
aws eks update-cluster-config \
--region $REGION \
--name $CLUSTER_NAME \
--logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
Set up alarms for:
- GPU memory > 90% for > 5 mins.
- CPU throttling > 80%.
- Network throughput < 70% of EFA capacity.
8.3 Debugging NCCL Issues
Common error: NCCL WARN Call to socket failed: Connection reset by peer. Fixes:
- Verify security group allows inbound/outbound traffic on the NCCL port range (default 1‑65535, but you can limit via
NCCL_SOCKET_IFNAME). - Ensure EFA is attached to the ENI and the ENI is in the same subnet as other GPU nodes.
- Set
NCCL_DEBUG_SUBSYS=ALLto get detailed logs.
9. Cost‑Optimization Strategies
| Technique | How It Works | When to Use |
|---|---|---|
| Spot Instances | Up to 90% discount on unused EC2 capacity. | Non‑critical training, checkpoint‑able jobs. |
| Mixed‑Instance Node Groups | Combine on‑demand and spot for resilience. | Long‑running training where occasional pre‑emptions are acceptable. |
| Right‑Sizing | Choose p4de.24xlarge (A100 80 GB) only when model > 30 B parameters. | Smaller models (≤ 10 B) can run on g5.12xlarge. |
| Elastic Scaling | Scale node group based on GPU utilization > 70%. | Variable workloads, e.g., inference bursts. |
| Savings Plans | Commit to a consistent amount of compute (e.g., $/hour). | Predictable, steady training pipelines. |
| Checkpoint Offloading | Store checkpoints on S3 Glacier after each epoch. | Reduces EFS/FSx storage costs. |
Spot interruption handling – Use --max-retries in DeepSpeed and torch.distributed.elastic to automatically restart the job from the latest checkpoint.
10. Real‑World Case Study: Scaling LLaMA‑13B on EKS
Background
A fintech startup needed to fine‑tune LLaMA‑13B on a proprietary financial corpus (~150 GB). Their goals:
- Reduce training time from 48 h (single‑node) to < 12 h.
- Keep AWS spend under $8,000 for a single experiment.
- Maintain reproducibility across runs.
Solution Architecture
- Cluster – 3 × p4d.24xlarge nodes in a cluster placement group (24 GPUs total). Spot capacity was used for two nodes, on‑demand for the master node.
- Networking – EFA enabled, MTU set to 9000, security groups opened on port 29500.
- Storage – FSx for Lustre (2 TB) for fast checkpoint I/O; S3 for raw data.
- Software Stack – NVIDIA GPU Operator, DeepSpeed ZeRO‑3, Hugging Face Accelerate, custom
torchrunscript. - Autoscaling – Cluster Autoscaler with a target GPU utilization of 80%; added a fourth spot node after the first epoch to boost throughput.
- Monitoring – Grafana dashboard showed average NCCL All‑Reduce latency of 0.8 ms, GPU utilization at 92%.
Results
| Metric | Before (single‑node) | After (EKS multi‑node) |
|---|---|---|
| Total wall‑time | 48 h | 10.8 h |
| GPU cost | $6,500 | $7,800 (including spot discounts) |
| Checkpoint latency | 30 s | 5 s |
| Failure rate | 0 | 1 pre‑empted spot node (auto‑recovered) |
The experiment demonstrated that proper placement groups, EFA, and DeepSpeed ZeRO‑3 can cut training time by > 75% while staying within a modest budget.
11. Best‑Practice Checklist
- Cluster Setup
- Use EFA‑enabled GPU instances (
p4d,p5). - Deploy nodes in a cluster placement group.
- Enable IRSA for secure S3/EFS access.
- Use EFA‑enabled GPU instances (
- GPU Operator
- Verify
nvidia.com/gpuappears in node allocatable. - Enable
validatorto catch driver mismatches.
- Verify
- Networking
- Set MTU to 9000 on ENIs.
- Open NCCL ports (default 29500) in security groups.
- Pod Specs
- Pin exact GPU count in
limitsandrequests. - Add node affinity for GPU instance types.
- Use anti‑affinity to avoid GPU contention.
- Pin exact GPU count in
- Training Configuration
- Enable AMP (
fp16=True). - Turn on gradient checkpointing if memory‑bound.
- Use DeepSpeed ZeRO‑3 with host/offload to reduce VRAM pressure.
- Enable AMP (
- Storage
- Store checkpoints on FSx for Lustre or high‑throughput EBS.
- Archive old checkpoints to S3 Glacier.
- Monitoring
- Install Prometheus + Grafana with NVIDIA metrics.
- Set CloudWatch alarms for GPU utilization and NCCL latency.
- Autoscaling & Cost
- Enable Cluster Autoscaler with GPU metrics.
- Mix spot and on‑demand nodes.
- Apply Savings Plans for predictable workloads.
- Reliability
- Use torch.distributed.elastic for automatic restart.
- Persist checkpoints frequently (every N steps).
Conclusion
Training and serving large language models at scale is no longer the exclusive domain of on‑premise GPU farms. Amazon EKS, combined with NVIDIA’s GPU Operator, Elastic Fabric Adapter, and modern distributed training frameworks, offers a robust, flexible, and cost‑effective platform for LLM workloads.
By following the architecture, provisioning steps, and optimization techniques outlined in this guide, you can:
- Achieve low‑latency GPU‑to‑GPU communication via EFA and placement groups.
- Maximize GPU utilization through mixed‑precision, gradient checkpointing, and DeepSpeed’s ZeRO‑3.
- Scale elastically while controlling spend via spot instances and autoscaling.
- Maintain observability and reliability with Prometheus, CloudWatch, and fault‑tolerant job orchestration.
Whether you are fine‑tuning a 13 B model or training a 200 B LLM, the principles here provide a solid foundation for building production‑grade pipelines on AWS. Happy training!
Resources
- Amazon EKS Documentation – Official guide to creating and managing EKS clusters.
- NVIDIA GPU Operator Helm Chart – Install and configure NVIDIA drivers, device plugin, and toolkit on Kubernetes.
- DeepSpeed ZeRO Documentation – In‑depth guide on optimizer state partitioning and offloading.
- Elastic Fabric Adapter (EFA) Overview – Details on low‑latency networking for HPC and ML.
- Hugging Face Accelerate Documentation – Simplify multi‑GPU/TPU training with a unified API.