Introduction

Large Language Models (LLMs) such as GPT‑4, LLaMA, and BLOOM have transformed natural‑language processing, but training and serving them at scale demands massive GPU resources, high‑speed networking, and sophisticated orchestration. Amazon Elastic Kubernetes Service (EKS) provides a managed, production‑grade Kubernetes platform that can run distributed GPU workloads, while integrating tightly with AWS services for security, observability, and cost management.

This article walks you through end‑to‑end optimization of distributed GPU workloads for LLMs on Amazon EKS. We’ll cover:

  • The architectural foundations of distributed training on GPUs.
  • How to provision an EKS cluster with GPU‑enabled node groups.
  • Deploying NVIDIA’s GPU Operator and configuring Kubernetes for high‑performance inter‑node communication.
  • Practical examples using PyTorch Distributed Data Parallel (DDP) and Hugging Face Accelerate.
  • Performance‑tuning techniques—network, storage, pod placement, mixed‑precision, and gradient checkpointing.
  • Monitoring, autoscaling, and cost‑optimization strategies.
  • A real‑world case study and a checklist of best practices.

By the end of this guide, you’ll have a reproducible, production‑ready setup that can train or serve multi‑billion‑parameter LLMs on Amazon EKS with optimal throughput and cost efficiency.


1. Background: Why Distributed GPU for LLMs?

1.1 Scale of Modern LLMs

ModelParametersTypical GPU Memory Requirement (VRAM)
GPT‑3 175B175 B> 80 GB (needs model parallelism)
LLaMA 13B13 B~ 24 GB (fits on a single A100)
BLOOM 176B176 B> 80 GB (requires tensor‑parallelism)

Training a model beyond 10 B parameters quickly exceeds the memory of a single GPU, necessitating model parallelism (splitting a single model across multiple GPUs) and data parallelism (replicating the model on multiple GPUs and aggregating gradients). Distributed training frameworks—PyTorch DDP, DeepSpeed, and Hugging Face Accelerate—rely on low‑latency inter‑connects (NVLink, InfiniBand, or high‑throughput Ethernet) to keep synchronization overhead minimal.

1.2 Challenges in the Cloud

  • Network Latency & Bandwidth – GPU‑to‑GPU communication can become the bottleneck if the underlying network is sub‑optimal.
  • Pod Scheduling – Ensuring that GPUs that need to talk to each other are placed on the same EC2 instance or within the same placement group.
  • Resource Fragmentation – Over‑provisioning leads to wasted money; under‑provisioning throttles training speed.
  • Observability – Distributed jobs generate large volumes of metrics and logs; centralizing them is essential for troubleshooting.

Amazon EKS, when combined with the NVIDIA GPU Operator, provides the necessary primitives to address these challenges while letting you focus on the model logic.


2. Amazon EKS Overview for GPU Workloads

2.1 Managed Control Plane

EKS abstracts away the Kubernetes control plane (API server, etcd, controller manager). AWS automatically patches and scales it, ensuring high availability. For GPU workloads you only need to manage worker nodes.

2.2 Integration Points

ServiceRole in GPU Workloads
Amazon EC2Provides GPU‑enabled instance families (p4, p5, g5, g6).
AWS VPC & SubnetsEnables placement groups for low‑latency networking.
IAM Roles for Service Accounts (IRSA)Securely grants pods access to S3, EFS, or Secrets Manager.
Amazon CloudWatch & Container InsightsCollects logs and metrics from the cluster.
AWS AutoscalingScales node groups based on GPU utilization.

2.3 GPU‑Ready Instance Types

InstanceGPUvCPUMemoryNetwork
p4d.24xlarge8 × NVIDIA A100 40 GB961.1 TB400 Gbps (Elastic Fabric Adapter)
p5.48xlarge8 × NVIDIA H100 80 GB1923.8 TB400 Gbps (EFA)
g5.16xlarge4 × NVIDIA A10G 24 GB64256 GB100 Gbps
g6.12xlarge4 × NVIDIA RTX A6000 48 GB48384 GB100 Gbps

For multi‑node training, Elastic Fabric Adapter (EFA) provides the low‑latency, high‑throughput networking required for efficient collective operations (e.g., NCCL All‑Reduce).


3. Architecture for Distributed GPU on EKS

Below is a high‑level diagram of the recommended architecture:

+-------------------+        +-------------------+        +-------------------+
|  EKS Control Plane| <----> |   Managed Node    | <----> |   Managed Node    |
|   (API Server)    |        |  Group (GPU)      |        |  Group (GPU)      |
+-------------------+        +-------------------+        +-------------------+
        ^                           ^  ^  ^                     ^  ^  ^
        |                           |  |  |                     |  |  |
        |  +-------------------+    |  |  |    +----------------+ |  |  |
        +->|  NVIDIA GPU       |<---+  |  +--->|  NVIDIA GPU    |<---+  |
            Operator (Daemon)      |         Operator (Daemon)       |
                                   |                                 |
+----------------------+   +----------------------+   +----------------------+
|  Persistent Storage  |   |  Monitoring (Prom)  |   |  Model Artifact S3   |
|  (EFS/EBS)           |   |  & CloudWatch       |   |  (Versioned)         |
+----------------------+   +----------------------+   +----------------------+
  • GPU Operator installs NVIDIA drivers, the NVIDIA Container Toolkit, and the device plugin automatically.
  • EFA-enabled ENIs are attached to each node for high‑performance NCCL communication.
  • Placement groups (cluster or spread) keep nodes physically close to reduce network hops.
  • Shared storage (EFS or FSx for Lustre) holds model checkpoints and dataset shards.
  • Prometheus + Grafana scrapes GPU metrics (nvidia.com/gpu), container_cpu_usage_seconds_total, and NCCL counters.

4. Provisioning an EKS Cluster with GPU Nodes

4.1 Prerequisites

  • AWS CLI v2
  • eksctl (>=0.140.0)
  • kubectl
  • IAM permissions to create EKS clusters, EC2 instances, and IAM roles

4.2 Cluster Creation Script

#!/usr/bin/env bash
set -euo pipefail

CLUSTER_NAME="llm-gpu-eks"
REGION="us-east-1"
VERSION="1.30"   # latest stable at time of writing

# Create the EKS control plane
eksctl create cluster \
  --name $CLUSTER_NAME \
  --region $REGION \
  --version $VERSION \
  --with-oidc \
  --ssh-access \
  --ssh-public-key ~/.ssh/id_rsa.pub \
  --nodegroup-name cpu-ng \
  --node-type m5.large \
  --nodes 3 \
  --managed

# Create a GPU node group with EFA enabled
eksctl create nodegroup \
  --cluster $CLUSTER_NAME \
  --region $REGION \
  --name gpu-ng \
  --node-type p4d.24xlarge \
  --nodes 4 \
  --managed \
  --ssh-access \
  --ssh-public-key ~/.ssh/id_rsa.pub \
  --asg-access \
  --instance-prefix gpu \
  --efa-enabled \
  --max-pods-per-node 110

Key flags explained

  • --with-oidc – Enables IAM roles for service accounts (IRSA).
  • --efa-enabled – Adds an Elastic Fabric Adapter to each EC2 instance.
  • --max-pods-per-node – Increases the pod limit to accommodate many training workers.

4.3 Verify GPU Nodes

kubectl get nodes -L karpenter.sh/capacity-type,karpenter.sh/nodepool
kubectl describe node <gpu-node-name> | grep -i nvidia

You should see nvidia.com/gpu: 8 under Allocatable.


5. Installing the NVIDIA GPU Operator

The GPU Operator automates driver installation, the device plugin, and the NVIDIA Container Toolkit.

# Add the Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the operator in the kube-system namespace
helm install gpu-operator nvidia/gpu-operator \
  --namespace kube-system \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set migManager.enabled=false \
  --set validator.enabled=true

Post‑install validation

kubectl get pods -n kube-system -l app=nvidia-driver
kubectl get pods -n kube-system -l app=nvidia-device-plugin
kubectl describe ds nvidia-device-plugin-daemonset -n kube-system | grep "GPU"

All GPU nodes should now report nvidia.com/gpu resources.


6. Deploying a Distributed LLM Training Job

6.1 Choosing a Framework

  • PyTorch DDP – Classic, low‑level API.
  • DeepSpeed – Optimized for large models (ZeRO, offloading).
  • Hugging Face Accelerate – Simplifies multi‑GPU/TPU setup.

For this guide we’ll use Hugging Face Accelerate with DeepSpeed integration, because it provides a concise YAML config and works out‑of‑the‑box on EKS.

6.2 Sample Training Script

Save the following as train_llm.py:

# train_llm.py
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from accelerate import Accelerator

def main():
    accelerator = Accelerator()
    model_name = os.getenv("MODEL_NAME", "EleutherAI/gpt-neox-20b")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Load model in fp16
    with accelerator.main_process_first():
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    # Dummy dataset – replace with real data loader
    def dummy_data():
        for _ in range(1000):
            yield {"input_ids": tokenizer.encode("Hello world!", return_tensors="pt")[0]}

    train_dataset = list(dummy_data())

    training_args = TrainingArguments(
        output_dir="/opt/ml/checkpoints",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        fp16=True,
        deepspeed="ds_config.json",
        report_to="none",
        dataloader_num_workers=2,
        logging_steps=10,
        save_steps=100,
        max_steps=500,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        tokenizer=tokenizer,
    )

    trainer.train()

if __name__ == "__main__":
    main()

Create a ds_config.json for DeepSpeed ZeRO‑3 offloading:

{
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "gradient_clipping": 1.0,
  "train_batch_size": 8,
  "gradient_accumulation_steps": 4,
  "fp16": {
    "enabled": true
  }
}

6.3 Containerizing the Training Code

Dockerfile:

FROM nvcr.io/nvidia/pytorch:24.01-py3

# Install accelerator & deepspeed
RUN pip install --no-cache-dir \
    accelerate==0.34.0 \
    deepspeed==0.14.0 \
    transformers==4.44.0 \
    datasets==3.0.0

WORKDIR /opt/ml
COPY train_llm.py ds_config.json ./

ENTRYPOINT ["python", "train_llm.py"]

Build and push to Amazon ECR:

aws ecr create-repository --repository-name llm-trainer
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=us-east-1
REPO_URI=${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/llm-trainer

aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $REPO_URI
docker build -t $REPO_URI:latest .
docker push $REPO_URI:latest

6.4 Kubernetes Job Manifest

apiVersion: batch/v1
kind: Job
metadata:
  name: llm-train-job
spec:
  completions: 1
  parallelism: 1
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: llm-trainer-sa   # created with IRSA for S3 access
      containers:
        - name: trainer
          image: <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/llm-trainer:latest
          resources:
            limits:
              nvidia.com/gpu: 8          # one p4d node = 8 A100 GPUs
            requests:
              nvidia.com/gpu: 8
          env:
            - name: MODEL_NAME
              value: "EleutherAI/gpt-neox-20b"
            - name: AWS_DEFAULT_REGION
              value: "us-east-1"
          volumeMounts:
            - name: checkpoint
              mountPath: /opt/ml/checkpoints
      volumes:
        - name: checkpoint
          persistentVolumeClaim:
            claimName: llm-checkpoint-pvc

Notes

  • parallelism: 1 runs a single pod that utilizes all 8 GPUs on the node (single‑node multi‑GPU). For multi‑node training, set parallelism to the number of nodes and use torchrun with --nnodes and --node_rank.
  • The PVC can be backed by FSx for Lustre for high‑throughput checkpoint writes.

6.5 Multi‑Node Example (torchrun)

If you need to span 2 p4d nodes (16 GPUs), adjust the manifest:

apiVersion: batch/v1
kind: Job
metadata:
  name: llm-train-multi
spec:
  completions: 2
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: <ECR_URI>:latest
          command: ["torchrun"]
          args:
            - "--nnodes=2"
            - "--nproc_per_node=8"
            - "--rdzv_id=llm-rdzv"
            - "--rdzv_backend=c10d"
            - "--rdzv_endpoint=$(MY_POD_IP):29500"
            - "train_llm.py"
          env:
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          resources:
            limits:
              nvidia.com/gpu: 8
            requests:
              nvidia.com/gpu: 8
          volumeMounts:
            - name: checkpoint
              mountPath: /opt/ml/checkpoints
      volumes:
        - name: checkpoint
          persistentVolumeClaim:
            claimName: llm-checkpoint-pvc

Important: Ensure the EFA security group allows traffic on the chosen rendezvous port (default 29500). Use a headless Service to expose the rendezvous address if you need stable DNS.


7. Performance Optimization Techniques

7.1 Network Tuning

SettingRecommendation
EFAUse EFA-enabled instances (p4d, p5) and enable --rdzv_backend=c10d.
Placement GroupsDeploy GPU nodes in a cluster placement group to keep them on the same rack, reducing latency.
MTUSet network MTU to 9000 (jumbo frames) on the ENIs for NCCL.
NCCL_DEBUGSet NCCL_DEBUG=INFO and NCCL_SOCKET_IFNAME=eth0 to troubleshoot.

7.2 Pod Placement & Affinity

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - {key: "app", operator: In, values: ["llm-trainer"]}
        topologyKey: "kubernetes.io/hostname"
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - {key: "beta.kubernetes.io/instance-type", operator: In, values: ["p4d.24xlarge"]}
  • Pod anti‑affinity forces each trainer pod onto a different node, preventing GPU contention.
  • Node affinity guarantees placement on GPU‑enabled instances.

7.3 Mixed Precision & Gradient Checkpointing

  • Automatic Mixed Precision (AMP) – Already enabled via fp16=True in TrainingArguments. Reduces memory by ~2×.
  • Gradient Checkpointing – Add model.gradient_checkpointing_enable() before training to trade compute for memory, allowing larger batch sizes.
model.gradient_checkpointing_enable()

7.4 Resource Requests & Limits

ResourceRecommended Value
nvidia.com/gpuExact count per node (e.g., 8)
cpu2 * GPU count (e.g., 16) for data loading
memory64Gi per node for large batch buffers
ephemeral-storage100Gi for temporary checkpoint files

Over‑requesting CPU can starve other system components; keep it tight.

7.5 Checkpoint Offloading

DeepSpeed’s ZeRO‑3 offloads parameters and optimizer states to host memory (or even NVMe). For p4d nodes with abundant DRAM, host offload works well. If you need more capacity, mount an NVMe local SSD as a hostPath volume and point offload_param.device to it.

"offload_param": {
  "device": "nvme",
  "pin_memory": true
}

7.6 Autoscaling GPU Nodes

Use Cluster Autoscaler with GPU support:

# configmap for cluster-autoscaler
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-aws-cluster-autoscaler
  namespace: kube-system
data:
  aws-use-static-instance-list: "false"
  balance-similar-node-groups: "true"
  expander: "least-waste"

Add node selector label karpenter.sh/capacity-type=spot to a spot GPU node group for cost savings. Combine with node taints (gpu:NoSchedule) so only training pods can consume spot GPUs.


8. Monitoring, Logging, and Debugging

8.1 Prometheus + Grafana

Deploy the kube‑prometheus‑stack Helm chart:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false

Add a ServiceMonitor for the NVIDIA device plugin:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-device-plugin
  labels:
    release: kube-prometheus
spec:
  selector:
    matchLabels:
      app: nvidia-device-plugin
  endpoints:
    - port: metrics
      interval: 30s

Create dashboards for:

  • nvidia_gpu_duty_cycle – GPU utilization.
  • nvidia_gpu_memory_total – VRAM usage.
  • nccl_all_reduce_time_seconds_total – NCCL collective latency.

8.2 CloudWatch Container Insights

Enable Container Insights on the EKS cluster:

aws eks update-cluster-config \
  --region $REGION \
  --name $CLUSTER_NAME \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'

Set up alarms for:

  • GPU memory > 90% for > 5 mins.
  • CPU throttling > 80%.
  • Network throughput < 70% of EFA capacity.

8.3 Debugging NCCL Issues

Common error: NCCL WARN Call to socket failed: Connection reset by peer. Fixes:

  1. Verify security group allows inbound/outbound traffic on the NCCL port range (default 1‑65535, but you can limit via NCCL_SOCKET_IFNAME).
  2. Ensure EFA is attached to the ENI and the ENI is in the same subnet as other GPU nodes.
  3. Set NCCL_DEBUG_SUBSYS=ALL to get detailed logs.

9. Cost‑Optimization Strategies

TechniqueHow It WorksWhen to Use
Spot InstancesUp to 90% discount on unused EC2 capacity.Non‑critical training, checkpoint‑able jobs.
Mixed‑Instance Node GroupsCombine on‑demand and spot for resilience.Long‑running training where occasional pre‑emptions are acceptable.
Right‑SizingChoose p4de.24xlarge (A100 80 GB) only when model > 30 B parameters.Smaller models (≤ 10 B) can run on g5.12xlarge.
Elastic ScalingScale node group based on GPU utilization > 70%.Variable workloads, e.g., inference bursts.
Savings PlansCommit to a consistent amount of compute (e.g., $/hour).Predictable, steady training pipelines.
Checkpoint OffloadingStore checkpoints on S3 Glacier after each epoch.Reduces EFS/FSx storage costs.

Spot interruption handling – Use --max-retries in DeepSpeed and torch.distributed.elastic to automatically restart the job from the latest checkpoint.


10. Real‑World Case Study: Scaling LLaMA‑13B on EKS

Background
A fintech startup needed to fine‑tune LLaMA‑13B on a proprietary financial corpus (~150 GB). Their goals:

  • Reduce training time from 48 h (single‑node) to < 12 h.
  • Keep AWS spend under $8,000 for a single experiment.
  • Maintain reproducibility across runs.

Solution Architecture

  1. Cluster – 3 × p4d.24xlarge nodes in a cluster placement group (24 GPUs total). Spot capacity was used for two nodes, on‑demand for the master node.
  2. Networking – EFA enabled, MTU set to 9000, security groups opened on port 29500.
  3. Storage – FSx for Lustre (2 TB) for fast checkpoint I/O; S3 for raw data.
  4. Software Stack – NVIDIA GPU Operator, DeepSpeed ZeRO‑3, Hugging Face Accelerate, custom torchrun script.
  5. Autoscaling – Cluster Autoscaler with a target GPU utilization of 80%; added a fourth spot node after the first epoch to boost throughput.
  6. Monitoring – Grafana dashboard showed average NCCL All‑Reduce latency of 0.8 ms, GPU utilization at 92%.

Results

MetricBefore (single‑node)After (EKS multi‑node)
Total wall‑time48 h10.8 h
GPU cost$6,500$7,800 (including spot discounts)
Checkpoint latency30 s5 s
Failure rate01 pre‑empted spot node (auto‑recovered)

The experiment demonstrated that proper placement groups, EFA, and DeepSpeed ZeRO‑3 can cut training time by > 75% while staying within a modest budget.


11. Best‑Practice Checklist

  • Cluster Setup
    • Use EFA‑enabled GPU instances (p4d, p5).
    • Deploy nodes in a cluster placement group.
    • Enable IRSA for secure S3/EFS access.
  • GPU Operator
    • Verify nvidia.com/gpu appears in node allocatable.
    • Enable validator to catch driver mismatches.
  • Networking
    • Set MTU to 9000 on ENIs.
    • Open NCCL ports (default 29500) in security groups.
  • Pod Specs
    • Pin exact GPU count in limits and requests.
    • Add node affinity for GPU instance types.
    • Use anti‑affinity to avoid GPU contention.
  • Training Configuration
    • Enable AMP (fp16=True).
    • Turn on gradient checkpointing if memory‑bound.
    • Use DeepSpeed ZeRO‑3 with host/offload to reduce VRAM pressure.
  • Storage
    • Store checkpoints on FSx for Lustre or high‑throughput EBS.
    • Archive old checkpoints to S3 Glacier.
  • Monitoring
    • Install Prometheus + Grafana with NVIDIA metrics.
    • Set CloudWatch alarms for GPU utilization and NCCL latency.
  • Autoscaling & Cost
    • Enable Cluster Autoscaler with GPU metrics.
    • Mix spot and on‑demand nodes.
    • Apply Savings Plans for predictable workloads.
  • Reliability
    • Use torch.distributed.elastic for automatic restart.
    • Persist checkpoints frequently (every N steps).

Conclusion

Training and serving large language models at scale is no longer the exclusive domain of on‑premise GPU farms. Amazon EKS, combined with NVIDIA’s GPU Operator, Elastic Fabric Adapter, and modern distributed training frameworks, offers a robust, flexible, and cost‑effective platform for LLM workloads.

By following the architecture, provisioning steps, and optimization techniques outlined in this guide, you can:

  • Achieve low‑latency GPU‑to‑GPU communication via EFA and placement groups.
  • Maximize GPU utilization through mixed‑precision, gradient checkpointing, and DeepSpeed’s ZeRO‑3.
  • Scale elastically while controlling spend via spot instances and autoscaling.
  • Maintain observability and reliability with Prometheus, CloudWatch, and fault‑tolerant job orchestration.

Whether you are fine‑tuning a 13 B model or training a 200 B LLM, the principles here provide a solid foundation for building production‑grade pipelines on AWS. Happy training!


Resources