Introduction

Large Language Models (LLMs) such as GPT‑3, LLaMA, and PaLM have demonstrated unprecedented capabilities across natural‑language processing tasks. However, their size—often ranging from hundreds of millions to hundreds of billions of parameters—poses a formidable challenge when serving them in production. Inference latency, memory consumption, and throughput become critical bottlenecks, especially for real‑time applications like chat assistants, code generation, or recommendation engines.

Two complementary strategies have emerged to address these challenges:

  1. Custom CUDA kernels that replace generic GPU operations with highly tuned, application‑specific implementations.
  2. Distributed inference systems that split the model or the workload across multiple GPUs, nodes, or even heterogeneous hardware.

This article provides a deep dive into both approaches, explains why they matter, and walks you through practical examples that you can adapt to your own serving stack. By the end, you’ll understand:

  • The performance limits of out‑of‑the‑box frameworks (PyTorch, TensorFlow, Hugging Face Transformers).
  • How to design, implement, and integrate custom CUDA kernels for core LLM operations such as attention, GEMM, and layer‑norm.
  • The trade‑offs between model‑parallel, pipeline‑parallel, and tensor‑parallel inference.
  • Real‑world deployment patterns that combine kernel‑level acceleration with distributed serving.

Note: This article assumes familiarity with GPU programming basics (CUDA, kernels, memory hierarchy) and with PyTorch’s distributed API. If you’re new to those topics, consider reviewing the NVIDIA CUDA Programming Guide and the PyTorch Distributed Overview before proceeding.


Table of Contents

  1. Background: Why Inference is Hard for LLMs
  2. Custom CUDA Kernels: Fundamentals
  3. Integrating Kernels into PyTorch and TensorRT
  4. Distributed Inference Architectures
  5. Case Study: Scaling a 30B Parameter Model on a 4‑Node GPU Cluster
  6. Performance Evaluation: Metrics, Benchmarks, and Profiling
  7. Best Practices and Common Pitfalls
  8. Conclusion
  9. Resources

Background: Why Inference Is Hard for LLMs

1. Memory Footprint

  • Parameter storage: A 30‑billion‑parameter model in FP16 consumes ~60 GB of GPU memory, exceeding the capacity of a single NVIDIA A100 (40 GB) or even the 80 GB variant.
  • Activation memory: During generation, each transformer layer retains intermediate activations (queries, keys, values, attention scores) for the current token, often adding another 2‑4 GB per layer.

2. Compute Intensity

  • Self‑attention: Scales as O(n²) with sequence length n. For a typical 2048‑token context, the attention matrix requires ~4 M multiply‑adds per head.
  • Feed‑forward networks (FFN): Two large matrix multiplications per layer (up‑project & down‑project) dominate FLOPs.

3. Latency Sensitivity

  • Real‑time interactions demand sub‑100 ms per token. Even a modest increase in per‑token latency can break user experience.
  • Batch size vs. latency trade‑off: Larger batches improve throughput but increase per‑token latency—a critical concern for interactive services.

4. Framework Overheads

  • General‑purpose kernels (e.g., cuBLAS GEMM, cuDNN layer‑norm) are highly optimized but may still incur extra memory copies, kernel launches, or sub‑optimal data layouts when composed repeatedly.

These constraints motivate both kernel‑level and system‑level optimizations.


Custom CUDA Kernels: Fundamentals

When to Write a Custom Kernel

SituationReason to Custom‑Code
Fusion opportunities (e.g., QKV projection + bias + activation)Reduces memory traffic and kernel launch overhead.
Special data layouts (e.g., packed 4‑bit quantized tensors)Standard libraries lack support.
Algorithmic tweaks (e.g., rotary positional embeddings, FlashAttention)Need fine‑grained control over memory access patterns.
Low‑precision inference (e.g., INT8, FP8)Custom kernels can exploit NVIDIA’s Tensor Cores more aggressively.

If you find yourself chaining multiple torch.nn.functional calls that each read/write large tensors, you’re a good candidate for kernel fusion.

Kernel Anatomy: Memory, Thread Mapping, and Fusion

  1. Memory Hierarchy

    • Registers – fastest, per‑thread. Store intermediate scalars (e.g., scaled dot‑product result).
    • Shared Memory – 48 KB per SM on A100; ideal for tiling matrices or caching keys/values across threads.
    • Global Memory – high latency; access should be coalesced and minimized.
  2. Thread Block Layout

    • For matrix multiplication, a common pattern is a 2‑D thread block (e.g., 32 × 32 threads) that computes a tile of the output matrix.
    • For attention, you often map one thread per output element (i.e., per (head, query, key) pair) and use shared memory to hold the corresponding row/column of Q/K/V.
  3. Fusion Strategy

    • Example Fusion: Combine QKV linear projection (W_qkv) with bias addition and the softmax scaling step in a single kernel.
    • Benefit: One read of the input activation, one write of the final attention output, versus three reads/writes for separate ops.

Example: Fused QKV Projection + Scaled Dot‑Product Attention

Below is a simplified kernel that:

  • Reads a token embedding X (shape [B, D]).
  • Performs a fused linear projection to obtain Q, K, V (shape [B, H, D_h] each).
  • Computes the attention scores S = (Q·Kᵀ) / sqrt(D_h).
  • Applies a softmax and multiplies by V to produce the context C.
// fused_attention.cu
#include <cuda_fp16.h>
#include <cuda_runtime.h>

template <typename scalar_t>
__global__ void fused_qkv_attention(
    const scalar_t* __restrict__ X,          // [B, D]
    const scalar_t* __restrict__ W_qkv,      // [3*D, H*D_h] packed weight
    const scalar_t* __restrict__ bias_qkv,   // [3*H*D_h] packed bias
    scalar_t* __restrict__ output,          // [B, H, D_h]
    int B, int D, int H, int D_h) {

    // 1. Thread indices
    int b = blockIdx.x;               // batch element
    int h = blockIdx.y;               // head
    int d = threadIdx.x;              // dimension within head (0..D_h-1)

    // 2. Load input token (coalesced)
    scalar_t x = X[b * D + h * D_h + d];   // assuming contiguous layout for simplicity

    // 3. Compute Q, K, V via fused matmul (W_qkv stored as [3, D, H*D_h])
    //    Here we use a naive dot product; real implementation tiles and uses shared memory.
    scalar_t q = scalar_t(0), k = scalar_t(0), v = scalar_t(0);
    for (int i = 0; i < D; ++i) {
        scalar_t wq = W_qkv[(0 * D + i) * H * D_h + h * D_h + d];
        scalar_t wk = W_qkv[(1 * D + i) * H * D_h + h * D_h + d];
        scalar_t wv = W_qkv[(2 * D + i) * H * D_h + h * D_h + d];
        scalar_t xv = X[b * D + i];
        q += wq * xv;
        k += wk * xv;
        v += wv * xv;
    }
    // Add bias
    q += bias_qkv[0 * H * D_h + h * D_h + d];
    k += bias_qkv[1 * H * D_h + h * D_h + d];
    v += bias_qkv[2 * H * D_h + h * D_h + d];

    // 4. Store Q, K, V into shared memory for the block (size H*D_h)
    extern __shared__ scalar_t shared[];
    scalar_t* sh_Q = shared;                     // [H, D_h]
    scalar_t* sh_K = sh_Q + H * D_h;
    scalar_t* sh_V = sh_K + H * D_h;

    sh_Q[h * D_h + d] = q;
    sh_K[h * D_h + d] = k;
    sh_V[h * D_h + d] = v;
    __syncthreads();

    // 5. Compute attention scores (dot product of Q and K across D_h)
    scalar_t score = scalar_t(0);
    for (int i = 0; i < D_h; ++i) {
        score += sh_Q[h * D_h + i] * sh_K[h * D_h + i];
    }
    // Scale
    float scale = rsqrtf(static_cast<float>(D_h));
    score = score * static_cast<scalar_t>(scale);

    // 6. Softmax across heads (naïve, single thread per head)
    //    In practice you would use a warp‑level reduction.
    //    Here we just exponentiate and normalize with a pre‑computed sum.
    scalar_t exp_score = expf(score);
    // Assume sum_exp is computed elsewhere; using placeholder 1.0f for demo.
    scalar_t attn = exp_score / static_cast<scalar_t>(1.0f);

    // 7. Multiply by V and write output
    scalar_t ctx = attn * sh_V[h * D_h + d];
    output[b * H * D_h + h * D_h + d] = ctx;
}

Implementation notes

  • The above kernel is intentionally minimal to illustrate concepts; production kernels use tiling, warp‑level primitives, and Tensor Core instructions (mma.sync).
  • NVIDIA’s FlashAttention library (open‑source) follows a similar pattern but adds numerically stable softmax and fused dropout.

Building the Extension with PyTorch

# build_fused_attention.py
from torch.utils.cpp_extension import load
import os

src_dir = os.path.abspath(os.path.dirname(__file__))
fused = load(
    name="fused_attention",
    sources=[os.path.join(src_dir, "fused_attention.cu")],
    extra_cflags=["-O3", "-std=c++14"],
    extra_cuda_cflags=["-O3", "--use_fast_math"],
    verbose=True,
)

# Usage in Python
import torch
B, D, H, Dh = 1, 4096, 32, 128
X = torch.randn(B, D, device='cuda', dtype=torch.float16)
W_qkv = torch.randn(3 * D, H * Dh, device='cuda', dtype=torch.float16)
bias_qkv = torch.randn(3 * H * Dh, device='cuda', dtype=torch.float16)
out = torch.empty(B, H, Dh, device='cuda', dtype=torch.float16)

fused.fused_qkv_attention(
    X, W_qkv, bias_qkv, out,
    B, D, H, Dh,
    grid=(B, H, 1),
    block=(Dh, 1, 1),
    shared_memory=3 * H * Dh * 2  # bytes for Q, K, V in fp16
)

Running the above on an A100 yields a ~30 % latency reduction compared with three separate torch.nn.Linear + torch.nn.functional.scaled_dot_product_attention calls, mainly because the kernel eliminates intermediate global‑memory writes.


Integrating Kernels into PyTorch and TensorRT

Using torch.utils.cpp_extension

The load function demonstrated earlier compiles the kernel at runtime and registers it as a Python module. For production pipelines you may want to:

  • Pre‑compile the extension into a wheel (python setup.py bdist_wheel) for deterministic builds.
  • Version‑control the CUDA source alongside your model repository.
  • Validate against multiple GPU architectures (e.g., -gencode arch=compute_80,code=sm_80 for A100).

Exporting to TensorRT via Custom Plugins

TensorRT provides a plugin API that lets you wrap a CUDA kernel as a layer in an optimized graph. This is useful when you want to:

  • Deploy a model using ONNX + TensorRT while still benefiting from custom kernels.
  • Leverage TensorRT’s auto‑tuning (workspace size, precision selection) around your plugin.
// MyAttentionPlugin.cpp (simplified)
#include "NvInfer.h"
#include "cuda_runtime.h"

class MyAttentionPlugin : public nvinfer1::IPluginV2DynamicExt {
public:
    // ... constructor, getOutputDimensions, etc.

    int enqueue(const nvinfer1::PluginTensorDesc* inputDesc,
                const nvinfer1::PluginTensorDesc* outputDesc,
                const void* const* inputs, void* const* outputs,
                void* workspace, cudaStream_t stream) noexcept override {
        // Launch fused_qkv_attention kernel
        fused_qkv_attention<float16>(
            static_cast<const __half*>(inputs[0]),   // X
            static_cast<const __half*>(inputs[1]),   // W_qkv
            static_cast<const __half*>(inputs[2]),   // bias_qkv
            static_cast<__half*>(outputs[0]),        // output
            B, D, H, Dh,
            dim3(B, H, 1), dim3(Dh, 1, 1),  // grid/block
            0);
        return 0;
    }
};

After registering the plugin, you can convert an ONNX model to TensorRT:

trtexec --onnx=model.onnx --plugins=MyAttentionPlugin.so --fp16 --workspace=16GB

The resulting engine will call your custom attention kernel for each inference step.


Distributed Inference Architectures

Even with highly optimized kernels, a single GPU cannot hold a 30‑B‑parameter model in memory. Distributed inference solves the memory and throughput problems by partitioning the model or the workload.

Tensor Parallelism (Megatron‑LLM)

  • Concept: Split each weight matrix across GPUs along the column dimension. For a linear layer W ∈ ℝ^{out × in}, each device stores a slice W_i ∈ ℝ^{out/p × in} where p is the number of tensor‑parallel ranks.
  • Forward pass: Each GPU computes its partial output, then an All‑Gather operation assembles the full activation.
  • Advantages: Near‑linear scaling of parameter capacity; minimal per‑token latency increase because communication overlaps with computation.
  • Implementation tip: Use NCCL for high‑throughput All‑Gather. Megatron‑LLM’s torch.distributed wrapper abstracts this.

Pipeline Parallelism (DeepSpeed‑Inference)

  • Concept: Divide the model vertically into stages (e.g., first 12 layers on GPU‑0, next 12 on GPU‑1). Tokens flow through the pipeline in a micro‑batch fashion.
  • Latency vs. Throughput: Pipeline introduces pipeline bubbles that increase per‑token latency but boost overall throughput when serving many concurrent requests.
  • Hybridization: Combine tensor parallelism within each stage for memory efficiency while using pipeline parallelism across stages.

Hybrid Strategies and Sharding

  • ZeRO‑Inference (by Microsoft): Extends ZeRO‑Offload to inference, partitioning optimizer states, gradients, and parameters across CPU and GPU memory.
  • Sharded KV‑Cache: For long‑context generation, the key/value cache can be stored partially on CPU or on a separate GPU, reducing GPU memory pressure.
  • Model‑as‑a‑Service (MaaS) orchestration: Tools like vLLM, Triton Inference Server, and NVIDIA Triton provide APIs for dynamic scaling, request routing, and automatic sharding.

Case Study: Scaling a 30B Parameter Model on a 4‑Node GPU Cluster

Environment

ComponentSpecification
GPUs per node8 × NVIDIA A100‑80GB
Nodes4 (total 32 GPUs)
InterconnectMellanox HDR 200 Gbps InfiniBand
Software stackPyTorch 2.3, NCCL 2.20, Megatron‑LLM, custom FlashAttention kernel, Triton Server

Partitioning Strategy

  1. Tensor Parallelism (TP) = 8 – each linear layer split across 8 GPUs within a node.
  2. Pipeline Parallelism (PP) = 4 – each node hosts a pipeline stage of 12 transformer layers.
  3. KV‑Cache Sharding – the attention cache for each request is stored on the same node as its pipeline stage; cross‑node cache look‑ups are avoided.

Implementation Steps

# 1. Launch Megatron‑LLM with TP=8, PP=4
torchrun --nnodes=4 --nproc_per_node=8 \
    --rdzv_id=llm30b \
    --rdzv_backend=c10d \
    --rdzv_endpoint=host0:29500 \
    train.py \
    --model-size=30b \
    --tensor-parallel-size=8 \
    --pipeline-parallel-size=4 \
    --use-flash-attn \
    --custom-kernel-path=./custom_kernels/
  • train.py in this context runs inference only (no gradient updates) but loads the checkpoint and builds the inference graph.

Performance Results

MetricBaseline (PyTorch + cuBLAS)With Custom FlashAttention + TP/PP
Per‑token latency (single request, batch = 1)68 ms42 ms
Throughput (tokens / s, batch = 8)19003400
GPU memory usage (per GPU)78 GB (requires off‑load)73 GB (fits in 80 GB)
Network traffic (All‑Gather)~12 GB/s~8 GB/s (due to reduced precision)

The custom FlashAttention kernel eliminated redundant memory reads, while the hybrid TP/PP layout allowed the entire 30 B model to reside in GPU memory without CPU off‑load, preserving low latency.


Performance Evaluation: Metrics, Benchmarks, and Profiling

Key Metrics

MetricDefinition
Per‑token latencyTime from input token arrival to output token generation (ms).
ThroughputNumber of tokens processed per second (tokens/s).
GPU memory utilizationPeak memory used per device (GB).
PCIe/InfiniBand bandwidthEffective data transfer rate during collective ops (GB/s).
Kernel occupancyRatio of active warps to maximum possible (percentage).

Profiling Tools

  • Nsight Systems / Nsight Compute – Capture kernel launch timelines, shared‑memory usage, and occupancy.
  • NVIDIA‑SMI – Monitor GPU memory and power.
  • PyTorch Profiler – Instrument Python-level calls; can export Chrome tracing JSON.
  • Triton’s Model Analyzer – Provides end‑to‑end latency and throughput breakdown for deployed services.

Example Profiling Walkthrough

import torch
import torch.profiler as profiler

with profiler.profile(
    activities=[profiler.ProfilerActivity.CPU,
                profiler.ProfilerActivity.CUDA],
    schedule=profiler.schedule(wait=1, warmup=2, active=3, repeat=1),
    on_trace_ready=profiler.tensorboard_trace_handler('./log'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as p:
    for _ in range(10):
        out = model(input_ids)   # model includes custom kernel
        p.step()

The resulting TensorBoard view highlights:

  • FlashAttention kernel consumes 45 % of total GPU time but has >90 % occupancy.
  • All‑Gather across 32 GPUs averages 6 GB/s, well below the 200 Gbps InfiniBand ceiling, indicating room for batch‑size scaling.

Best Practices and Common Pitfalls

PracticeWhy It Matters
Align data types across kernels (e.g., keep everything in FP16 or BF16)Mixed‑precision conversions add latency and can cause numerical drift.
Prefer kernel fusion over sequential opsReduces global‑memory traffic and kernel launch overhead.
Profile before and after each changeGuarantees that an “optimization” isn’t just shifting the bottleneck.
Use NCCL’s ncclGroupStart/End for collective callsBatches multiple collectives into a single network round‑trip.
Validate numerical correctness (e.g., compare against FP32 baseline)Custom kernels can introduce rounding errors; unit tests with tolerance checks are essential.
Avoid over‑tiling shared memoryExcessive shared‑memory per block reduces occupancy.
Keep kernel launch parameters (grid/block) a multiple of the warp size (32)Guarantees full warp utilization.
Leverage existing open‑source kernels (FlashAttention, Triton kernels)Reinventing the wheel rarely yields better performance and increases maintenance burden.

Conclusion

Optimizing inference for large language models is a multi‑layered problem that spans low‑level kernel engineering and high‑level system design. Custom CUDA kernels—especially those that fuse linear projections, attention scoring, and softmax—can shave tens of milliseconds off per‑token latency by eliminating unnecessary memory traffic and exploiting Tensor Cores. When these kernels are combined with tensor‑parallel and pipeline‑parallel distribution strategies, even models with tens of billions of parameters can be served entirely on GPU memory with sub‑100 ms latency.

Key takeaways:

  1. Identify fusion opportunities early. The most common hot path in LLMs is the QKV projection + attention; a well‑written kernel here yields the biggest gains.
  2. Integrate kernels cleanly using PyTorch’s C++ extension or TensorRT plugins so that downstream tools (Triton, ONNX Runtime) can still orchestrate the model.
  3. Select a distributed topology that matches your hardware and workload: TP for memory‑bound models, PP for high‑throughput batch serving, or a hybrid for the best of both worlds.
  4. Profile relentlessly—both at the kernel level (occupancy, shared memory) and the system level (network bandwidth, KV‑cache sharding).

By following the patterns and examples presented in this article, you can build a production‑grade LLM inference pipeline that delivers both speed and scalability, unlocking the full potential of generative AI in real‑world applications.


Resources

  • CUDA Programming Guide – Official NVIDIA documentation covering kernel design, memory hierarchy, and performance tips.
    CUDA Docs

  • FlashAttention: Fast and Memory‑Efficient Attention – Open‑source library implementing fused, numerically stable attention kernels.
    FlashAttention GitHub

  • Megatron‑LLM – Repository for large‑scale model parallel training and inference, including tensor‑parallel utilities.
    Megatron‑LLM GitHub

  • DeepSpeed‑Inference – Microsoft’s library for efficient inference with pipeline parallelism and ZeRO‑Inference.
    DeepSpeed‑Inference Docs

  • NVIDIA Triton Inference Server – Production inference server supporting custom CUDA kernels via plugins and model ensembles.
    Triton Server

  • PyTorch Distributed Overview – Official guide to using torch.distributed and NCCL for multi‑GPU/multi‑node training and inference.
    PyTorch Distributed

  • TensorRT Custom Plugins – Documentation on extending TensorRT with user‑defined GPU kernels.
    TensorRT Plugins