Introduction

The demand for real‑time machine‑learning inference has exploded across industries—from recommendation engines that serve millions of users per second to autonomous‑vehicle perception stacks that must make decisions within a few milliseconds. While training pipelines have long benefited from massive GPU clusters and sophisticated graph optimizers, production inference workloads present a different set of challenges:

  1. Latency guarantees – Many user‑facing services cannot tolerate more than a few tens of milliseconds of tail latency.
  2. Throughput pressure – A single model may need to process thousands of requests per second on a single node, let alone across a fleet.
  3. Heterogeneous hardware – Inference services often run on a mix of CPUs, GPUs, TPUs, and even specialized ASICs.
  4. Dynamic traffic – Request rates fluctuate dramatically throughout the day, requiring systems that can adapt on‑the‑fly.

Two techniques have emerged as decisive levers for meeting these constraints:

  • Custom kernel optimization – Tailoring low‑level compute kernels (e.g., matrix‑multiply, activation functions) to the exact shape, data type, and hardware characteristics of a model.
  • Adaptive batching – Dynamically grouping incoming requests into batches that maximize hardware utilization while respecting latency Service Level Objectives (SLOs).

In this article we dive deep into both topics, explore how they interact in a distributed inference engine, and provide practical code snippets, architectural patterns, and a real‑world case study. By the end you will have a concrete roadmap for scaling your own inference services from a single GPU node to a multi‑region, auto‑scaling deployment.


1. Understanding Distributed Inference Engines

1.1 Core Architecture

A typical distributed inference stack consists of three logical layers:

LayerResponsibilityCommon Technologies
IngressAccepts HTTP/gRPC/REST requests, performs authentication, and enqueues payloads.Envoy, NGINX, Istio, custom API gateways
Scheduler / BatcherDetermines how many requests to group together, when to dispatch them, and on which device.Ray Serve, Triton Inference Server, custom async loop
ExecutorRuns the model graph on a device, returning predictions.TensorRT, ONNX Runtime, PyTorch TorchServe, TVM runtime

The scheduler is the heart of the system when it comes to scaling. It must balance throughput (maximizing device utilization) against latency (meeting per‑request deadlines). A naïve static batching policy—e.g., “always batch 32 inputs”—fails under variable load and can easily violate latency SLOs.

1.2 Common Bottlenecks

BottleneckSymptomsTypical Root Cause
Kernel inefficiencyLow GPU occupancy, high compute time per tokenGeneric kernels not specialized for model dimensions or data layout
Batching overheadSudden latency spikes when queue builds upFixed batch size causing idle time or oversized batches
Network latencyEnd‑to‑end latency > 100 ms even on fast GPUsInefficient inter‑node RPC, lack of RDMA, or suboptimal serialization
CPU‑GPU synchronizationGPU stalls waiting for host dataFrequent host‑to‑device copies, poor pinned‑memory usage

Addressing these pain points requires a coordinated effort across the three layers: custom kernels shave compute time, while adaptive batching reduces idle periods and keeps the pipeline fluid.


2. Custom Kernel Optimization

2.1 Why Custom Kernels Matter

Deep‑learning frameworks ship with highly optimized kernels for common operations (e.g., cuBLAS GEMM, cuDNN convolutions). However, inference workloads often involve:

  • Static shapes – at inference time the batch size and sequence length are known, enabling compile‑time specialization.
  • Reduced precision – INT8, BF16, or even custom 4‑bit formats that generic kernels may not fully exploit.
  • Model‑specific patterns – e.g., fused attention + softmax in transformer models, or residual connections that can be combined into a single kernel.

When a kernel is over‑generalized, the GPU may spend cycles on handling edge cases that never occur in production. By writing a custom kernel that is narrowly tailored, you can:

  • Increase occupancy (more warps per SM)
  • Reduce memory traffic (fewer loads/stores, better cache reuse)
  • Eliminate extra kernel launches (fusing multiple ops)

2.2 Profiling and Identifying Hot Spots

Before writing any code, collect detailed performance data. The workflow is:

  1. Instrument the inference server using NVIDIA Nsight Systems or TensorBoard Profiler.
  2. Capture a representative workload (e.g., 1 k requests with realistic payloads).
  3. Identify kernels that dominate GPU time (>30 % of total compute) and those with low FLOP‑to‑byte ratios.
# Example: Using Nsight Systems to capture a 10‑second trace
nsys profile -t cuda,osrt -o inference_trace python serve.py

Open the generated .qdrep file in the Nsight UI; look for “Top Kernels” and note their launch configurations.

2.3 Writing a Custom CUDA Kernel

Below is a minimal example of a fused MatMul + Bias + ReLU kernel that replaces three separate calls (cublasGemm, cudaMemcpy, cudnnActivation). The kernel assumes:

  • Input matrix A: (M × K)
  • Weight matrix W: (K × N)
  • Bias vector b: (N)
  • Output C: (M × N)
// fused_gemm_bias_relu.cu
#include <cuda_fp16.h>
#include <cuda_runtime.h>

template <typename T>
__global__ void fused_gemm_bias_relu(
    const T* __restrict__ A,
    const T* __restrict__ W,
    const T* __restrict__ b,
    T* __restrict__ C,
    int M, int N, int K)
{
    // Tile size – 128 threads per block (16x8)
    const int TILE_M = 128;
    const int TILE_N = 64;
    const int TILE_K = 8;

    // Compute global row/col for this thread
    int row = blockIdx.y * TILE_M + threadIdx.y;
    int col = blockIdx.x * TILE_N + threadIdx.x;

    if (row >= M || col >= N) return;

    // Accumulator
    T acc = static_cast<T>(0);

    // Loop over K dimension in tiles
    for (int t = 0; t < K; t += TILE_K) {
        // Load a tile of A and W into registers (or shared memory for larger tiles)
        T a = A[row * K + (t + threadIdx.x)];
        T w = W[(t + threadIdx.y) * N + col];
        acc = fma(a, w, acc); // fused multiply‑add
    }

    // Add bias
    acc = acc + b[col];

    // Apply ReLU
    acc = max(acc, static_cast<T>(0));

    // Store result
    C[row * N + col] = acc;
}

// Explicit instantiation for half‑precision
extern "C" void launch_fused_gemm_bias_relu_half(
    const half* A, const half* W, const half* b, half* C,
    int M, int N, int K, cudaStream_t stream)
{
    dim3 block(16, 8);               // 128 threads per block
    dim3 grid((N + 63) / 64, (M + 127) / 128);
    fused_gemm_bias_relu<half><<<grid, block, 0, stream>>>(A, W, b, C, M, N, K);
}

Key optimizations:

  • Tile‑based computation reduces global memory loads.
  • Using FMA (fused multiply‑add) ensures a single instruction per multiply‑add pair.
  • The kernel works directly on half‑precision (__half) to exploit Tensor‑Cores on modern NVIDIA GPUs (by compiling with -arch=sm_80 or later).

2.4 Integrating Custom Kernels with Inference Frameworks

Most serving stacks expose a custom operator interface:

FrameworkIntegration Method
TensorRTIPluginV2DynamicExt implementation
ONNX RuntimeCustomOp registration via C++/Python
PyTorch TorchServetorch.utils.cpp_extension and torch.nn.Module wrapper
TVMrelay.op.op.get with external codegen

TensorRT Plugin Example (C++)

class FusedGemmBiasReLUPlugin : public nvinfer1::IPluginV2DynamicExt {
public:
    // ... constructor, getOutputDimensions, etc.

    int enqueue(const nvinfer1::PluginTensorDesc* inputDesc,
                const nvinfer1::PluginTensorDesc* outputDesc,
                const void* const* inputs, void* const* outputs,
                void* workspace, cudaStream_t stream) noexcept override
    {
        // Cast inputs to half*
        const half* A = static_cast<const half*>(inputs[0]);
        const half* W = static_cast<const half*>(inputs[1]);
        const half* b = static_cast<const half*>(inputs[2]);
        half* C = static_cast<half*>(outputs[0]);

        launch_fused_gemm_bias_relu_half(A, W, b, C,
            m_, n_, k_, stream);
        return 0;
    }
};

Register the plugin with TensorRT’s IPluginRegistry and reference it in an ONNX model using a custom node name. At runtime TensorRT will call the fused kernel instead of performing three separate operations.

2.5 When Not to Write a Custom Kernel

  • Low‑frequency models – The engineering effort may outweigh performance gains.
  • Rapid model iteration – Custom kernels lock you into a specific shape; frequent changes require recompilation.
  • Hardware portability – Hand‑crafted CUDA kernels won’t run on CPUs or other accelerators without a comparable rewrite.

In such cases, consider auto‑tuning compilers (TVM, XLA) that can generate specialized kernels automatically.


3. Adaptive Batching Strategies

3.1 Fixed vs. Dynamic Batching

StrategyHow It WorksProsCons
Fixed batchingAlways accumulate exactly B requests before dispatch.Simple, predictable GPU utilization.May add unnecessary waiting time under low load; may overflow latency budget under spikes.
Dynamic (adaptive) batchingAdjust batch size based on current queue depth, request deadline, and device load.Minimizes latency while still exploiting parallelism.More complex scheduler; requires accurate deadline tracking.

Static batching is acceptable for offline batch jobs, but for interactive services we need a policy that respects per‑request latency SLOs (e.g., 99‑th percentile ≤ 30 ms).

3.2 Latency vs. Throughput Trade‑offs

Let:

  • τ = target latency SLO (ms)
  • B_max = maximum batch size the GPU can efficiently handle
  • Δ = per‑batch processing time (including kernel launch + compute)

If we always wait for B_max requests, the worst‑case waiting time is:

wait_time = (B_max - 1) * inter_arrival_time

When traffic is sparse (inter_arrival_time » Δ), latency balloons. Adaptive batching solves this by capping the wait time to a fraction of τ. A common heuristic:

max_wait = τ * 0.5   # reserve half of latency budget for compute

If the queue hasn’t reached B_max within max_wait, dispatch whatever is available.

3.3 Queue Management and Load Prediction

A robust scheduler maintains two data structures:

  1. Priority Queue ordered by deadline (request arrival time + τ). This ensures the earliest‑deadline request is processed first.
  2. Batch Buffer that aggregates requests until either:
    • len(buffer) == B_max, or
    • now - buffer[0].arrival >= max_wait.

A simple exponential moving average (EMA) can predict upcoming request rate and proactively adjust B_max:

def update_batch_size(rate_ema, min_bs=4, max_bs=128):
    # Scale batch size linearly with predicted rate
    # Assume 1000 rps => max_bs, 10 rps => min_bs
    scale = (rate_ema - 10) / (1000 - 10)
    bs = int(min_bs + scale * (max_bs - min_bs))
    return max(min_bs, min(bs, max_bs))

3.4 Example Implementation (Python + asyncio)

Below is a compact, production‑ready scheduler that combines deadline‑aware queuing, adaptive batch sizing, and optional GPU pinned‑memory buffers.

# async_batcher.py
import asyncio
import time
from collections import deque
from typing import Any, List, Tuple

# Configuration
LATENCY_SLO_MS = 30          # 99th‑pct latency target
MAX_BATCH_SIZE = 128
MIN_BATCH_SIZE = 4
EMA_ALPHA = 0.2              # smoothing factor for request rate EMA

class Request:
    def __init__(self, payload: Any, future: asyncio.Future):
        self.payload = payload
        self.arrival = time.time()
        self.deadline = self.arrival + LATENCY_SLO_MS / 1000.0
        self.future = future

class AdaptiveBatcher:
    def __init__(self, infer_fn):
        self.infer_fn = infer_fn          # async function that runs on GPU
        self.queue = deque()
        self.rate_ema = 0.0
        self.last_tick = time.time()
        self.loop = asyncio.get_event_loop()

    async def submit(self, payload: Any) -> Any:
        fut = self.loop.create_future()
        self.queue.append(Request(payload, fut))
        return await fut

    def _update_rate(self, count: int, interval: float):
        cur_rate = count / interval
        self.rate_ema = EMA_ALPHA * cur_rate + (1 - EMA_ALPHA) * self.rate_ema

    async def _batch_worker(self):
        while True:
            start = time.time()
            # Wait until at least one request is present
            while not self.queue:
                await asyncio.sleep(0.0005)

            # Determine dynamic batch size based on EMA
            batch_size = max(
                MIN_BATCH_SIZE,
                min(
                    MAX_BATCH_SIZE,
                    int(self.rate_ema) or MIN_BATCH_SIZE
                )
            )

            batch = []
            deadlines = []

            # Pull up to batch_size items respecting max_wait
            max_wait = LATENCY_SLO_MS / 1000.0 * 0.5
            while len(batch) < batch_size and self.queue:
                req = self.queue[0]
                if time.time() - req.arrival > max_wait:
                    # Timeout reached – break to keep latency low
                    break
                batch.append(self.queue.popleft())
                deadlines.append(req.deadline)

            # Pad batch if empty (rare edge case)
            if not batch:
                await asyncio.sleep(0.0001)
                continue

            # Prepare batched input for GPU (example: torch.stack)
            inputs = [r.payload for r in batch]
            batched_tensor = self._stack_inputs(inputs)

            # Run inference (assume infer_fn returns a torch tensor)
            results = await self.infer_fn(batched_tensor)

            # Scatter results back to individual futures
            for req, out in zip(batch, results):
                if not req.future.done():
                    req.future.set_result(out)

            # Update request rate EMA
            elapsed = time.time() - start
            self._update_rate(len(batch), elapsed)

    def _stack_inputs(self, inputs: List[Any]):
        # Placeholder: replace with actual tensor stacking, e.g., torch.stack
        import torch
        return torch.stack(inputs, dim=0)

    def start(self):
        self.loop.create_task(self._batch_worker())

How to use:

import torch
import asyncio

async def gpu_infer(batched_tensor):
    # Simulate a GPU kernel; replace with real model forward()
    await asyncio.sleep(0.001)   # pretend GPU latency ~1 ms
    return batched_tensor.mean(dim=1)   # dummy output

batcher = AdaptiveBatcher(gpu_infer)
batcher.start()

async def client_simulation():
    for i in range(500):
        payload = torch.randn(128)   # dummy input vector
        result = await batcher.submit(payload)
        print(f"Request {i} got result {result.shape}")
        await asyncio.sleep(0.002)   # 500 rps overall

asyncio.run(client_simulation())

The scheduler:

  • Tracks request rate using an EMA to adjust batch size.
  • Limits waiting time to half the latency budget, guaranteeing that even under low traffic the request won’t wait longer than max_wait.
  • Runs asynchronously, enabling a single Python process to handle thousands of concurrent connections without blocking.

3.5 Deploying the Scheduler at Scale

When scaling to multiple nodes, the batcher can be instantiated per GPU or per worker process. Coordination across workers is typically handled by a front‑end load balancer (e.g., Envoy) that performs consistent hashing of request keys to a specific worker, preserving cache locality. For multi‑region deployments, a global router can direct traffic based on latency probes, falling back to the nearest region that still has enough capacity.


4. Scaling Across Nodes

4.1 Data Parallel vs. Model Parallel Inference

StrategyDescriptionWhen to Use
Data ParallelReplicate the full model on each node; each node processes a distinct subset of requests.Model fits comfortably on a single device; traffic is high enough to justify multiple replicas.
Model ParallelPartition the model across devices (e.g., pipeline parallelism).Model is too large for a single GPU (e.g., 100‑B parameter LLM).
HybridCombine both: shard large model across 2‑4 GPUs per node, then replicate the shard group across many nodes.Very large models with massive request volume.

Inference‑specific nuance: Unlike training, inference rarely requires gradient synchronization, so data‑parallel scaling is much simpler—just replicate the model and expose a load‑balanced endpoint.

4.2 Orchestration with Kubernetes and Ray

  • Kubernetes provides pod‑level replication, auto‑scaling based on custom metrics (e.g., GPU utilization). Use HorizontalPodAutoscaler (HPA) with a custom metrics adapter that reads GPU usage from NVIDIA DCGM.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: gpu_utilization
        selector:
          matchLabels:
            device: "GPU0"
      target:
        type: AverageValue
        averageValue: "70"
  • Ray Serve offers a higher‑level abstraction: define a deployment that automatically batches incoming requests, and Ray will schedule it across a cluster of nodes. It also integrates with Ray Train for model‑parallel inference.
from ray import serve

@serve.deployment(num_replicas=4, max_batch_size=64, batch_wait_timeout_s=0.005)
class MyModel:
    def __init__(self):
        self.model = load_my_trt_engine()
    async def __call__(self, request):
        inputs = preprocess(request.json())
        outputs = self.model.infer_async(inputs)
        return postprocess(outputs)

MyModel.deploy()

Ray’s batch_wait_timeout_s is essentially the max_wait we discussed earlier, making adaptive batching a built‑in feature.

4.3 Network Considerations

When inference spans multiple nodes, inter‑node latency can dominate. Strategies to mitigate:

  • RDMA / RoCE – Direct memory access over Ethernet eliminates kernel‑space copies; supported by NCCL and gRPC‑based transports.
  • gRPC with protobuf compression – Keep payload size low; compress large tensors (e.g., INT8) before transmission.
  • Model‑output streaming – For large outputs (e.g., token generation), stream partial results instead of waiting for the entire batch.

A typical pipeline:

  1. Front‑end receives request → extracts key → hashes to worker node.
  2. Worker node’s batcher aggregates requests locally (no network cost).
  3. If a request needs a shard on another node (model‑parallel), the batcher uses high‑speed NCCL to exchange intermediate tensors.
  4. Final output is sent back to the front‑end, which forwards to the client.

5. End‑to‑End Case Study: Real‑Time Recommendation at Scale

5.1 Problem Statement

A global e‑commerce platform serves personalized product recommendations to 150 M daily active users. During peak hours, the service must handle ≈ 80 k requests per second with a 99‑th percentile latency ≤ 30 ms. The model is a two‑tower deep‑learning architecture:

  • User tower – 12‑layer MLP, input size 256 features.
  • Item tower – 10‑layer MLP, input size 128 features.
  • Final dot‑product similarity computed on GPU.

The model fits comfortably on a single NVIDIA A100, but the traffic exceeds what a single GPU can sustain.

5.2 Baseline System

  • Deployment: Single‑node Triton Inference Server with default TensorRT engine.
  • Batching: Fixed batch size of 32.
  • Metrics: Throughput ~ 12 k rps; 99‑pct latency ≈ 78 ms.

The bottlenecks identified via Nsight:

  • Kernel 1 – GEMM for user tower (≈ 45 % of GPU time) using generic cuBLAS.
  • Kernel 2 – GEMM for item tower (≈ 30 %).
  • Batch wait – Under low traffic, the server waited up to 25 ms for the batch to fill.

5.3 Optimizations Applied

5.3.1 Custom Kernel Fusion

  • Developed a single fused kernel that computes both towers and the final dot‑product in one launch.
  • Leveraged Tensor‑Core half‑precision (FP16) with bias addition and ReLU fused.
  • Result: GPU compute time reduced from 75 ms to 28 ms per batch.

5.3.2 Adaptive Batching

  • Switched to the async_batcher implementation from Section 3.
  • Configured max_wait = 10 ms (≈ 33 % of latency budget) and allowed batch size to vary between 8 and 64 based on EMA of incoming request rate.
  • Under heavy load, batch size grew to 48; under light load, it dropped to 12, keeping average wait time < 7 ms.

5.3.3 Horizontal Scaling

  • Deployed 4 replicas of the custom engine using Kubernetes HPA, each bound to its own A100.
  • Added consistent hashing in Envoy to keep a user’s requests routed to the same replica (improves cache hit rate for embedding tables).

5.4 Results

MetricBeforeAfter
Throughput12 k rps48 k rps (≈ 4×)
99‑pct latency78 ms28 ms
GPU utilization45 %92 %
CPU‑to‑GPU copy time12 ms per batch3 ms per batch (pinned memory)

The combination of a single fused kernel and adaptive batching yielded more than a 3× latency reduction, while horizontal scaling provided the necessary throughput to meet peak demand.

5.5 Lessons Learned

  1. Profiling first – The custom kernel effort was justified only after confirming that GEMMs consumed the majority of GPU time.
  2. Latency‑budget allocation – Reserving half the SLO for compute gave enough headroom for network and preprocessing.
  3. Dynamic batch sizing – A simple EMA of request rate proved sufficient; more sophisticated predictors (e.g., ARIMA) gave marginal gains but added complexity.
  4. Observability – Exporting per‑batch latency, batch size, and GPU occupancy to Prometheus allowed the HPA to react promptly to traffic spikes.

6. Best Practices and Pitfalls

6.1 Monitoring & Observability

  • Metrics to expose:
    • batch_size, batch_wait_time_ms, kernel_compute_ms
    • GPU utilization (nvidia_smi or DCGM exporter)
    • Request‑level latency histogram (e.g., http_request_duration_seconds_bucket)
  • Tracing – Use OpenTelemetry to link request IDs across ingress, batcher, and executor. This helps pinpoint whether latency is caused by waiting, compute, or network.

6.2 Versioning & Reproducibility

  • Docker images should be built with the exact CUDA, cuDNN, and TensorRT versions.
  • Kernel binaries (e.g., .so files) must be version‑tagged and loaded via a deterministic naming scheme.
  • Store model artifacts (weights, ONNX files) alongside the custom kernel source in a CI/CD repository.

6.3 Hardware‑Specific Quirks

GPUQuirkMitigation
A100Tensor‑Core requires matrix dimensions to be multiples of 8 for FP16.Pad inputs or use kernel that handles remainder tiles.
T4Lower memory bandwidth; kernel fusion yields larger gains.Prioritize fused kernels over pure compute.
CPU (AVX‑512)Vector width mismatches with custom kernels compiled for 256‑bit AVX.Re‑compile with appropriate -mavx512f flag or use oneDNN primitives.

6.4 Common Pitfalls

  • Over‑fusing – Combining too many ops can increase register pressure, causing spills and reducing performance. Profile after each fusion step.
  • Ignoring warm‑up – First few batches often suffer from kernel loading overhead; exclude them from latency statistics.
  • Static batch timeout – A hard‑coded max_wait may be too aggressive during traffic spikes, leading to under‑utilized GPUs. Adjust dynamically based on observed queue depth.

7. Future Directions

7.1 Compiler‑Driven Optimizations

  • TVM and TensorFlow XLA now support auto‑scheduling that can generate kernels specialized for a given input shape, precision, and target hardware. They can also perform layout transformations (e.g., NHWC → NCHW) automatically.
  • Emerging MLIR‑based pipelines aim to unify front‑ends (PyTorch, TensorFlow) with backend codegen, reducing the need for hand‑written kernels.

7.2 Serverless Inference

Platforms such as AWS Lambda + Inferentia or Google Cloud Run + GPUs are making “function‑as‑a‑service” inference possible. The challenge is to preserve batching across stateless invocations—research is exploring stateful sidecars that maintain a batch buffer across function calls.

7.3 Auto‑Tuning Batching Policies

Machine‑learning‑based controllers (e.g., reinforcement‑learning agents) can learn optimal batch size and timeout policies by observing real‑time SLA violations. Early prototypes show 5‑10 % latency improvement over hand‑tuned EMA policies.


Conclusion

Scaling distributed inference engines is a multi‑dimensional problem that sits at the intersection of systems engineering, hardware‑aware programming, and machine‑learning performance. By:

  1. Profiling to locate the heaviest kernels,
  2. Crafting custom fused kernels that exploit the exact shape, precision, and hardware features of your model,
  3. Implementing adaptive batching that respects latency budgets while maximizing throughput,
  4. Deploying across a cluster with proper orchestration, networking, and observability,

you can achieve dramatic reductions in latency and massive gains in throughput—often without adding new hardware. The case study demonstrates that a well‑engineered combination of these techniques can turn a 12 k rps service into a 48 k rps, sub‑30 ms latency system.

As the AI ecosystem evolves, expect compilers to take over more of the kernel‑generation workload, and serverless platforms to demand smarter batching across stateless boundaries. Nevertheless, the principles outlined here—measure first, fuse where it matters, and batch adaptively—will remain foundational for any organization that needs to serve machine‑learning models at scale.


Resources