Orchestrating Cross-Shard Consistency for Distributed Inference in Decentralized Heterogeneous Compute Clusters
Introduction

The rise of large-scale neural models - such as transformer-based language models with billions of parameters - has pushed inference workloads beyond the capacity of a single GPU or even a single server. To meet latency, throughput, and cost constraints, organizations increasingly slice models across shards (sub-models) and spread those shards across a decentralized heterogeneous compute cluster. In such an environment, each shard may run on a different hardware accelerator (GPU, TPU, FPGA, or even CPU) and be managed by distinct orchestration layers (Kubernetes, Nomad, custom edge-node managers, etc.). ...
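To make the sharding idea concrete, the following is a minimal sketch of assigning model shards to a heterogeneous device pool. All names (`Shard`, `Device`, `assign_shards`) and the greedy first-fit policy are illustrative assumptions, not a method prescribed by this article; real placement engines also weigh interconnect bandwidth, accelerator kind, and latency targets.

```python
from dataclasses import dataclass

@dataclass
class Shard:
    name: str
    params_gb: float  # approximate parameter memory footprint

@dataclass
class Device:
    name: str
    kind: str         # e.g. "gpu", "tpu", "fpga", "cpu"
    mem_gb: float
    used_gb: float = 0.0

def assign_shards(shards, devices):
    """Greedy first-fit placement, largest shards first.

    Returns a dict mapping shard name -> device name; raises if some
    shard fits on no device. (Hypothetical helper for illustration only.)
    """
    placement = {}
    for shard in sorted(shards, key=lambda s: s.params_gb, reverse=True):
        for dev in devices:
            if dev.mem_gb - dev.used_gb >= shard.params_gb:
                dev.used_gb += shard.params_gb
                placement[shard.name] = dev.name
                break
        else:
            raise RuntimeError(f"no device can hold shard {shard.name!r}")
    return placement

# Example: four transformer-layer shards over a heterogeneous device trio.
shards = [Shard("layers_0_7", 16.0), Shard("layers_8_15", 16.0),
          Shard("layers_16_23", 16.0), Shard("embedding", 6.0)]
devices = [Device("gpu-large", "gpu", 40.0), Device("gpu-small", "gpu", 16.0),
           Device("cpu-node", "cpu", 64.0)]
print(assign_shards(shards, devices))
```

First-fit keeps the sketch short; the point is only that once shards land on devices with different capacities and managers, every downstream inference request must be routed through this placement consistently, which is exactly the cross-shard coordination problem the rest of the article addresses.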