Model Compression

Introduction

Large language models (LLMs) have reshaped natural-language processing (NLP) by delivering impressive capabilities, from code generation to conversational agents. Yet most of these breakthroughs rely on massive cloud-based infrastructure that demands terabytes of storage, multi-GPU clusters, and high-bandwidth network connections. For many real-world applications (smartphones, wearables, industrial IoT gateways, autonomous drones, and AR/VR headsets), latency, privacy, and connectivity constraints make cloud-only inference impractical. Enter local LLMs, a rapidly growing ecosystem of compact, efficient models designed to run on-device or at the edge. This article takes a deep dive into the state of local LLMs, focusing on the technical strategies that let small language models operate under tight memory, compute, and power budgets while still delivering useful functionality. We'll explore the evolution of model compression, hardware-aware design, deployment frameworks, and real-world case studies, and conclude with a practical example of running a 7B-parameter model on a Raspberry Pi 4. ...

Table of contents

Introduction
What is quantization (simple explanation)
Why quantize LLMs? Costs, memory, and latency
Quantization primitives and concepts
  Precision (bit widths)
  Range, scale and zero-point
  Uniform vs non-uniform quantization
  Blockwise and per-channel scaling
Main quantization workflows
  Post-Training Quantization (PTQ)
  Quantization-Aware Training (QAT)
  Hybrid and mixed-precision approaches
Practical algorithms and techniques
  Linear (symmetric) quantization
  Affine (zero-point) quantization
  Blockwise / groupwise quantization
  K-means and non-uniform quantization
  Persistent or learned scales, GPTQ-style (second-order aware) methods
  Quantizing KV caches and activations
Tools, libraries and ecosystem (how to get started)
  Bitsandbytes, GGML, Hugging Face & Quanto, PyTorch, GPTQ implementations
End-to-end example: quantize a transformer weight matrix (code)
Best practices and debugging tips
Limitations and failure modes
Future directions
Conclusion
Resources

Introduction

Quantization reduces the numeric precision of a model's parameters (and sometimes activations) so that a trained Large Language Model (LLM) needs fewer bits to store and compute with its values. The result: much smaller models, lower memory use, faster inference, and often reduced cost with only modest accuracy loss when done well[2][5]. ...
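To make the idea of "fewer bits per value" concrete, here is a minimal sketch of linear (symmetric) quantization in plain Python. It is an illustration under stated assumptions, not the method of any particular library: the function names `quantize_symmetric` and `dequantize` and the sample weights are hypothetical, and a single scale is shared by the whole list, whereas real toolkits typically use per-channel or blockwise scales.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale.

    Hypothetical helper for illustration; assumes a flat list of floats.
    """
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]


# Made-up sample values standing in for a slice of a weight matrix.
weights = [0.12, -0.53, 0.98, -1.27, 0.0, 0.45]

q, scale = quantize_symmetric(weights, bits=8)
restored = dequantize(q, scale)

# The worst-case round-trip error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is now stored as one 8-bit integer plus a shared floating-point scale instead of a 16- or 32-bit float, which is where the memory saving comes from. The round-trip error stays within half a step (`scale / 2`), which is why the accuracy loss can remain modest when the value range is well covered.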

The State of Local LLMs: Optimizing Small Language Models for On-Device Edge Computing

How Quantization Works in LLMs: Zero to Hero