Optimizing Local Reasoning: A Practical Guide to Fine-Tuning 1-Bit LLMs for Edge Devices

Introduction

Large language models (LLMs) have transformed how we interact with text, code, and even multimodal data. Yet the most powerful models—GPT‑4, Claude, Llama‑2‑70B—require hundreds of gigabytes of memory and powerful GPUs to run, limiting their use to cloud environments.

Edge devices—smartphones, IoT gateways, micro‑robots, and AR glasses—operate under strict constraints:

- Memory: often less than 2 GB of RAM.
- Compute: fixed‑point or low‑power CPUs/NPUs, rarely a desktop‑class GPU.
- Latency: real‑time interaction demands sub‑100 ms inference.
- Privacy: on‑device processing avoids sending sensitive data to the cloud.

Emerging 1‑bit quantization (also called binary quantization, or ternary when a small number of extra states are added) promises to shrink model size by 32× compared to full‑precision (FP32) weights. Combined with modern parameter‑efficient fine‑tuning techniques (LoRA, adapters, prefix‑tuning), it lets us adapt a large pre‑trained model to a specific domain while keeping the footprint manageable for edge deployment. ...
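To make the 32× figure concrete, here is a minimal sketch of one common binarization scheme (the XNOR‑Net‑style approximation W ≈ α · sign(W), with α chosen as the mean absolute weight). The `binarize` helper is hypothetical and for illustration only; real 1‑bit pipelines pack the signs into bit arrays and use specialized kernels.

```python
import numpy as np

def binarize(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Approximate a weight tensor as alpha * sign(W).

    alpha = mean(|W|) is the per-tensor scale that minimizes the
    L2 reconstruction error for this scheme. The signs are what
    actually gets stored: 1 bit per weight instead of 32 for FP32,
    hence the roughly 32x size reduction.
    """
    alpha = float(np.mean(np.abs(weights)))
    signs = np.where(weights >= 0, 1, -1).astype(np.int8)
    return signs, alpha

# Example: binarize a small random weight matrix.
w = np.random.randn(4, 4).astype(np.float32)
signs, alpha = binarize(w)
approx = alpha * signs  # dequantized approximation used at inference
```

Note that `signs` is stored as int8 here only for readability; an actual deployment would bit-pack it (8 weights per byte) to realize the full compression.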

March 30, 2026 · 10 min · 1919 words · martinuke0