Optimizing Small Language Models: Pruning, Quantization, and Techniques for Local Edge Inference
A practical guide for engineers who need to run small language models on edge hardware, covering pruning, quantization, and architecture patterns that keep latency low and memory use small.