Benchmarking Memory‑Efficient Transformer Architectures for Real‑Time Inference on Embedded Systems
Table of Contents

1. Introduction
2. Why Transformers on Embedded Devices?
3. Memory-Efficient Transformer Variants
   3.1 DistilBERT & TinyBERT
   3.2 MobileBERT
   3.3 Linformer
   3.4 Performer & FAVOR+
   3.5 Reformer
   3.6 Quantized & Pruned Models
4. Embedded Platforms & Toolchains
5. Benchmark Design
   5.1 Metrics to Capture
   5.2 Datasets & Workloads
   5.3 Measurement Methodology
6. Implementation Walk-Through
   6.1 Preparing a Model with Hugging Face & ONNX
   6.2 Converting to TensorFlow Lite (TFLite)
   6.3 Deploying on a Cortex-M55 MCU
7. Experimental Results
   7.1 Latency & Throughput
   7.2 Memory Footprint
   7.3 Energy Consumption
   7.4 Accuracy Trade-offs
8. Interpretation & Best-Practice Guidelines
9. Future Directions
10. Conclusion
11. Resources

Introduction

Transformer models have become the de facto standard for natural language processing (NLP) and computer vision, and increasingly for multimodal AI. Their self-attention mechanism delivers state-of-the-art performance on tasks ranging from language translation to object detection. However, the same architectural strengths that make transformers powerful also make them resource-hungry: they demand gigabytes of RAM, billions of FLOPs, and high memory bandwidth. ...
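To make "resource-hungry" concrete, here is a back-of-envelope sketch in Python. The configuration it assumes (roughly BERT-base: ~110 M parameters, 12 attention heads, sequence length 512, FP32 weights) is illustrative only, not a measurement from the benchmarks later in this article:

```python
# Rough estimate of the memory a vanilla transformer needs at inference time.
# All figures are illustrative (BERT-base-like assumptions), not benchmark data.

def param_memory_mb(n_params: int, bytes_per_param: int = 4) -> float:
    """Memory to hold the weights alone (FP32 = 4 bytes per parameter)."""
    return n_params * bytes_per_param / (1024 ** 2)

def attention_scores_mb(seq_len: int, n_heads: int, bytes_per_elem: int = 4) -> float:
    """Memory for one layer's seq_len x seq_len attention-score matrices."""
    return n_heads * seq_len * seq_len * bytes_per_elem / (1024 ** 2)

if __name__ == "__main__":
    print(f"FP32 weights, ~110M params: {param_memory_mb(110_000_000):.0f} MB")            # ~420 MB
    print(f"Attention scores, 1 layer @ seq_len=512: {attention_scores_mb(512, 12):.0f} MB")  # 12 MB
```

Even this mid-sized model needs hundreds of megabytes for its FP32 weights alone, while a typical Cortex-M-class MCU offers on the order of a few hundred kilobytes to a few megabytes of SRAM. That gap is exactly what the memory-efficient variants surveyed below try to close.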