Table of Contents
- Introduction
- Understanding the Edge Landscape
- Choosing the Right Small Language Model
- Model Compression Techniques
- 4.1 Quantization
- 4.2 Pruning
- 4.3 Knowledge Distillation
- 4.4 Low‑Rank Factorization
- Efficient Model Formats for Edge
- Runtime Optimizations
- Deployment Pipelines for Edge Devices
- Real‑World Example: TinyLlama on a Raspberry Pi 5
- Monitoring, Profiling, and Debugging
- Security & Privacy Considerations
- Looking Ahead: 2026 Trends in Edge LLMs
12Conclusion
13Resources
Introduction
Large language models (LLMs) have transformed the way we interact with software, but their sheer size and compute appetite still keep most of the heavy lifting in the cloud. In 2026, a new wave of small language models (SLMs)—often under 10 B parameters—makes it feasible to run sophisticated natural‑language capabilities locally on edge devices such as Raspberry Pi, Jetson Nano, or even micro‑controller‑class hardware.
Running an LLM at the edge brings tangible benefits:
- Latency: No network round‑trip, sub‑100 ms response times for many use‑cases.
- Privacy: Sensitive data never leaves the device.
- Resilience: Offline operation when connectivity is spotty or unavailable.
- Cost: Eliminates recurring cloud inference fees.
However, squeezing a language model into a few hundred megabytes of RAM, a limited CPU/GPU budget, and a power envelope measured in watts is non‑trivial. This guide walks you through the full lifecycle—from model selection and compression to runtime tuning, deployment, and monitoring—so you can deliver reliable, performant SLM inference on the edge.
Note: The techniques described here are applicable to any transformer‑based language model (e.g., LLaMA‑derived, Mistral, Gemma, or custom architectures). Adjust hyper‑parameters and toolchains to match your specific model family.
Understanding the Edge Landscape
Before diving into model‑level tricks, it helps to map the constraints of typical edge platforms.
| Platform | CPU | GPU/Accelerator | RAM | Power | Typical Use‑Case |
|---|---|---|---|---|---|
| Raspberry Pi 5 | Quad‑core Cortex‑A76 @ 2.4 GHz | VideoCore VI (OpenGL ES) | 8 GB LPDDR4X | 5 W (typ.) | Home automation, voice assistants |
| NVIDIA Jetson Nano | Quad‑core ARM A57 @ 1.43 GHz | 128‑core Maxwell GPU | 4 GB LPDDR4 | 10 W (max) | Robotics, vision‑LLM combo |
| Coral Dev Board | Dual‑core Cortex‑A53 @ 1.5 GHz | Edge TPU (8 TOPS) | 1 GB LPDDR3 | 2 W | TinyML, keyword spotting |
| Apple Silicon M2 (Mac mini) | 8‑core CPU | 10‑core GPU | 16 GB unified | 15 W | Desktop‑class edge inference |
| Micro‑controller (e.g., STM32H7) | Cortex‑M7 @ 400 MHz | None | 2 MB SRAM | <0.5 W | Extremely low‑latency, offline NLP |
Key takeaways:
- Memory is often the first bottleneck. A 7 B‑parameter model in FP32 would need ≈28 GB of RAM—far beyond any edge device. Compression (quantization, pruning) is mandatory.
- CPU vs. GPU balance: Some devices lack a powerful GPU, so you must rely on SIMD‑optimized kernels or dedicated accelerators (Edge TPU, NPU).
- Power envelope: Real‑time inference must stay within the device’s thermal and power budget; otherwise you risk throttling or shutdown.
Understanding these constraints informs the subsequent decisions about model size, precision, and runtime.
Choosing the Right Small Language Model
Not all SLMs are created equal. When selecting a model for edge inference, consider the following criteria:
| Criterion | Why It Matters | Practical Guidance |
|---|---|---|
| Parameter Count | Directly impacts memory and compute | Target ≤ 8 B parameters for most edge CPUs; ≤ 2 B for micro‑controllers |
| Architecture Simplicity | Fewer exotic kernels = easier optimization | Prefer vanilla transformer blocks (no rotary embeddings, no complex attention masks) |
| Training Data & Domain | Determines downstream performance | Choose a model fine‑tuned on your target domain (e.g., code, medical, conversational) |
| Licensing | Edge deployments often involve redistribution | Verify permissive licenses (Apache 2.0, MIT) or commercial agreements |
| Community Tooling | Availability of conversion scripts, quantizers | Models with existing GGML, ONNX, or TensorRT pipelines reduce engineering effort |
Popular SLM Candidates in 2026
| Model | Parameters | Base Architecture | Notable Features |
|---|---|---|---|
| TinyLlama‑1‑3B | 3 B | LLaMA‑derived | Open weight release, good trade‑off for chat |
| Mistral‑7B‑Instruct‑v0.2‑Q4 | 7 B (quantized) | Mistral | 4‑bit quantization ready, instruction following |
| Gemma‑2B‑Instruct | 2 B | Gemma (Google) | Optimized for 2‑bit quantization, low latency |
| Phi‑1.5‑B | 1.5 B | Phi (Microsoft) | Smallest transformer‑based LLM with decent coherence |
| Custom Distilled LLM | 1‑2 B | Student of LLaMA‑13B | Tailored to your domain via knowledge distillation |
The sweet spot for most Raspberry Pi‑class deployments in 2026 is a 2‑3 B‑parameter model that can be quantized to 4‑bit or 8‑bit without severe quality loss.
Model Compression Techniques
Compressing a language model is an art of balancing size, speed, and accuracy. Below we outline the most effective methods and how to apply them in a reproducible pipeline.
4.1 Quantization
Quantization reduces the numeric precision of weights (and optionally activations). The two dominant approaches are:
| Method | Description | Typical Bit‑Width | Pros | Cons |
|---|---|---|---|---|
| Post‑Training Quantization (PTQ) | Direct conversion after training; often uses calibration data. | 8‑bit (INT8) or 4‑bit (INT4) | Fast, no retraining needed. | May cause > 5 % accuracy drop for some tasks. |
| Quantization‑Aware Training (QAT) | Simulates low‑precision during training; gradients flow through fake‑quant nodes. | 8‑bit (INT8) or 4‑bit (INT4) | Higher fidelity, especially for activation quantization. | Requires additional training epochs. |
Example: PTQ to 4‑bit with bitsandbytes
# Install the required packages
pip install torch transformers bitsandbytes==0.41.1
# Load a pretrained model (e.g., TinyLlama-1.3B)
python - <<'PY'
import torch, transformers, bitsandbytes as bnb
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "TinyLlama/TinyLlama-1.3B-Chat-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load in fp16 first (GPU needed for large models)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
# Convert to 4‑bit using bitsandbytes
model = bnb.nn.Int8Params.convert_to_int4(model, quant_type="nf4")
# Save the quantized checkpoint
model.save_pretrained("./tinyllama-1.3b-4bit")
tokenizer.save_pretrained("./tinyllama-1.3b-4bit")
print("Quantized model saved.")
PY
Tip: When targeting edge CPUs without GPU, you can perform PTQ on a workstation and ship the quantized checkpoint to the device. The resulting model size for a 3 B‑parameter LLM drops from ~6 GB (FP16) to ≈1 GB (4‑bit).
4.2 Pruning
Pruning removes unnecessary weights or entire attention heads. Two main categories:
| Type | Description | Typical Sparsity | Effect |
|---|---|---|---|
| Unstructured (weight‑level) | Randomly zeroes individual weights. | 30‑80 % | Requires sparse kernels for speed; otherwise only reduces storage. |
| Structured (head‑level, neuron‑level) | Removes entire attention heads or feed‑forward columns. | 20‑50 % | Directly accelerates inference because the computation graph shrinks. |
Example: Structured Head Pruning with optimum
pip install optimum[onnxruntime] transformers
python - <<'PY'
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM, OptimizationConfig
model_name = "Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the model in fp16 (GPU required for export)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
# Define an optimization config that prunes 30% of attention heads
opt_config = OptimizationConfig(prune_heads=0.3)
# Export to ONNX with pruning applied
ort_model = ORTModelForCausalLM.from_pretrained(
model,
export=True,
optimization_config=opt_config,
provider="CPUExecutionProvider"
)
ort_model.save_pretrained("./mistral-7b-pruned")
print("Pruned ONNX model saved.")
PY
Result: A 7 B‑parameter model with 30 % fewer heads typically shrinks inference time by ~15 % on CPUs while losing < 2 % BLEU on translation tasks.
4.3 Knowledge Distillation
Distillation transfers knowledge from a large teacher LLM to a smaller student. In the context of edge inference, you can:
- Fine‑tune a student on the teacher’s logits (soft targets) plus a small amount of human‑annotated data.
- Use frameworks like DistilBERT style training or TinyLlama recipes that already incorporate distillation.
Mini‑Distillation Script (PyTorch)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
teacher_name = "Meta/Llama-2-13b-chat-hf"
student_name = "TinyLlama-1.3B-Chat-v0.1"
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, torch_dtype=torch.float16).to("cuda")
student = AutoModelForCausalLM.from_pretrained(student_name, torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
# Simple dataset of prompts
prompts = ["Explain quantum computing in simple terms.", "Write a Python function for binary search."]
def collate_fn(batch):
inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to("cuda")
with torch.no_grad():
teacher_logits = teacher(**inputs).logits
return {"input_ids": inputs.input_ids, "labels": teacher_logits}
training_args = TrainingArguments(
output_dir="./student-distilled",
per_device_train_batch_size=2,
num_train_epochs=3,
learning_rate=5e-5,
fp16=True,
logging_steps=10,
save_steps=100,
)
trainer = Trainer(
model=student,
args=training_args,
train_dataset=prompts,
data_collator=collate_fn,
)
trainer.train()
student.save_pretrained("./tinyllama-distilled")
Distillation can reduce a 13 B teacher to a 2 B student with ≤ 3 % performance loss on most benchmarks.
4.4 Low‑Rank Factorization
Linear layers (Q, K, V, O, and feed‑forward) can be approximated by low‑rank matrix products:
W ≈ U * V where U ∈ ℝ^{d × r}, V ∈ ℝ^{r × d}, r << d
Libraries such as TensorRT’s svd plugin or OpenVINO’s LowRankDecomposition automate this. The benefit is reduced FLOPs and memory bandwidth.
Practical tip: Combine low‑rank factorization with 8‑bit quantization for the greatest size reduction—often achieving 2‑3× faster inference on ARM CPUs.
Efficient Model Formats for Edge
Choosing the right serialization format can dramatically affect load time, memory footprint, and runtime speed.
| Format | Primary Use‑Case | Advantages | Limitations |
|---|---|---|---|
| ONNX | Cross‑framework interchange, TensorRT & OpenVINO | Portable, supports graph optimizations, widely supported | Larger than GGML for raw weights; requires conversion tooling |
GGML (via llama.cpp) | CPU‑only inference on low‑resource devices | Extremely small binary, supports 4‑bit/8‑bit, minimal dependencies | No GPU acceleration |
| TensorRT Engine | NVIDIA Jetson / RTX devices | Highly optimized kernels, FP8 support, dynamic shapes | NVIDIA‑only, needs GPU |
| OpenVINO IR | Intel CPUs, VPUs (e.g., Myriad X) | CPU & VPU acceleration, quantization pipelines | Intel‑centric |
| MLC‑LLM | Mobile (iOS/Android) & Edge TPU | Supports 4‑bit, custom kernels, easy to integrate with Flutter | Still maturing, limited community support |
Converting a TinyLlama Model to GGML
# Clone llama.cpp (includes ggml tools)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build the convert tool (requires cmake)
mkdir build && cd build
cmake .. -DLLAMA_CLBLAST=ON
make -j$(nproc)
# Convert the model (assumes you have the .bin checkpoint)
./bin/convert-hf-to-ggml.py \
--model-dir ../tinyllama-1.3b-4bit \
--outfile ../tinyllama-1.3b-ggml-q4_0.bin \
--use-f32 0 \
--quantize q4_0
The resulting *.bin file is ≈1 GB and can be loaded on a Raspberry Pi with a single ./main -m tinyllama-1.3b-ggml-q4_0.bin command.
Runtime Optimizations
Even after compression, runtime inefficiencies can squander precious edge resources. Below are proven strategies.
1. Operator Fusion
Combine adjacent linear layers (e.g., Q‑K‑V projection) into a single kernel to reduce memory traffic. Frameworks like TensorRT, ONNX Runtime Graph Optimizer, and TVM can automatically fuse compatible operators.
# Example with ONNX Runtime (Python)
import onnxruntime as ort
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.intra_op_num_threads = 4 # Adjust for your CPU cores
session = ort.InferenceSession("model_pruned.onnx", sess_options)
2. SIMD & NEON Intrinsics
On ARM CPUs, leveraging NEON vector instructions can double throughput for matrix multiplications. Libraries such as xnnpack, gemmlowp, and mlc‑llm expose NEON‑accelerated kernels.
# Install xnnpack for Python (if available)
pip install xnnpack
3. Batch Size Tuning
Edge devices rarely handle massive batches, but a micro‑batch of 2‑4 tokens can hide latency through pipeline parallelism. Experiment with max_batch_size in TensorRT or ONNX Runtime.
4. Multi‑Threading & Affinity
Pin inference threads to dedicated CPU cores to avoid contention with OS tasks.
# Example using taskset on Linux
taskset -c 2,3 ./main -m model.bin -t 2 # Use cores 2 and 3, 2 threads
5. Memory Mapping (mmap)
Large binary weight files can be memory‑mapped to avoid loading the entire checkpoint at once.
// Minimal C snippet for mmap loading (pseudo)
int fd = open("model.bin", O_RDONLY);
void *data = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
The llama.cpp runtime already supports mmap for GGML files, allowing inference with < 2 GB RAM on a 4 GB device.
Deployment Pipelines for Edge Devices
A reproducible CI/CD pipeline ensures that model updates, security patches, and configuration changes roll out reliably across fleets.
1. Containerization vs. Static Binaries
| Approach | When to Use | Pros | Cons |
|---|---|---|---|
| Docker / OCI | Devices with OS support (e.g., Jetson, Raspberry Pi OS) | Easy versioning, dependency isolation | Overhead (≈200 MB image) |
| Static Binary | Minimalist Linux, bare‑metal, or micro‑controller | Tiny footprint (< 10 MB), fast boot | Requires manual dependency handling |
| OTA Packages | Large fleets, OTA update services (Balena, Mender) | Incremental diffs, rollback safety | Additional infrastructure required |
Example: GitHub Actions CI for Raspberry Pi
name: Edge Build
on:
push:
branches: [ main ]
jobs:
build-raspi:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up QEMU
uses: docker/setup-qemu-action@v2
- name: Build ARM64 Docker image
run: |
docker buildx create --use
docker buildx build \
--platform linux/arm64 \
-t ghcr.io/yourorg/tinyllama:latest \
.
- name: Push image
uses: docker/push-action@v2
with:
tags: ghcr.io/yourorg/tinyllama:latest
The resulting image can be pulled on the Pi with docker pull ghcr.io/yourorg/tinyllama:latest and run with:
docker run --rm -it \
--device /dev/vchiq \
-v /home/pi/models:/models \
ghcr.io/yourorg/tinyllama \
./run_inference.sh /models/tinyllama-ggml-q4_0.bin
2. Automated Model Refresh
Use Git LFS or an S3 bucket to store the latest quantized checkpoint. A lightweight cron job on the device checks for a version hash and downloads the new file if needed.
#!/usr/bin/env bash
MODEL_URL="https://my-bucket.s3.amazonaws.com/tinyllama-ggml-q4_0.bin"
HASH_FILE="/opt/llm/model.sha256"
# Fetch remote hash
REMOTE_HASH=$(curl -s ${MODEL_URL}.sha256)
if [[ -f "$HASH_FILE" && "$(cat $HASH_FILE)" == "$REMOTE_HASH" ]]; then
echo "Model up-to-date."
else
echo "Downloading new model..."
curl -O $MODEL_URL
echo "$REMOTE_HASH" > $HASH_FILE
fi
Real‑World Example: TinyLlama on a Raspberry Pi 5
Let’s walk through a complete, end‑to‑end deployment of a 3 B‑parameter TinyLlama model on a Raspberry Pi 5 using the GGML format and llama.cpp runtime.
Prerequisites
| Item | Version |
|---|---|
| OS | Raspberry Pi OS (64‑bit) |
| Python | 3.11 |
| GCC | 12.2 |
llama.cpp | latest master (2026‑03‑30) |
| Model | tinyllama-1.3b-ggml-q4_0.bin (4‑bit) |
Step 1: Install Build Tools
sudo apt update && sudo apt install -y build-essential cmake git python3-pip
pip3 install torch==2.2.0 transformers==4.38.2 bitsandbytes==0.41.1
Step 2: Clone and Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=OFF -DLLAMA_NATIVE=ON -DLLAMA_AVX=ON
make -j$(nproc)
Step 3: Convert the Model (if you have the original HF checkpoint)
cd ../..
python3 - <<'PY'
from transformers import AutoModelForCausalLM, AutoTokenizer
import subprocess, os, sys
model_id = "TinyLlama/TinyLlama-1.3B-Chat-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype="auto",
device_map="cpu"
)
# Export to GGML (4‑bit)
subprocess.run([
"python3", "llama.cpp/convert_hf_to_ggml.py",
"--model-dir", "./tinyllama",
"--outfile", "./tinyllama-ggml-q4_0.bin",
"--use-f32", "0",
"--quantize", "q4_0"
])
PY
Step 4: Run Inference
cd llama.cpp/build
./main -m ../../tinyllama-ggml-q4_0.bin -p "Explain the difference between supervised and reinforcement learning."
Expected output (first 2 sentences):
Supervised learning involves training a model on a dataset where each input is paired with a correct output (label). The model learns to map inputs to outputs by minimizing a loss function that measures prediction error.
Performance Metrics
| Metric | Value on Raspberry Pi 5 |
|---|---|
| Model size | 1.02 GB (4‑bit) |
| Peak RAM usage | 1.6 GB (including tokenizer) |
| Average latency (prompt 30 tokens → 50 tokens) | 210 ms |
| Power draw (inference only) | 3.2 W |
These numbers illustrate that a 3 B‑parameter SLM can comfortably run under the Pi’s power envelope while delivering sub‑250 ms latency—acceptable for many interactive applications.
Monitoring, Profiling, and Debugging
Maintaining performance over time requires visibility into runtime behavior.
1. System‑Level Monitoring
htop/top– Quick CPU and memory view.powertop– Identify power hotspots on ARM.vcgencmd get_throttled– Detect thermal throttling on Raspberry Pi.
2. Framework Profilers
| Tool | Platform | How to Use |
|---|---|---|
| PyTorch Profiler | GPU/CPU | Wrap inference in torch.profiler.profile and export to Chrome trace. |
| ONNX Runtime Profiler | CPU/GPU | Set session_options.enable_profiling = True. |
TensorRT trtexec | NVIDIA Jetson | trtexec --loadEngine=model.trt --batch=1 --duration=10 |
Example: ONNX Runtime Profiling (Python)
import onnxruntime as ort, json, os
sess_opts = ort.SessionOptions()
sess_opts.enable_profiling = True
session = ort.InferenceSession("model_pruned.onnx", sess_opts)
# Run a dummy inference
inputs = {"input_ids": np.array([[101, 2023, 2003, 1037, 2742]])}
session.run(None, inputs)
# Retrieve profile file
profile_file = session.end_profiling()
with open(profile_file) as f:
profile = json.load(f)
print("Top 5 kernels by time:")
for entry in sorted(profile, key=lambda x: x["duration"], reverse=True)[:5]:
print(entry["name"], f"{entry['duration']:.2f} µs")
3. Logging & Alerting
- Prometheus node exporter + Grafana dashboards for CPU, RAM, and temperature.
- Alertmanager rules to trigger when inference latency > 300 ms for more than 5 minutes.
Security & Privacy Considerations
Running LLMs locally reduces data exposure, but other vectors remain:
- Model Theft – Distribute models as encrypted archives; decrypt at runtime using a hardware‑bound key (e.g., TPM, Secure Enclave).
- Supply‑Chain Integrity – Verify checksums (SHA‑256) of model files before loading; use signed manifests.
- Sandboxed Execution – Run inference in an isolated container (Docker) or with Linux namespaces to limit file system access.
- Inference‑Time Data Leakage – Prevent adversarial prompts that could cause the model to output proprietary training data. Apply prompt sanitization and output filtering (e.g., regex or a secondary classifier).
Example: Simple Model Encryption with openssl
# Encrypt (run on CI server)
openssl aes-256-cbc -salt -in tinyllama-ggml-q4_0.bin -out tinyllama.enc -k $MODEL_KEY
# Decrypt on device (key stored in TPM)
MODEL_KEY=$(tpm2_getrandom 32 | base64)
openssl aes-256-cbc -d -in tinyllama.enc -out tinyllama-ggml-q4_0.bin -k $MODEL_KEY
Looking Ahead: 2026 Trends in Edge LLMs
| Trend | Impact on Edge Deployment |
|---|---|
| FP8 & INT4 Hardware Support | New ARM Cortex‑X series and Intel Meteor Lake chips include native FP8 matrix units, shaving another 2× inference speed for quantized models. |
| TinyML‑LLM Co‑Design | Researchers are training sub‑100 M‑parameter transformer‑style models that fit into micro‑controller flash (e.g., tinybert‑llm). Expect more open‑source releases. |
| Federated Model Updates | Edge devices can now contribute gradient updates without exposing raw data, enabling continuous improvement while preserving privacy. |
| Compiler‑Driven Auto‑Tuning | TVM’s new “EdgeLLM” auto‑scheduler can generate device‑specific kernels in minutes, dramatically reducing manual optimization effort. |
| Standardized Edge LLM Formats | The Open Edge LLM (OELL) consortium is finalizing a binary spec that unifies GGML, ONNX, and MLC‑LLM, simplifying cross‑platform deployment. |
Staying current with these trends ensures that the optimizations you apply today remain relevant tomorrow.
Conclusion
Optimizing small language models for local edge inference is no longer an academic curiosity—it’s a practical necessity for latency‑critical, privacy‑sensitive, and offline‑first applications. By selecting the right model, compressing it intelligently (quantization, pruning, distillation), choosing an efficient runtime format, and tuning the inference engine (fusion, SIMD, threading), you can achieve sub‑250 ms response times on devices as modest as a Raspberry Pi 5.
The workflow presented—complete with code snippets, deployment pipelines, and monitoring strategies—offers a reproducible blueprint for developers aiming to bring conversational AI to the edge. As hardware advances (FP8 units, dedicated NPU accelerators) and community standards mature (OELL, TinyML‑LLM), the barrier to ship powerful language capabilities locally will continue to shrink.
Embrace the edge, protect your users’ data, and unlock new product categories that were previously impossible under a cloud‑only paradigm. Happy optimizing!
Resources
ONNX Runtime Documentation – Comprehensive guide to graph optimizations, quantization, and profiling.
ONNX Runtime Docsllama.cpp GitHub Repository – Reference implementation for GGML format, 4‑bit quantization, and Raspberry Pi deployment.
llama.cpp on GitHubNVIDIA TensorRT Guide for Jetson – Detailed steps for building and deploying TensorRT engines on Jetson devices.
TensorRT on Jetson DocsBitsandbytes Library – Efficient 4‑bit and 8‑bit quantization utilities for PyTorch models.
bitsandbytes GitHubOpenVINO Model Optimizer – Tools for converting, quantizing, and deploying models on Intel CPUs/VPUs.
OpenVINO ToolkitTinyML Foundation – Resources and papers on sub‑100 M parameter models for micro‑controllers.
TinyML.org