Unsloth has quickly become one of the most practical ways to fine‑tune large language models (LLMs) efficiently on modest GPUs. It wraps popular open‑source models (like Llama, Mistral, Gemma, Phi) and optimizes training with techniques such as QLoRA, gradient checkpointing, and fused kernels—often cutting memory use by 50–60% and speeding up training significantly.
This guide walks you from zero to production:
- Understanding what Unsloth is and when to use it
- Setting up your environment
- Preparing your dataset for instruction tuning
- Loading and configuring a base model with Unsloth
- Fine‑tuning with LoRA/QLoRA step by step
- Evaluating the model
- Exporting and deploying to production (vLLM, Hugging Face, etc.)
- Practical tips and traps to avoid
All examples use Python and the Hugging Face ecosystem.
Table of contents
- Introduction
- 1. What is Unsloth and Why Use It?
- 2. Prerequisites and Environment Setup
- 3. Choosing a Base Model and Use Case
- 4. Preparing Your Dataset (Zero to Clean SFT Data)
- 5. Loading a Model with Unsloth
- 6. Training: Step-by-Step Fine-Tuning
- 7. Evaluating Your Fine-Tuned Model
- 8. Saving, Merging, and Publishing Models
- 9. Deploying to Production
- 10. Best Practices and Common Pitfalls
- Conclusion
1. What is Unsloth and Why Use It?
Unsloth is an open‑source library focused on fast, memory‑efficient fine‑tuning of LLMs, especially using LoRA/QLoRA. It builds on top of the Hugging Face ecosystem and integrates tightly with transformers and trl (for supervised fine‑tuning).
Key advantages:
- Lower VRAM: QLoRA with 4‑bit quantization lets you fine‑tune 7–8B models on 1× 16–24 GB GPU, sometimes smaller.
- Speed: Custom kernels and optimizations often yield ~2× faster fine‑tuning versus a naïve transformers setup.
- Simple API: A couple of calls (FastLanguageModel.from_pretrained and .get_peft_model) give you a ready‑to‑train model.
- Compatible: Works with popular checkpoints from Hugging Face (Llama, Mistral, Gemma, Phi, Qwen, etc.—check the Unsloth docs for the current list).
Use Unsloth when:
- You want instruction‑tuning or domain adaptation of an existing open‑source LLM.
- You have limited GPU memory but still want to train 7B+ models.
- You prefer a Hugging Face‑centric workflow.
2. Prerequisites and Environment Setup
2.1 Hardware requirements
For a smooth experience:
- GPU:
- 7B models: 1× 16–24 GB GPU (e.g., RTX 4090, A10G, A5000, L4, A100 40GB, etc.)
- 13B+ models: 1× 24–40 GB GPU or multiple GPUs (or aggressive QLoRA settings)
- CPU: Any modern multi‑core CPU; more cores help with data pre‑processing.
- RAM: 16 GB+ recommended, depending on dataset size.
- Disk: At least 40–80 GB free if you download multiple models and save checkpoints.
2.2 Python environment
Use a fresh virtual environment or Conda:
python -m venv .venv
source .venv/bin/activate # Linux / macOS
# .venv\Scripts\activate # Windows (PowerShell)
Use Python 3.9–3.11 for best compatibility.
2.3 Installing Unsloth and dependencies
Install Unsloth plus core libraries:
pip install --upgrade pip
# Install Unsloth (choose the extra matching your setup per Unsloth docs)
pip install "unsloth[full]" # or "unsloth[cuda]" / "unsloth[cpu]" depending on their README
# Hugging Face + TRL + evaluation
pip install "transformers>=4.40.0" "datasets>=2.18.0" "accelerate>=0.27.0" "trl>=0.8.0"
pip install bitsandbytes einops sentencepiece
Note
The exact extras for Unsloth (e.g., CUDA versions) may change. Always check the Unsloth GitHub repo for the current recommended install command.
Verify GPU visibility:
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
3. Choosing a Base Model and Use Case
Before touching code, clarify:
Use case
- Customer support/chatbot
- Code assistant
- Domain‑specific Q&A (finance, medical, legal, etc.)
- Tool‑using agent / RAG reasoning booster
Language & size constraints
- English only vs multilingual
- Max latency & memory budget in production
- Open‑weight licensing requirements (e.g., commercial use)
Base model selection
Common choices (check Unsloth docs for supported variants):
- General chat: Llama‑3‑8B, Llama‑2‑7B, Mistral‑7B
- Code: CodeLlama, StarCoder‑like models (if/when supported)
- Low‑resource / compact: Phi‑3, Gemma‑2B/7B
For this tutorial, assume:
- Use case: General instruction‑following chatbot
- Base model: A 7–8B general model (e.g., a Llama‑style chat model on Hugging Face)
You can substitute your own model_name later.
4. Preparing Your Dataset (Zero to Clean SFT Data)
4.1 Recommended data format
For supervised fine‑tuning (SFT), a practical schema is:
{
"instruction": "Explain the concept of overfitting in simple terms.",
"input": "",
"output": "Overfitting happens when a model memorizes the training data instead of learning general patterns..."
}
Fields:
- instruction: user request or task description
- input: optional extra context (docs, passages, question body)
- output: ideal assistant answer
Store this as JSONL (.jsonl, one record per line) or as a Hugging Face dataset.
Example layout:
data/
train.jsonl
validation.jsonl
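If your raw data starts out as Python objects (scraped records, CSV rows, support tickets), a few lines are enough to write it into this JSONL layout. A minimal sketch; the records list is a placeholder for your own data source:
import json
# Placeholder records; replace with your own data source
records = [
    {
        "instruction": "Explain the concept of overfitting in simple terms.",
        "input": "",
        "output": "Overfitting happens when a model memorizes the training data...",
    },
]
# One JSON object per line = JSONL
with open("data/train.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")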
4.2 Example: loading and cleaning with datasets
from datasets import load_dataset
# If you have local JSONL files
dataset = load_dataset(
"json",
data_files={
"train": "data/train.jsonl",
"validation": "data/validation.jsonl",
},
)
print(dataset)
print(dataset["train"][0])
Optional cleaning steps (recommended):
- Drop very long or very short examples
- Normalize whitespace
- Remove duplicates by hashing (instruction, input, output); see the sketch after the filtering code below
def clean_example(ex):
ex["instruction"] = ex["instruction"].strip()
ex["input"] = ex.get("input", "").strip()
ex["output"] = ex["output"].strip()
return ex
dataset = dataset.map(clean_example)
# Filter by output length
def is_reasonable_length(ex, min_len=10, max_len=2048):
n = len(ex["output"].split())
return (n >= min_len) and (n <= max_len)
dataset = dataset.filter(is_reasonable_length)
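For the deduplication step mentioned above, one simple approach is to hash the (instruction, input, output) triple and keep only the first occurrence. A minimal sketch (run the filter single‑process so the set of seen hashes is shared):
import hashlib
seen_hashes = set()
def is_unique(ex):
    # Hash the full (instruction, input, output) triple
    key = "\x1f".join([ex["instruction"], ex.get("input", ""), ex["output"]])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True
dataset = dataset.filter(is_unique)  # default num_proc=1 keeps `seen_hashes` consistent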
4.3 Formatting prompts for chat models
Modern chat models expect special tokens and structured prompts, e.g.:
<|user|>
{instruction + optional input}
<|assistant|>
{output}
The exact template varies by model. Many Unsloth examples use something like:
SYSTEM_PROMPT = "You are a helpful, concise AI assistant."
def make_prompt(instruction, input_text=None, output=None):
user_content = instruction if not input_text else f"{instruction}\n\nInput:\n{input_text}"
text = f"""<|system|>
{SYSTEM_PROMPT}
<|user|>
{user_content}
<|assistant|>
"""
if output is not None:
text += output
return text
We’ll use this structure to generate a single text field the trainer can consume.
def formatting_prompts_func(examples):
texts = []
for inst, inp, out in zip(
examples["instruction"],
examples.get("input", [""] * len(examples["instruction"])),
examples["output"],
):
texts.append(make_prompt(inst, inp, out))
return {"text": texts}
dataset = dataset.map(formatting_prompts_func, batched=True)
After this, each sample has a text string that contains the full conversation.
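Alternatively, many chat models ship a chat template with their tokenizer, and transformers can render the special tokens for you. A sketch using the standard apply_chat_template API; it assumes the tokenizer (loaded in section 5) defines a template and accepts a system role, which not every model does:
def formatting_with_chat_template(examples):
    texts = []
    for inst, inp, out in zip(examples["instruction"], examples["input"], examples["output"]):
        user_content = inst if not inp else f"{inst}\n\nInput:\n{inp}"
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": out},
        ]
        # Render the conversation using the model's own special tokens
        texts.append(tokenizer.apply_chat_template(messages, tokenize=False))
    return {"text": texts}
# Use this instead of the formatting_prompts_func map above:
# dataset = dataset.map(formatting_with_chat_template, batched=True)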
5. Loading a Model with Unsloth
5.1 Quickstart: QLoRA with Unsloth
The central API in Unsloth is FastLanguageModel.
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # or 4096 depending on your GPU
dtype = None # let Unsloth pick (bfloat16/float16)
load_in_4bit = True # QLoRA: 4-bit base model
model_name = "unsloth/llama-3-8b-bnb-4bit" # EXAMPLE; use a supported model from Unsloth docs
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=model_name,
max_seq_length=max_seq_length,
dtype=dtype,
load_in_4bit=load_in_4bit,
)
Important
The model_name should be a quantized checkpoint compatible with Unsloth, often provided directly by the Unsloth team or as recommended in their docs (e.g., unsloth/<model-name>-bnb-4bit). Using arbitrary models may fail or be sub‑optimal.
5.2 Configuring LoRA adapters
Next, wrap the base model with LoRA/QLoRA adapters. This is where Unsloth applies many optimizations (e.g., gradient checkpointing, rank‑stabilized LoRA).
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank (8–64 are typical)
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha=16,
lora_dropout=0.05,
bias="none",
use_gradient_checkpointing="unsloth", # saves memory
random_state=3407,
use_rslora=True, # often improves stability
loftq_config=None, # leave as default unless you know you need it
)
Parameters you’ll tune most often:
- r: controls adapter capacity; higher values can fit more complex tasks but use more memory.
- target_modules: which attention/MLP modules get LoRA weights. The default list above is a good start for decoder LLMs.
- lora_dropout: a small non‑zero value (0.05–0.1) can help regularization.
- use_gradient_checkpointing: enables memory savings at the cost of extra compute; recommended for tight VRAM budgets.
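A quick sanity check after this step: confirm that only a small fraction of the parameters is trainable. The PEFT‑wrapped model returned by get_peft_model typically exposes a helper for this:
# Should report trainable params in the tens of millions for r=16 on a 7-8B model
model.print_trainable_parameters()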
6. Training: Step-by-Step Fine-Tuning
We’ll use the TRL SFTTrainer, which integrates nicely with Unsloth models for supervised fine‑tuning.
6.1 Training arguments
from trl import SFTTrainer
from transformers import TrainingArguments
BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 4
EPOCHS = 3
LEARNING_RATE = 2e-4
training_args = TrainingArguments(
output_dir="outputs/unsloth-llama3-chat",
per_device_train_batch_size=BATCH_SIZE,
gradient_accumulation_steps=GRAD_ACCUM_STEPS,
num_train_epochs=EPOCHS,
learning_rate=LEARNING_RATE,
max_grad_norm=1.0,
warmup_ratio=0.03,
lr_scheduler_type="cosine",
logging_steps=10,
save_strategy="epoch",
evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers versions
report_to="none", # or "wandb" / "tensorboard"
bf16=torch.cuda.is_bf16_supported(),
fp16=not torch.cuda.is_bf16_supported(),
optim="paged_adamw_32bit", # bitsandbytes optimizer often used with QLoRA
)
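With these values, each optimizer step sees an effective batch of per_device_train_batch_size × gradient_accumulation_steps = 4 × 4 = 16 sequences per GPU. If you shrink the batch size to fit memory, increase gradient_accumulation_steps to keep this product roughly constant.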
6.2 Running the training loop
Create the trainer using our text field:
train_dataset = dataset["train"]
eval_dataset = dataset.get("validation")
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
dataset_text_field="text",
max_seq_length=max_seq_length,
packing=True, # packs multiple samples per sequence for efficiency
args=training_args,
)
Then train:
trainer.train()
This will:
- Stream your dataset, tokenize, and pack samples into sequences of up to max_seq_length.
- Fine‑tune only the LoRA parameters (the base 4‑bit model stays frozen).
- Log losses and evaluation metrics per epoch.
After training completes, you’ll see checkpoints in outputs/unsloth-llama3-chat.
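If a run is interrupted, you can usually resume from the latest checkpoint in output_dir instead of restarting:
# Resume from the most recent checkpoint written to `output_dir`
trainer.train(resume_from_checkpoint=True)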
6.3 Monitoring and debugging training
A few sanity checks during training:
Loss decreasing?
- Training loss should fall steadily in the first epoch.
- Validation loss typically decreases but may plateau or rise if overfitting.
Memory issues (CUDA out of memory)
- Reduce per_device_train_batch_size.
- Reduce max_seq_length.
- Use use_gradient_checkpointing="unsloth" (already set).
- Lower r (LoRA rank).
Speed
- Try packing=True (already on) to reduce padding overhead.
- Use bf16 if supported; it often gives good speed vs fp16.
Example: quick manual evaluation after each epoch:
def chat(model, tokenizer, prompt, max_new_tokens=256):
model.eval()
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True,
top_p=0.9,
temperature=0.7,
)
return tokenizer.decode(outputs[0], skip_special_tokens=True)
test_prompt = make_prompt("Explain QLoRA in one paragraph.", "", None)
print(chat(model, tokenizer, test_prompt))
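Generation with the training‑mode model can be slow. Recent Unsloth versions expose a fast inference toggle; a sketch, assuming your version provides for_inference/for_training (check the docs):
FastLanguageModel.for_inference(model)   # enable the optimized generation path
print(chat(model, tokenizer, test_prompt))
FastLanguageModel.for_training(model)    # switch back before continuing training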
7. Evaluating Your Fine-Tuned Model
For a production‑oriented project, you need structured evaluation beyond just checking a few prompts.
Approaches:
Manual eval / rubric
- Prepare a fixed set of ~50–200 prompts that represent real user queries.
- Have domain experts rate responses on criteria (correctness, helpfulness, safety).
Automatic metrics
- For classification‑like tasks: accuracy / F1.
- For QA: exact match / F1 against reference answers.
- For generation/style tasks, use model‑based annotators (e.g., LLM‑as‑a‑judge) with care.
Benchmark frameworks
- Use lm-eval-harness or similar to run standard tasks (if relevant).
Example minimal evaluation loop for a custom dataset:
import random
eval_samples = random.sample(list(eval_dataset), k=min(50, len(eval_dataset)))
for ex in eval_samples[:5]:
prompt = make_prompt(ex["instruction"], ex.get("input", None), None)
response = chat(model, tokenizer, prompt)
print("INSTRUCTION:", ex["instruction"])
print("EXPECTED:", ex["output"][:300], "...")
print("MODEL:", response[:300], "...")
print("=" * 80)
Document your evaluation procedure and results; this becomes crucial for regression testing and governance when you iterate models.
8. Saving, Merging, and Publishing Models
8.1 Saving LoRA adapters
The cheapest way to store and ship a fine‑tuned model is:
- Keep the original base model as is (from Hugging Face or Unsloth).
- Save only your LoRA adapter weights + tokenizer.
adapter_dir = "outputs/unsloth-llama3-chat-lora"
model.save_pretrained(adapter_dir)
tokenizer.save_pretrained(adapter_dir)
This directory now contains the LoRA adapter configuration and weights. To use it later:
from unsloth import FastLanguageModel
base_model_name = model_name # same as used earlier
base_model, tokenizer = FastLanguageModel.from_pretrained(
model_name=base_model_name,
max_seq_length=max_seq_length,
dtype=dtype,
load_in_4bit=True,
)
base_model = FastLanguageModel.load_peft_model(
base_model,
adapter_dir,
)
Note
The exact helper for loading PEFT/LoRA adapters may differ by Unsloth version. If load_peft_model isn't available, you can fall back to standard PEFT APIs or check the current Unsloth docs.
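As a concrete fallback, the standard PEFT API can attach the saved adapter to the loaded base model; a sketch assuming the adapter directory from above:
from peft import PeftModel
# Attach the saved LoRA adapter to the already-loaded 4-bit base model
base_model = PeftModel.from_pretrained(base_model, adapter_dir)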
8.2 (Optional) Merging LoRA into a full model
Sometimes you want a fully merged model (no external adapter), e.g., for deployment environments that don’t support LoRA, or for exporting to certain formats.
Unsloth provides FastLanguageModel.for_inference(model) to switch a trained model into an optimized generation mode; for a fully merged checkpoint, recent versions also expose dedicated save helpers (check the docs for your version). A merged export typically:
- Folds the LoRA weights into the base model
- Stores the result in half/bfloat16
- Can be loaded as a plain transformers model
A typical pattern (conceptual):
from unsloth import FastLanguageModel
# Switch the model into Unsloth's optimized inference mode
FastLanguageModel.for_inference(model)
# Save the result; depending on your Unsloth version, a dedicated merged-save
# helper may be needed to get a standalone HF `transformers` model (see docs)
merged_dir = "outputs/unsloth-llama3-chat-merged"
model.save_pretrained(merged_dir, safe_serialization=True)
tokenizer.save_pretrained(merged_dir)
Always check the current Unsloth docs for the recommended way to merge and export; APIs may evolve.
8.3 Pushing to the Hugging Face Hub
Publishing to the Hub makes production deployment and collaboration easier.
- Log in:
huggingface-cli login
- Push from Python:
from huggingface_hub import HfApi, create_repo
repo_id = "your-username/llama3-chat-unsloth-lora"
create_repo(repo_id, private=True, exist_ok=True)
# If you're saving LoRA‑only or merged, just push that directory:
api = HfApi()
api.upload_folder(
folder_path=adapter_dir, # or merged_dir
repo_id=repo_id,
commit_message="Initial Unsloth fine-tuned model",
)
If you pushed a merged model, it can now be loaded from anywhere with:
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
(For a LoRA‑only repo, load the base model first and attach the adapter with PEFT's PeftModel.from_pretrained, or load via Unsloth if you exported an Unsloth‑friendly format.)
9. Deploying to Production
Once the model is evaluated and saved, you need a reliable, scalable inference stack. Two common routes:
9.1 Serving with vLLM
vLLM is a fast inference engine that supports many HF models and exposes an OpenAI‑compatible API. Serving the merged model is the simplest path; serving raw LoRA adapters requires vLLM's LoRA support (see its docs).
- Install:
pip install "vllm>=0.4.0"
- Start an API server:
python -m vllm.entrypoints.openai.api_server \
--model your-username/llama3-chat-unsloth-lora-or-merged \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16
- Call it from your application:
import requests
import json
API_URL = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
payload = {
"model": "your-username/llama3-chat-unsloth-lora-or-merged",
"messages": [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Explain the benefits of QLoRA briefly."},
],
"temperature": 0.7,
"max_tokens": 256,
}
resp = requests.post(API_URL, headers=headers, data=json.dumps(payload))
print(resp.json()["choices"][0]["message"]["content"])
Deploy this container behind:
- An HTTP load balancer
- Authentication (API keys, OAuth, etc.)
- Logging and observability stack (Prometheus, Grafana, ELK, etc.)
9.2 Hugging Face Inference Endpoints
For a fully managed deployment:
- Push model to Hugging Face Hub (as above).
- In the HF UI, create an Inference Endpoint for your repo.
- Choose hardware (GPU type) and auto‑scaling.
- Hit the endpoint’s OpenAI‑style or HF‑style API from your app.
Pros:
- No infra to manage
- Good for production with SLAs
Cons:
- Ongoing cost
- Less control than self‑hosted vLLM
9.3 Basic production checklist
Before calling it “production”:
- Load testing: evaluate latency and throughput (e.g., Locust, k6); a minimal Locust sketch follows this checklist.
- Autoscaling: scale out under load (Kubernetes HPA, HF autoscaling, etc.).
- Monitoring:
- Latency (p50, p90, p99)
- GPU/CPU usage
- Error rates (timeouts, OOMs)
- Safety: content filters or guardrails if your use case demands it.
- Versioning: tag your model + app versions; keep rollback paths.
- Canary releases: route a small % of traffic to new versions first.
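For the load‑testing item, a minimal Locust sketch against the vLLM endpoint from section 9.1 (URL, model name, and payload are placeholders to adapt):
# locustfile.py — run with: locust -f locustfile.py --host http://localhost:8000
from locust import HttpUser, task, between

class ChatUser(HttpUser):
    wait_time = between(0.5, 2.0)

    @task
    def chat_completion(self):
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "your-username/llama3-chat-unsloth-lora-or-merged",
                "messages": [{"role": "user", "content": "Give me one tip for writing clean Python."}],
                "max_tokens": 64,
            },
        )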
10. Best Practices and Common Pitfalls
Data & training
Garbage in, garbage out
- The single biggest determinant of quality is dataset quality.
- Remove toxic, outdated, or misleading examples for safety and correctness.
Domain balance
- If you add a lot of domain‑specific data (e.g., finance), ensure mixed general data if you still want broad capabilities.
Sequence length
- Don’t set max_seq_length to 8k just because it’s possible. Longer contexts are slower and need more VRAM.
- Match it to realistic usage (2k–4k is enough for many assistant use cases).
Hyperparameters
- Start with conservative settings:
  - learning_rate: 2e‑4 or 1e‑4
  - r: 8–16
  - epochs: 1–3 (longer if the dataset is small)
- Monitor for overfitting: if validation loss keeps climbing, reduce epochs or add regularization (increase lora_dropout a bit).
Unsloth specifics
Use official example configs
- Unsloth’s README and notebooks include recommended parameters for each model family; start there and then adjust.
CUDA / bitsandbytes compatibility
- If you see weird 4‑bit errors, ensure your CUDA, PyTorch, and bitsandbytes versions match the Unsloth recommendations.
Gradient checkpointing
- Great for memory, but increases compute. If training is too slow and you have VRAM headroom, try disabling it.
Deployment
Keep LoRA separate if you want:
- Small artifacts to ship
- Ability to swap base models underneath
Merge weights if you:
- Need compatibility with tools that don’t support PEFT/LoRA
- Want a single self‑contained model
Cache warming:
- Run a few warm‑up requests before taking traffic to avoid cold‑start latency spikes.
Conclusion
Fine‑tuning LLMs used to require multi‑GPU clusters and complex engineering. Unsloth, combined with LoRA/QLoRA and the Hugging Face ecosystem, makes it feasible to:
- Start with a strong open‑source base model
- Prepare a relatively small, clean instruction dataset