Introduction
As large language models (LLMs) have grown into the tens and hundreds of billions of parameters, full fine‑tuning has become prohibitively expensive for most practitioners. Two techniques—LoRA and QLoRA—have emerged as leading approaches for parameter-efficient fine‑tuning (PEFT), enabling high‑quality adaptation on modest hardware.
They are related but distinct:
- LoRA (Low-Rank Adaptation) introduces small trainable matrices on top of a frozen full‑precision model.
- QLoRA combines 4‑bit quantization of the base model with LoRA adapters, making it possible to fine‑tune very large models (e.g., a 65B model on a single 48 GB GPU).
This article walks through:
- What LoRA is and how it works
- What QLoRA is and how it differs from “plain” LoRA
- Memory/computation trade‑offs
- When to use LoRA vs QLoRA
- Practical implementation with Hugging Face Transformers & PEFT
1. Why Parameter‑Efficient Fine‑Tuning?
Before comparing LoRA and QLoRA, it helps to understand the core problem they solve.
1.1 The cost of full fine‑tuning
For a standard transformer LLM:
- Parameters: billions (e.g., 7B, 13B, 70B)
- Storage: fp16 or bf16 weights (2 bytes/parameter)
- Training:
- All parameters updated
- All optimizer states stored (often 2–4× model size)
Example: a 13B parameter model in fp16
- Weights: 13B × 2 bytes ≈ 26 GB
- Adam optimizer states (m and v): 2 × 26 GB = 52 GB
- Gradients: another ≈ 26 GB
So just for one 13B model, you might need ~100+ GB of GPU memory to fine‑tune with Adam. This is infeasible on a single GPU and cumbersome even on multi‑GPU setups.
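As a rough sanity check, here is the same back‑of‑the‑envelope arithmetic in a few lines of Python. It mirrors the fp16 assumption above and ignores activations, embeddings, and framework overhead, so treat it as an estimate only:
def full_finetune_memory_gb(n_params, bytes_per_param=2):
    """Very rough estimate: weights + gradients + Adam's m and v states."""
    weights = n_params * bytes_per_param
    grads = n_params * bytes_per_param
    # Follows the fp16 assumption above; in practice Adam states are often fp32 (even larger)
    adam_states = 2 * n_params * bytes_per_param
    return (weights + grads + adam_states) / 1e9

print(full_finetune_memory_gb(13e9))  # ~104 GB for a 13B model, matching the estimate above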
1.2 Parameter‑efficient methods
Parameter‑efficient methods aim to:
- Freeze the bulk of the model weights
- Add a small number of trainable parameters
- Reduce memory and compute, while matching or approaching full‑tuning quality
LoRA and QLoRA are two such methods, but they target slightly different bottlenecks:
- LoRA: reduce the number of trainable parameters and optimizer state
- QLoRA: also reduce the memory footprint of the frozen base model via quantization
2. LoRA: Low‑Rank Adaptation
2.1 High‑level idea
LoRA (Hu et al., 2021) modifies a pre‑trained model by introducing low‑rank adapters into certain weight matrices (typically attention and sometimes MLP layers). Instead of updating the full matrix ( W \in \mathbb{R}^{d_\text{out} \times d_\text{in}} ), LoRA learns a low‑rank update:
[ W' = W + \Delta W, \quad \Delta W = B A ]
where:
- ( W ): frozen pre‑trained weight matrix
- ( A \in \mathbb{R}^{r \times d_\text{in}} )
- ( B \in \mathbb{R}^{d_\text{out} \times r} )
- ( r ): low rank (e.g., 4, 8, 16)
Only A and B are trained; ( W ) remains fixed.
This drastically reduces the number of trainable parameters, especially for large ( d_\text{in}, d_\text{out} ).
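To make the mechanics concrete, here is a minimal, illustrative LoRA wrapper in PyTorch. It is a sketch of the math above, not the PEFT implementation (which handles dropout, merging, and multiple adapters more carefully):
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weight (and bias)
            p.requires_grad_(False)
        self.scaling = alpha / r
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init: W' == W at start

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
Because B starts at zero, the adapted layer initially reproduces the frozen model exactly; training only moves the low‑rank update.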
2.2 Where LoRA is applied
In transformer LLMs, LoRA is typically applied to:
- Attention projection matrices (e.g., q_proj, k_proj, v_proj, o_proj)
- Sometimes MLP projections (gate_proj, up_proj, down_proj)
The intuition:
- Attention layers strongly influence how the model routes information
- A low‑rank update here can capture task‑specific adaptations cost‑effectively
2.3 Memory and compute benefits
With LoRA:
- Base model weights are unchanged and stored in standard precision (fp16/bf16).
- Trainable parameters are limited to the LoRA adapters (often <1–5% of full model).
- Optimizer states scale with adapter size, not full model size.
This yields:
- Much lower GPU memory for training (vs full fine‑tuning)
- Faster training and iteration
- Easy task‑specific adapters (you can store/swap only the adapters)
However, the full‑precision base model still needs to fit in GPU memory.
2.4 Pros and cons of LoRA
Pros
- Significant reduction in trainable parameters and optimizer memory
- Near full‑fine‑tuning performance for many tasks
- Modular: you can train and distribute only the adapter weights
- Well supported by libraries (Hugging Face PEFT, etc.)
Cons
- Base model remains full precision; large models (e.g., 65B) still don’t fit on a single consumer GPU
- Memory footprint is still dominated by frozen base weights
- Inference memory/time is similar to base model + small overhead (unless you merge adapters)
3. QLoRA: Quantized LoRA
3.1 What is QLoRA?
QLoRA (Dettmers et al., 2023) extends LoRA by:
- Quantizing the pre‑trained model to 4‑bit precision (usually NF4)
- Keeping those 4‑bit weights frozen during training
- Adding LoRA adapters on top, which are trained in higher precision (e.g., 16‑bit)
- Using techniques like double quantization and paged optimizers to further save memory
The key insight: you can keep the large base model in low‑precision, but keep the updates (LoRA adapters) in higher precision so training is stable and accurate.
3.2 4‑bit quantization (NF4)
QLoRA uses a special 4‑bit quantization data type:
- NF4 (NormalFloat4): a quantization scheme tailored to approximate a normal distribution, empirically better than uniform or standard int4 for LLM weights.
- Implementation commonly via bitsandbytes (e.g., bnb.nn.Linear4bit).
This reduces the memory needed for the frozen base parameters by roughly 4× compared to fp16:
- fp16: 16 bits per parameter
- 4‑bit: 4 bits per parameter
So a 13B model:
- fp16 weights: ~26 GB
- 4‑bit weights: ~6.5 GB
This is a huge win for fitting big models on a single GPU.
3.3 Double quantization and paged optimizers
QLoRA introduces two additional techniques:
Double quantization
- The quantization constants (scales) used for 4‑bit quantization are themselves quantized to 8‑bit.
- This further reduces memory overhead from storing these metadata values.
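Concretely, here is the bookkeeping behind that saving, using the block sizes reported in the QLoRA paper (64 weights per 4‑bit block, 256 first‑level constants per second‑level block); treat the exact figures as approximations:
# Storage overhead of quantization constants, in bits per base-model parameter
single_quant = 32 / 64                        # one fp32 scale per 64 weights
double_quant = 8 / 64 + 32 / (64 * 256)       # 8-bit scales, plus fp32 second-level constants
print(single_quant, round(double_quant, 3))   # 0.5 vs ~0.127 bits per parameter
print(65e9 * (single_quant - double_quant) / 8 / 1e9)  # ~3 GB saved on a 65B model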
Paged optimizers
- Implemented in bitsandbytes (paged Adam, etc.)
- Allow storing optimizer states more efficiently and paging them between CPU and GPU memory as needed.
- Critical for training very large models when GPU VRAM is limited.
Combined, these allow fine‑tuning models up to 65B parameters on a single 48 GB GPU (or similar).
3.4 How QLoRA differs from “just LoRA + quantization”
You could imagine:
- Load a quantized model (e.g., int8 or 4‑bit)
- Add LoRA adapters on top
- Train
QLoRA is more than that:
- It systematically uses 4‑bit NF4 quantization
- Carefully handles backpropagation through quantized weights (the frozen weights are dequantized on the fly in the compute graph)
- Uses double quantization & paged optimizers tuned for this workflow
- Demonstrated that, despite 4‑bit quantization, performance can match or exceed 16‑bit LoRA fine‑tuning in many benchmarks
4. LoRA vs QLoRA: Conceptual Comparison
4.1 Conceptual summary
LoRA:
- Frozen base model in standard precision (fp16/bf16)
- Low‑rank adapters in trainable precision (fp16/bf16)
- Goal: reduce trainable parameters and optimizer memory
QLoRA:
- Frozen base model in 4‑bit quantized precision
- Low‑rank adapters in trainable precision (fp16/bf16)
- Goal: also reduce base model memory, enabling much larger models on limited GPUs
4.2 Comparison table
| Aspect | LoRA | QLoRA |
|---|---|---|
| Base model precision | fp16 / bf16 (or fp32) | 4‑bit (NF4) |
| Base model trainable? | Frozen (usually) | Frozen (quantized) |
| Additional parameters | Low‑rank adapters (LoRA) | Same (LoRA adapters) |
| Trainable parameter count | ≪ full model (e.g., <1–5%) | Same order as LoRA |
| Memory for base weights | High (16‑bit or 32‑bit) | Much lower (4‑bit) |
| Optimizer state memory | Proportional to adapter size | Proportional to adapter size (with paged optimizers) |
| Hardware requirement | Enough VRAM for 16‑bit base model | Much smaller VRAM; 7–70B on a single GPU feasible |
| Training speed | Fast | Sometimes slower (4‑bit operations & overhead) |
| Implementation complexity | Moderate / simple | Higher (quantization, bitsandbytes, configs) |
| Performance (quality) | Excellent; near full fine‑tune | Very competitive; sometimes slightly lower, often similar |
5. Memory & Compute Trade‑Offs
5.1 Rough memory estimates
Let:
- ( N ) = number of parameters in base model
- Precision bits = 16 (fp16) or 4 (NF4)
- Base model memory ≈ ( N \times \text{bits} / 8 ) bytes
Example (ignoring overhead, embeddings, etc.):
7B model
- fp16 base: ( 7 \times 10^9 \times 2 ) bytes ≈ 14 GB
- 4‑bit base: ( 7 \times 10^9 \times 0.5 ) bytes ≈ 3.5 GB
13B model
- fp16: ≈ 26 GB
- 4‑bit: ≈ 6.5 GB
65B model
- fp16: ≈ 130 GB
- 4‑bit: ≈ 32.5 GB
Now consider LoRA adapter size. Suppose we add rank‑r LoRA to a subset of layers. The total LoRA parameter count is typically on the order of tens of millions, far smaller than billions.
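A quick calculation makes this concrete. Using approximate Llama‑2‑7B shapes (hidden size 4096, MLP intermediate size 11008, 32 layers; treat these as illustrative) and the target modules from the examples later in this article:
r = 8
d, d_ff, n_layers = 4096, 11008, 32
attn = 4 * r * (d + d)                     # q_proj, k_proj, v_proj, o_proj
mlp = 2 * r * (d + d_ff) + r * (d_ff + d)  # gate_proj, up_proj, down_proj
print(f"{n_layers * (attn + mlp) / 1e6:.1f}M LoRA parameters")  # ~20M, vs ~7,000M base parameters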
So:
- Memory for adapters + optimizer states is minor compared to base model for large ( N ).
- With QLoRA, the quantized base model shrinks enough that it fits within a typical single‑GPU memory budget, leaving headroom for activations and adapter training.
5.2 Compute cost
- Forward pass:
- LoRA adds a small matrix multiplication ( BAx ) on top of ( Wx )
- QLoRA must dequantize 4‑bit weights to higher precision for computation (internally), which adds overhead.
- Backward pass:
- Gradients are computed only for the LoRA parameters (both LoRA and QLoRA).
- QLoRA may involve extra steps due to quantization and paged optimizers.
Empirically:
- LoRA on a 7B model (fp16) can be very fast if your GPU has enough VRAM.
- QLoRA on a larger model (e.g., 33B, 65B) will be slower per step but gives you access to much bigger models than otherwise possible.
6. Quality & Stability Considerations
6.1 Does quantization hurt performance?
In principle, 4‑bit quantization can degrade model quality. However:
- The QLoRA paper shows that 4‑bit NF4 with LoRA often matches or exceeds 16‑bit LoRA fine‑tuning on many benchmarks.
- Reasons:
- The base model is strong and only slightly approximated via 4‑bit NF4.
- The LoRA adapters (in higher precision) compensate for quantization errors during training.
Still, there are caveats:
- Extreme quantization (e.g., 2‑bit) would likely hurt more.
- Very small models might be more sensitive.
- Tasks that require very fine‑grained nuance might see small degradations.
6.2 Training stability
QLoRA’s design choices optimize stability:
- 4‑bit NF4 is better aligned with LLM weight distributions than naive int4.
- Optimizer hyperparameters are tuned for quantized setups.
- Adapters remain in 16‑bit, preserving gradient fidelity.
Nonetheless:
- You should monitor training loss and evaluation metrics carefully.
- Use recommended configs from reference implementations when possible.
7. Practical Implementation: LoRA with Hugging Face PEFT
Below is a minimal example of fine‑tuning an LLM with LoRA using Hugging Face Transformers + PEFT (no quantization).
7.1 Installation
pip install transformers peft datasets accelerate bitsandbytes
(Even though bitsandbytes is not strictly required for non‑quantized LoRA, it’s often used for 8‑bit loading.)
7.2 Loading the base model in 16‑bit with LoRA
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # if needed
# Load model in 16-bit (bf16 or fp16)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto", # or torch.float16 / bfloat16
device_map="auto", # let accelerate spread across GPUs
)
# Define LoRA configuration
lora_config = LoraConfig(
r=8, # rank
lora_alpha=16, # scaling
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM", # or SEQ_2_SEQ_LM, TOKEN_CLS, etc.
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
)
# Wrap model with LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
You would then:
- Prepare your dataset (datasets library)
- Use Trainer or accelerate to train while only the LoRA parameters are updated
7.3 Sample training loop with Trainer
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
def tokenize(example):
return tokenizer(
example["text"],
truncation=True,
padding="max_length",
max_length=512,
)
tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=False,
)
training_args = TrainingArguments(
output_dir="./lora-llama2-7b-wikitext",
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
num_train_epochs=3,
learning_rate=2e-4,
fp16=True,
logging_steps=50,
save_steps=500,
save_total_limit=2,
report_to="none",
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["validation"],
data_collator=data_collator,
)
trainer.train()
This is a straightforward LoRA setup: base model in high precision, LoRA adapters trained.
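After training, you typically save only the adapter weights, which amount to a few tens of megabytes rather than gigabytes (the path below is illustrative):
model.save_pretrained("./lora-llama2-7b-wikitext/adapter")      # writes only the LoRA adapter weights
tokenizer.save_pretrained("./lora-llama2-7b-wikitext/adapter")  # keep the tokenizer alongside for convenience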
8. Practical Implementation: QLoRA in Practice
With QLoRA, we:
- Load the base model in 4‑bit via bitsandbytes
- Attach LoRA adapters using PEFT
- Train similarly, but with much lower memory usage
8.1 Loading a model in 4‑bit
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
model_name = "meta-llama/Llama-2-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16, # or torch.float16
bnb_4bit_use_double_quant=True, # double quantization
bnb_4bit_quant_type="nf4", # NF4 quantization
)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
)
Here:
- load_in_4bit=True tells transformers to use 4‑bit quantization.
- bnb_4bit_compute_dtype is the dtype used for computation (e.g., bf16).
- double_quant + nf4 align with QLoRA's recommended configuration.
8.2 Attaching LoRA adapters (QLoRA style)
lora_config = LoraConfig(
r=8,
lora_alpha=16,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
Only the LoRA parameters (in higher precision) will be updated. The 4‑bit weights remain frozen.
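Many QLoRA recipes also run PEFT's prepare_model_for_kbit_training on the quantized model before attaching the adapters; a sketch of that ordering (worth checking against your PEFT version):
from peft import prepare_model_for_kbit_training

# Call this on the freshly loaded 4-bit model, before get_peft_model:
# it freezes the base weights and prepares norms/inputs for k-bit training.
model = prepare_model_for_kbit_training(model)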
8.3 Example training configuration
With QLoRA, you often:
- Use smaller batch sizes (to fit context + activations)
- Rely on gradient accumulation
- Possibly use bitsandbytes paged optimizers via the transformers integration (see the sketch at the end of this section)
A simplified Trainer setup (for illustration):
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
def tokenize(example):
return tokenizer(
example["text"],
truncation=True,
padding="max_length",
max_length=512,
)
tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=False,
)
training_args = TrainingArguments(
output_dir="./qlora-llama2-13b-wikitext",
per_device_train_batch_size=1,
gradient_accumulation_steps=16,
num_train_epochs=3,
learning_rate=2e-4,
bf16=True, # or fp16, depending on GPU support
logging_steps=50,
save_steps=500,
save_total_limit=2,
report_to="none",
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["validation"],
data_collator=data_collator,
)
trainer.train()
Note: For fully faithful QLoRA setups, refer to official scripts (e.g., Hugging Face's trl QLoRA examples or the original QLoRA repo). They include additional tuning (e.g., gradient checkpointing, specific LR schedules, and more).
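For the paged‑optimizer bullet above, recent transformers releases expose bitsandbytes paged optimizers through the optim argument of TrainingArguments; a minimal sketch (option names can vary across transformers/bitsandbytes versions):
training_args = TrainingArguments(
    output_dir="./qlora-llama2-13b-wikitext",
    optim="paged_adamw_32bit",   # bitsandbytes paged AdamW; "paged_adamw_8bit" is another option
    # ... remaining arguments as in the example above ...
)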
9. When to Use LoRA vs QLoRA
9.1 Key decision factors
Consider:
- GPU memory availability
- Model size you want to fine‑tune
- Desired training speed
- Implementation complexity tolerance
- Quality requirements and task nature
9.2 Typical scenarios
Scenario 1: You have a 24 GB GPU and want to fine‑tune a 7B model
- Options:
- LoRA with fp16 base
- QLoRA with 4‑bit base
Recommendation:
- Start with standard LoRA:
- Simpler and often faster
- 7B fp16 + LoRA usually fits comfortably in 24 GB with careful configs
- Use QLoRA only if:
- You need extra memory headroom for longer context lengths or larger batch sizes
Scenario 2: You want to fine‑tune a 13B or 34B model on a single 24 GB GPU
- Full fp16 model likely won’t fit, or will leave almost no room for activations.
Recommendation:
- QLoRA is strongly preferred:
- 4‑bit quantization makes 13B / 34B feasible on a single GPU
- You retain competitive performance with reasonable training time
Scenario 3: You have multi‑GPU A100s (80 GB) and care about maximum performance
- You can afford fp16 13B/34B or even larger.
Recommendation:
- Either:
- Use LoRA on a larger model in fp16 for maximum quality and simpler tooling.
- Experiment with QLoRA if you want to push model sizes further (e.g., 70B) or explore memory savings to run large batch sizes.
Scenario 4: You need many task‑specific adapters and easy deployment
- Both LoRA and QLoRA excel at modular adapters.
Deployment considerations:
- For LoRA:
- Base model in fp16 / bf16
- Load desired LoRA adapter at inference time
- For QLoRA:
- Base model in 4‑bit
- Load QLoRA adapter at inference time
If your inference environment has deeply constrained memory (e.g., a single 16 GB GPU), QLoRA may be better even at inference time.
9.3 Rule‑of‑thumb summary
- If the base model fits comfortably in fp16 on your GPU and you prefer speed & simplicity → LoRA.
- If the base model doesn’t fit or you want to fine‑tune very large models (13B–70B) on constrained hardware → QLoRA.
- For research or production where every bit of performance matters and hardware is available → start with LoRA on a strong model, but keep QLoRA in mind if you hit memory ceilings.
10. Inference & Deployment Considerations
10.1 Using LoRA/QLoRA adapters at inference
With PEFT, you can:
- Load the base model (fp16 or 4‑bit)
- Load the adapter weights
- Use the combined model for generation
Example (QLoRA inference):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
base_model_name = "meta-llama/Llama-2-13b-hf"
adapter_path = "./qlora-llama2-13b-wikitext" # directory with adapter weights
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
)
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
quantization_config=bnb_config,
device_map="auto",
)
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()
prompt = "Explain the concept of QLoRA in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
output = model.generate(
**inputs,
max_new_tokens=150,
do_sample=True,
temperature=0.7,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
10.2 Merging LoRA adapters
For LoRA on a full‑precision base model, you can merge the LoRA weights into the base model for deployment:
- This results in a single set of weights without needing adapter logic.
- Useful if you want a “standalone” fine‑tuned model.
- For QLoRA, merging into a 4‑bit base is more nuanced and not always desirable; often you just keep the adapter separate.
Example merging (for non‑quantized LoRA):
model = model.merge_and_unload()
model.save_pretrained("./merged-lora-model")
tokenizer.save_pretrained("./merged-lora-model")
Note: Merging with quantized models is less straightforward and often not used; instead, deployment keeps the quantized base + adapter.
11. Common Pitfalls and Best Practices
11.1 Pitfalls
- Underestimating context memory: even with QLoRA, long context windows and large batch sizes can exceed VRAM.
- Incorrect target_modules: if you misconfigure which layers LoRA is applied to, you may see little improvement.
- Learning rate too high: can lead to instability, especially with quantized weights. Stick to recommended ranges (e.g., 1e‑4–3e‑4 for LoRA/QLoRA).
- Incompatible GPUs for bf16: use fp16 compute if your GPU doesn't support bf16 well.
11.2 Best practices
- Start with small rank (r=4 or 8) and increase only if necessary.
- Use gradient checkpointing for large models to reduce activation memory (see the sketch after this list).
- Use evaluation checkpoints to monitor overfitting (even with small adapters, it happens).
- Reuse community baselines:
  - Hugging Face trl's QLoRA examples for chat fine‑tuning
  - Well‑tested hyperparameters from papers/blogs
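For the gradient‑checkpointing tip above, enabling it is usually a one‑liner on the model (or via TrainingArguments); a sketch:
model.gradient_checkpointing_enable()   # recompute activations during backward to save memory
model.config.use_cache = False          # the KV cache is incompatible with checkpointing during training
# Alternatively: TrainingArguments(..., gradient_checkpointing=True)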
12. Summary: LoRA vs QLoRA
To recap:
LoRA:
- Adds low‑rank trainable adapters on top of a frozen full‑precision model.
- Greatly reduces the number of trainable parameters and optimizer state.
- Best when the base model fits in GPU memory in fp16/bf16.
- Simple, fast, widely supported.
QLoRA:
- Uses 4‑bit NF4 quantization for the frozen base model plus LoRA adapters.
- Further reduces memory footprint, enabling fine‑tuning of very large models (13–70B) on modest GPUs.
- Slightly more complex, sometimes slower per step, but extremely memory‑efficient.
- Empirically competitive with full‑precision LoRA fine‑tuning.
Guiding choice:
- If VRAM is not your main constraint → LoRA is often the easiest and fastest route.
- If VRAM is the bottleneck and you want larger models or longer context → QLoRA unlocks capabilities that would otherwise require multi‑GPU or specialized hardware.
Both methods are complementary tools in the modern LLM practitioner’s toolkit. Understanding their trade‑offs lets you pick the right approach for your hardware, model size, and project requirements.