Fully fine-tuning a 7B model costs thousands of dollars in compute and takes days. LoRA does it in hours for under $50, and QLoRA brings that down to a single consumer GPU. Here's what actually works in production: what breaks, what surprises you, and where these methods shine.
## The core idea: train 0.1% of the parameters
LoRA (Low-Rank Adaptation) doesn't modify the original model weights. Instead, it adds small trainable low-rank matrices alongside the existing layers. At inference time these are merged into the base weights, so you get zero additional latency at a fraction of the training cost.
### The math (simplified)
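LoRA freezes the base weight matrix W (d × k) and learns an update ΔW = B·A, where B is d × r, A is r × k, and the rank r is much smaller than d or k. The forward pass becomes h = W·x + (α/r)·B·A·x, with α a scaling hyperparameter. A quick sketch of the parameter savings, assuming an illustrative 7B-class hidden size of 4096 and rank 16:

```python
# LoRA replaces the full-rank update dW (d x k) with a low-rank product B @ A,
# where B is (d x r) and A is (r x k).
# Forward pass: h = W @ x + (alpha / r) * (B @ (A @ x))

d, k, r = 4096, 4096, 16            # illustrative 7B-class attention projection

full_update_params = d * k           # training dW directly per layer
lora_params = d * r + r * k          # training only B and A instead

print(full_update_params)                 # 16777216 (~16.8M)
print(lora_params)                        # 131072 (~131k)
print(lora_params / full_update_params)   # 0.0078125 -- under 1% per matrix
```

Summed over every adapted layer of a 7B model, the trainable fraction lands in the sub-percent range the headline claims.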
## LoRA vs QLoRA: when to use which
| Factor | LoRA | QLoRA |
|---|---|---|
| GPU memory (7B model) | ~28GB (A100 needed) | ~12GB (RTX 4090 works) |
| Training speed | ~2 hours (1K examples) | ~4 hours (same data) |
| Quality vs full fine-tune | ~97% equivalent | ~95% equivalent |
| Cost (cloud GPU) | $30-80 per run | $10-25 per run |
| Best for | Max quality, A100 available | Budget-constrained, consumer GPU |
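The memory column can be sanity-checked with back-of-the-envelope arithmetic (a rough sketch; real usage also depends on sequence length, batch size, activations, and optimizer state):

```python
params = 7e9  # 7B parameters

# LoRA keeps base weights in 16-bit precision: 2 bytes per parameter.
lora_weights_gb = params * 2 / 1e9      # 14.0 GB before activations/optimizer

# QLoRA quantizes base weights to 4-bit: 0.5 bytes per parameter.
qlora_weights_gb = params * 0.5 / 1e9   # 3.5 GB, leaving headroom on a 24GB card

print(lora_weights_gb, qlora_weights_gb)
```

The weights alone explain the gap: 14GB plus training overhead pushes LoRA past a 24GB consumer card, while 3.5GB leaves QLoRA plenty of room.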
## What works: lessons from 30+ production deployments
### 500–2000 high-quality examples is the sweet spot
Below 500, you underfit. Above 5000, you get diminishing returns. Quality of examples matters 10× more than quantity.
### Rank 16–32 for most tasks
Higher rank means more trainable parameters: more capacity, but slower training and higher memory use. Rank 64+ rarely helps unless your task is genuinely complex.
### Target all linear layers, not just attention

The original LoRA paper targeted only q_proj and v_proj. In practice, adding k_proj, o_proj, and the MLP projection layers gives a 5-8% quality boost.
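With Hugging Face's `peft` library, a config along these lines targets all linear layers (the module names assume a LLaMA-style architecture; check your model's actual layer names):

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,              # rank 16-32 covers most tasks
    lora_alpha=32,     # scaling factor; alpha/r = 2 is a common default
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP (LLaMA-style names)
    ],
    task_type="CAUSAL_LM",
)
```

Depending on your `peft` version, `target_modules="all-linear"` may also be available as a shortcut.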
## What breaks: the failure modes
### Catastrophic forgetting on general tasks

Your fine-tuned model gets amazing at your task but forgets how to follow basic instructions. The fix: keep 10-20% general instruction data in the training mix.
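Mixing in general data is a one-liner at dataset-assembly time. A minimal sketch with hypothetical `task_examples` and `general_examples` lists:

```python
import random

def build_training_mix(task_examples, general_examples, general_frac=0.15, seed=0):
    """Blend task data with general instruction data to reduce forgetting."""
    rng = random.Random(seed)
    # Number of general examples so they make up general_frac of the final mix.
    n_general = round(len(task_examples) * general_frac / (1 - general_frac))
    mix = task_examples + rng.sample(general_examples, n_general)
    rng.shuffle(mix)
    return mix

mix = build_training_mix([f"task-{i}" for i in range(850)],
                         [f"general-{i}" for i in range(5000)])
print(len(mix))  # 1000: 850 task + 150 general (15% of the final mix)
```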
### Overfitting on small datasets
200 examples for 3 epochs means memorized, not learned. Use early stopping on a held-out validation set, and watch for eval loss diverging from train loss.
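Early stopping reduces to tracking the best eval loss with a patience counter. A minimal sketch; plug your real eval loop in place of the hard-coded `losses` list:

```python
def find_stop_step(eval_losses, patience=3):
    """Return the index of the best eval loss, stopping after `patience`
    consecutive evals without improvement."""
    best_loss, best_step, bad_evals = float("inf"), 0, 0
    for step, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_loss, best_step, bad_evals = loss, step, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break
    return best_step

# Eval loss dips, then climbs as the model starts memorizing:
losses = [1.9, 1.5, 1.3, 1.25, 1.31, 1.40, 1.55, 1.72]
print(find_stop_step(losses))  # 3 -> restore the checkpoint from eval #3
```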
### QLoRA quantization artifacts in generation
4-bit quantization can introduce subtle repetition patterns or truncation. Always eval with the merged (dequantized) model, not the quantized training checkpoint.
## The production checklist
- Curate 500+ examples that represent your EXACT production distribution
- Include 15% negative/edge-case examples (what NOT to generate)
- Train with rank 16-32, all linear layers, learning rate 2e-4
- Use a cosine scheduler with 10% warmup steps
- Evaluate on a held-out set every 50 steps; stop at the eval-loss minimum
- Merge adapters into the base model for inference (no overhead)
- A/B test the merged model against a prompted baseline on 200+ real inputs
- Monitor output distribution drift in production (weekly eval runs)
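The "no overhead" claim behind the merge step is easy to verify numerically: folding (α/r)·B·A into W produces the same outputs with a single matmul at inference. A toy check in plain Python:

```python
def matmul(X, Y):
    """Naive matrix multiply over nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add_scaled(W, D, s):
    """Elementwise W + s * D."""
    return [[w + s * d for w, d in zip(rw, rd)] for rw, rd in zip(W, D)]

# Tiny weights: W is 2x2, LoRA factors B (2x1) and A (1x2), rank r=1, alpha=2.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [0.25]]
A = [[1.0, -1.0]]
scale = 2.0 / 1.0          # alpha / r
x = [[3.0], [4.0]]

# Adapter path: h = W x + scale * B (A x)  -- two extra matmuls per layer
h_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)
# Merged path: h = (W + scale * B A) x    -- one matmul, no extra latency
h_merged = matmul(add_scaled(W, matmul(B, A), scale), x)

print(h_adapter, h_merged)  # [[2.0], [3.5]] [[2.0], [3.5]]
```

The two paths agree exactly, which is why merging before deployment is free.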
Parameter-efficient fine-tuning isn't magic; it's engineering. The models that work in production are the ones trained on carefully curated data, evaluated rigorously, and monitored continuously. LoRA just makes the iteration cycle fast enough to actually do this properly.