Fully fine-tuning a 7B model costs thousands of dollars in compute and takes days. LoRA does it in hours for under $50, and QLoRA brings that down to a single consumer GPU. Here's what actually works in production: what breaks, what surprises you, and where these methods shine.
## The core idea: train 0.1% of the parameters
LoRA (Low-Rank Adaptation) doesn't modify the original model weights. Instead, it adds small trainable low-rank matrices alongside the existing layers. At inference time these are merged into the base weights, so you get zero additional latency at a fraction of the training cost.
### The math (simplified)
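LoRA freezes the base weight matrix W (d × k) and learns an update ΔW = B·A, where B is d × r, A is r × k, and the rank r is much smaller than d or k. The forward pass becomes h = W·x + (α/r)·B·A·x, with α a scaling hyperparameter. A quick sketch of the parameter savings, assuming an illustrative 7B-class hidden size of 4096 and rank 16:

```python
# LoRA replaces the full-rank update dW (d x k) with a low-rank product B @ A,
# where B is (d x r) and A is (r x k).
# Forward pass: h = W @ x + (alpha / r) * (B @ (A @ x))

d, k, r = 4096, 4096, 16            # illustrative 7B-class attention projection

full_update_params = d * k           # training dW directly per layer
lora_params = d * r + r * k          # training only B and A instead

print(full_update_params)                 # 16777216 (~16.8M)
print(lora_params)                        # 131072 (~131k)
print(lora_params / full_update_params)   # 0.0078125 -- under 1% per matrix
```

Summed over every adapted layer of a 7B model, the trainable fraction lands in the sub-percent range the headline claims.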
## LoRA vs QLoRA: when to use which
| Factor | LoRA | QLoRA |
|---|---|---|
| GPU memory (7B model) | ~28GB (A100 needed) | ~12GB (RTX 4090 works) |
| Training speed | ~2 hours (1K examples) | ~4 hours (same data) |
| Quality vs full fine-tune | ~97% equivalent | ~95% equivalent |
| Cost (cloud GPU) | $30-80 per run | $10-25 per run |
| Best for | Max quality, A100 available | Budget-constrained, consumer GPU |
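The memory column can be sanity-checked with back-of-the-envelope arithmetic (a rough sketch; real usage also depends on sequence length, batch size, activations, and optimizer state):

```python
params = 7e9  # 7B parameters

# LoRA keeps base weights in 16-bit precision: 2 bytes per parameter.
lora_weights_gb = params * 2 / 1e9      # 14.0 GB before activations/optimizer

# QLoRA quantizes base weights to 4-bit: 0.5 bytes per parameter.
qlora_weights_gb = params * 0.5 / 1e9   # 3.5 GB, leaving headroom on a 24GB card

print(lora_weights_gb, qlora_weights_gb)
```

The weights alone explain the gap: 14GB plus training overhead pushes LoRA past a 24GB consumer card, while 3.5GB leaves QLoRA plenty of room.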
## What works: lessons from 30+ production deployments
### 500–2000 high-quality examples is the sweet spot
Below 500, you underfit. Above 5000, you get diminishing returns. Quality of examples matters 10× more than quantity.
### Rank 16–32 for most tasks
Higher rank means more trainable parameters: more capacity, but slower training and higher memory use. Rank 64+ rarely helps unless your task is genuinely complex.
### Target all linear layers, not just attention

The original LoRA paper targeted only q_proj and v_proj. In practice, adding k_proj, o_proj, and the MLP projection layers gives a 5-8% quality boost.
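With Hugging Face's `peft` library, a config along these lines targets all linear layers (the module names assume a LLaMA-style architecture; check your model's actual layer names):

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,              # rank 16-32 covers most tasks
    lora_alpha=32,     # scaling factor; alpha/r = 2 is a common default
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP (LLaMA-style names)
    ],
    task_type="CAUSAL_LM",
)
```

Depending on your `peft` version, `target_modules="all-linear"` may also be available as a shortcut.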
## What breaks: the failure modes
### Catastrophic forgetting on general tasks

Your fine-tuned model gets amazing at your task but forgets how to follow basic instructions. The fix: keep 10-20% general instruction data in the training mix.
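Mixing in general data is a one-liner at dataset-assembly time. A minimal sketch with hypothetical `task_examples` and `general_examples` lists:

```python
import random

def build_training_mix(task_examples, general_examples, general_frac=0.15, seed=0):
    """Blend task data with general instruction data to reduce forgetting."""
    rng = random.Random(seed)
    # Number of general examples so they make up general_frac of the final mix.
    n_general = round(len(task_examples) * general_frac / (1 - general_frac))
    mix = task_examples + rng.sample(general_examples, n_general)
    rng.shuffle(mix)
    return mix

mix = build_training_mix([f"task-{i}" for i in range(850)],
                         [f"general-{i}" for i in range(5000)])
print(len(mix))  # 1000: 850 task + 150 general (15% of the final mix)
```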
### Overfitting on small datasets
200 examples for 3 epochs means memorized, not learned. Use early stopping on a held-out validation set, and watch for eval loss diverging from train loss.
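Early stopping reduces to tracking the best eval loss with a patience counter. A minimal sketch; plug your real eval loop in place of the hard-coded `losses` list:

```python
def find_stop_step(eval_losses, patience=3):
    """Return the index of the best eval loss, stopping after `patience`
    consecutive evals without improvement."""
    best_loss, best_step, bad_evals = float("inf"), 0, 0
    for step, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_loss, best_step, bad_evals = loss, step, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break
    return best_step

# Eval loss dips, then climbs as the model starts memorizing:
losses = [1.9, 1.5, 1.3, 1.25, 1.31, 1.40, 1.55, 1.72]
print(find_stop_step(losses))  # 3 -> restore the checkpoint from eval #3
```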
### QLoRA quantization artifacts in generation
4-bit quantization can introduce subtle repetition patterns or truncation. Always eval with the merged (dequantized) model, not the quantized training checkpoint.
## The production checklist
- Curate 500+ examples that represent your EXACT production distribution
- Include 15% negative/edge-case examples (what NOT to generate)
- Train with rank 16-32, all linear layers, learning rate 2e-4
- Use a cosine scheduler with 10% warmup steps
- Evaluate on a held-out set every 50 steps; stop at the eval-loss minimum
- Merge adapters into the base model for inference (no overhead)
- A/B test the merged model against a prompted baseline on 200+ real inputs
- Monitor output distribution drift in production (weekly eval runs)
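The "no overhead" claim behind the merge step is easy to verify numerically: folding (α/r)·B·A into W produces the same outputs with a single matmul at inference. A toy check in plain Python:

```python
def matmul(X, Y):
    """Naive matrix multiply over nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add_scaled(W, D, s):
    """Elementwise W + s * D."""
    return [[w + s * d for w, d in zip(rw, rd)] for rw, rd in zip(W, D)]

# Tiny weights: W is 2x2, LoRA factors B (2x1) and A (1x2), rank r=1, alpha=2.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [0.25]]
A = [[1.0, -1.0]]
scale = 2.0 / 1.0          # alpha / r
x = [[3.0], [4.0]]

# Adapter path: h = W x + scale * B (A x)  -- two extra matmuls per layer
h_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)
# Merged path: h = (W + scale * B A) x    -- one matmul, no extra latency
h_merged = matmul(add_scaled(W, matmul(B, A), scale), x)

print(h_adapter, h_merged)  # [[2.0], [3.5]] [[2.0], [3.5]]
```

The two paths agree exactly, which is why merging before deployment is free.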
Parameter-efficient fine-tuning isn't magic; it's engineering. The models that work in production are the ones trained on carefully curated data, evaluated rigorously, and monitored continuously. LoRA just makes the iteration cycle fast enough to actually do this properly.