LLM Fine-Tuning: When PEFT Makes Sense and When It Doesn't

Fine-tuning has a reputation problem. Some see it as a magic wand, sure that feeding a model some company data will make it understand the domain. Others treat any change to the weights as a relic of the pre-prompting era. 

Both can be wrong at once, and the confusion is expensive. A misjudged training run can burn through a GPU budget and ship a model that performs worse than the one you started with.

Parameter-efficient fine-tuning, or PEFT, is what makes the question worth reopening. It sidesteps the brute-force cost of updating every weight by freezing the base model and training only a thin slice of new parameters. 

The result is a much better cost-to-capability curve, which is why fine-tuning is back on the table for teams that could never afford the full version.

Cheaper is not the same as correct, though. This post is the detailed version of the decision: when PEFT is the right tool, and when it quietly sets you up to fail.

What is PEFT?

Full fine-tuning updates every weight in the model. That buys maximum flexibility at maximum cost, and it demands a large, clean dataset. 

It also carries a real risk of catastrophic forgetting, where the model loses general ability while learning your narrow task.

PEFT takes the opposite bet. It leaves the pre-trained weights frozen and trains a small, targeted set of new parameters layered on top.

The dominant method is LoRA, short for Low-Rank Adaptation. Instead of learning a full weight update, it decomposes that update into two small low-rank matrices and trains only those. 

The original LoRA work showed this can cut trainable parameters by up to four orders of magnitude against full fine-tuning of a 175B-parameter model, while matching or beating its quality. 

Because the learned update is tiny and stored separately, you can keep one frozen base model and swap lightweight adapters, often just a few megabytes each, for different tasks on the fly.
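To make the size difference concrete, here is a minimal sketch of the arithmetic behind a low-rank update. The 4096 x 4096 layer, the rank of 8, the count of 64 adapted matrices, and the 16-bit storage are illustrative assumptions, not figures from the LoRA paper.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense update to one d_out x d_in weight matrix."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two LoRA factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def adapter_size_mb(n_params: int, bytes_per_param: int = 2) -> float:
    """Approximate storage in decimal megabytes, assuming 16-bit values by default."""
    return n_params * bytes_per_param / 1e6


# Hypothetical example: one 4096 x 4096 projection adapted at rank 8.
d, r = 4096, 8
full = full_update_params(d, d)
lora = lora_update_params(d, d, r)
print(f"full update: {full:,} params")
print(f"LoRA update: {lora:,} params ({full // lora}x fewer)")
# Assume 64 such matrices are adapted across the model.
print(f"adapter size for 64 matrices: {adapter_size_mb(64 * lora):.1f} MB")
```

The per-matrix saving here is 256x. The much larger headline reduction comes from also leaving most of the model's matrices without any adapter at all.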

QLoRA pushes the same idea further by running LoRA on top of a base model quantised to 4-bit precision, which collapses the memory footprint. 

The QLoRA paper showed a 65B model could be tuned on a single 48GB GPU, down from the more than 780GB that regular 16-bit fine-tuning of a model that size would demand, with almost no loss in quality.

In practice the same approach brings 30B-class models within reach of a single 24GB consumer card. Other PEFT variants exist, including adapters inserted inside transformer blocks and prompt or prefix tuning that trains steering vectors rather than weights, but LoRA and QLoRA are what you will reach for most in production.
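A rough back-of-envelope sketch shows why 4-bit quantisation changes the picture. The 12 bytes per parameter used for the full fine-tune (16-bit weights, 16-bit gradients, and two 32-bit Adam moments) is one simple accounting assumed here for illustration. It ignores activations and is not the paper's own breakdown.

```python
def weights_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the model weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


def full_finetune_gb(n_params: float, bytes_per_param: int = 12) -> float:
    """Rough full fine-tune footprint, excluding activations.

    Assumed breakdown: 2 bytes of 16-bit weights, 2 bytes of gradients,
    and 8 bytes of Adam optimiser state per parameter.
    """
    return n_params * bytes_per_param / 1e9


for name, n in [("65B", 65e9), ("33B", 33e9)]:
    print(
        f"{name}: 4-bit weights ~{weights_gb(n, 4):.1f} GB, "
        f"16-bit weights ~{weights_gb(n, 16):.1f} GB, "
        f"full fine-tune ~{full_finetune_gb(n):.0f} GB"
    )
```

On these assumptions the frozen 4-bit weights of a 65B model take about 32.5 GB and those of a 33B model about 16.5 GB. That leaves headroom on 48GB and 24GB cards for the adapters, their optimiser state, and activations.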

When PEFT is the right tool

PEFT earns its place when the problem is about behaviour rather than missing facts, the kind of thing you would describe as consistency or a specialised skill. 

Four situations make the case clearly:

  • Format and structure enforcement – When you need the model to reliably emit a strict output shape, say a specific JSON schema or a function-call structure, prompting tends to get you most of the way and then fail intermittently. Fine-tuning is far more dependable, often moving format adherence from mostly right to near-perfect once the behaviour is trained in. In production, “sometimes correctly formatted” is indistinguishable from broken.
  • Constrained hardware and budgets – This is QLoRA’s home turf. Without enterprise compute on hand, 4-bit quantisation makes tuning large open-weight models viable on hardware you already own, turning a six-figure infrastructure question into an overnight job on a single card. When the compute bill drops that far, the build-versus-buy maths shifts with it, which is the dynamic we examined in tokenomics and the compute economy.
  • Narrow task specialisation – When you want a model to do one thing well, like summarising in a legal or medical register or calling a specific set of tools, fine-tuning teaches that behaviour while preserving general capability. It also lets you push instructions out of the prompt and into the weights. That shortens prompts and cuts token cost, and it often lets a smaller, faster model match a larger one on your specific task.
  • Multi-task management – Rather than maintaining several large fine-tuned models, you train a lightweight adapter per task and swap them over a single frozen base. Storage and deployment stay cheap, and adding a task means adding a few megabytes rather than another full model.
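The format-enforcement case in the first bullet is the easiest of the four to put a number on. The sketch below measures how often a batch of model outputs matches a strict shape. The two-key function-call schema and the sample outputs are hypothetical, chosen only to show the check.

```python
import json

REQUIRED_KEYS = {"name", "arguments"}  # hypothetical function-call schema


def is_well_formed(raw_output: str) -> bool:
    """True if the output parses as a JSON object with exactly the required keys."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed) == REQUIRED_KEYS


def adherence_rate(outputs: list[str]) -> float:
    """Fraction of outputs that pass the schema check."""
    if not outputs:
        return 0.0
    return sum(is_well_formed(o) for o in outputs) / len(outputs)


# Made-up outputs for illustration only.
samples = [
    '{"name": "get_weather", "arguments": {"city": "Vilnius"}}',
    'Sure! Here is the call: {"name": "get_weather"}',            # chatty preamble
    '{"name": "get_weather", "arguments": {}, "note": "extra"}',  # extra key
    '{"name": "search", "arguments": {"q": "peft"}}',
]
print(f"format adherence: {adherence_rate(samples):.0%}")
```

Running the same check on the base model and on the tuned model, over the same prompts, gives a direct before-and-after figure for this kind of win.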

The common thread is straightforward. Reach for PEFT when you can describe the win as making the model behave a certain way, consistently, and when you can show that behaviour in a few hundred to a few thousand clean examples.

When PEFT is the wrong choice

The most common and most expensive mistake is using fine-tuning to inject knowledge. If the goal is to teach the model your company’s private documents or any body of searchable facts, fine-tuning is the wrong instrument. 

Weights are poor at storing and faithfully reproducing specific facts. Getting them to memorise reliably tends to take enough training to start degrading other capabilities. A fine-tuned model also cannot cite its sources, and you cannot update a single fact in it without retraining.

For knowledge problems, use RAG. Retrieval-augmented generation keeps facts in an external store the model queries at inference time, which makes them cheaper to update and traceable back to a source. 

Plenty of strong systems combine the two, fine-tuning for behaviour and retrieving facts, but if you have to pick one for a knowledge task, retrieval wins almost every time.
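A toy sketch illustrates the two properties that matter here: facts can be updated without touching the model, and each retrieved fact carries its source. The `ToyFactStore` class, its keyword-overlap scoring, and the sample policy text are all hypothetical stand-ins for a real search or vector index.

```python
from dataclasses import dataclass


@dataclass
class Document:
    source: str  # where the fact came from, kept for traceability
    text: str


class ToyFactStore:
    """Keyword-overlap retrieval; a stand-in for a real search or vector index."""

    def __init__(self) -> None:
        self._docs: dict[str, Document] = {}

    def upsert(self, doc: Document) -> None:
        # Updating a fact is a write to the store, not a training run.
        self._docs[doc.source] = doc

    def retrieve(self, query: str, k: int = 1) -> list[Document]:
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(doc.text.lower().split())), doc)
            for doc in self._docs.values()
        ]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in ranked[:k] if score > 0]


def build_prompt(question: str, store: ToyFactStore) -> str:
    """Prepend retrieved facts, tagged with their sources, to the question."""
    context = "\n".join(f"[{d.source}] {d.text}" for d in store.retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}"


store = ToyFactStore()
store.upsert(Document("policy.md", "refund window is 14 days"))
store.upsert(Document("policy.md", "refund window is 30 days"))  # the fact changed
prompt = build_prompt("what is the refund window", store)
print(prompt)
```

The second `upsert` replaces the stale fact in place, so the next prompt carries the 30-day figure along with its `policy.md` tag.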

A few other situations argue against any weight update at all.

  • Too little data – Under roughly 100 labelled examples, you will overfit or learn noise rather than the behaviour you intended. Prompting and retrieval are better first moves, and synthetic data can help fill a thin set.
  • A task that changes weekly – A fine-tune is a frozen artefact. If requirements shift constantly, every change becomes technical debt, so design a workflow instead of training a model.
  • No way to measure success – If you cannot build an evaluation that tells you whether the tuned model is actually better, do not train. Without metrics you are optimising blind, and a model that looks impressive in a notebook is not a product.
  • “Bigger must be better” instincts on small data – Larger models overfit harder when examples are scarce. Match model size to the data and lean on PEFT with regularisation rather than a full fine-tune.
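The "no way to measure success" point can be turned into a small gate. This minimal sketch scores a base and a tuned model on the same held-out examples and ships only if the gain clears a margin. The exact-match metric, the 0.02 margin, and the stand-in "models" are assumptions for illustration, and a real task would need its own metric.

```python
from typing import Callable

Example = tuple[str, str]  # (input, expected output)


def accuracy(model: Callable[[str], str], held_out: list[Example]) -> float:
    """Exact-match accuracy of a model callable on held-out examples."""
    if not held_out:
        raise ValueError("held-out set is empty")
    hits = sum(model(x).strip() == y for x, y in held_out)
    return hits / len(held_out)


def should_ship(base_acc: float, tuned_acc: float, min_gain: float = 0.02) -> bool:
    """Ship only if the tuned model beats the base by the chosen margin."""
    return tuned_acc - base_acc >= min_gain


# Stand-in "models"; real ones would call an inference endpoint.
held_out = [("2+2", "4"), ("3+3", "6"), ("5+5", "10"), ("1+1", "2")]
tuned_answers = {"2+2": "4", "3+3": "6", "5+5": "10", "1+1": "3"}


def base_model(prompt: str) -> str:
    return "4"  # always gives the same answer


def tuned_model(prompt: str) -> str:
    return tuned_answers.get(prompt, "")


base_acc = accuracy(base_model, held_out)
tuned_acc = accuracy(tuned_model, held_out)
print(f"base {base_acc:.2f}, tuned {tuned_acc:.2f}, ship: {should_ship(base_acc, tuned_acc)}")
```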

A decision cheat sheet

Two things drive most of the decision: how much labelled data you have and how tight your hardware budget is. 

Underneath both sits the real question: whether you are fixing behaviour or filling in knowledge. The chooser below is directional rather than gospel, and your own evaluation always overrides it.

  • Fewer than 100 labelled examples: don’t fine-tune at all. Lean on prompting and retrieval, with synthetic data to extend a thin set.
  • Roughly 100 to 1,000 examples: the sweet spot for PEFT, so use LoRA or adapters.
  • Around 1,000 to 10,000 examples: LoRA still works well, or move to a full fine-tune with a small learning rate.
  • 10,000 or more examples: full fine-tuning can start to pay off, but only if your evaluation and regression testing are solid.
  • When VRAM is tight: reach for QLoRA, a 4-bit base with LoRA adapters trained on top.
  • When the gap is missing facts: use RAG first, and add a fine-tune only if behaviour also needs work.
  • When the task shifts often: avoid weight updates and build a workflow instead.
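The list above can be sketched as a small function. The thresholds come straight from the list. The order in which the rules are checked, with task churn and knowledge gaps taking priority over row counts, is an assumption of this sketch, and `recommend` is a hypothetical helper rather than part of any library.

```python
def recommend(
    n_examples: int,
    *,
    gap_is_knowledge: bool = False,
    task_changes_often: bool = False,
    vram_tight: bool = False,
) -> str:
    """Directional recommendation mirroring the cheat sheet."""
    if task_changes_often:
        return "avoid weight updates; build a workflow"
    if gap_is_knowledge:
        return "RAG first; add a fine-tune only if behaviour also needs work"
    if n_examples < 100:
        return "prompting and retrieval; consider synthetic data"
    if vram_tight:
        return "QLoRA"
    if n_examples < 1_000:
        return "LoRA or adapters"
    if n_examples < 10_000:
        return "LoRA, or a full fine-tune with a small learning rate"
    return "full fine-tune, given solid evaluation and regression tests"


for n in (50, 500, 5_000, 50_000):
    print(f"{n:>6} examples -> {recommend(n)}")
print(f"   500 examples, tight VRAM -> {recommend(500, vram_tight=True)}")
```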

One thing the row counts hide: quality usually beats quantity. The QLoRA team found a small, well-curated instruction set outperformed a far larger but noisier one, so a sharp, representative dataset will often beat one several times its size. 

Curating what you already have tends to be a better use of effort than collecting more of it.

Treat fine-tuning as a system

Whichever way the decision goes, the teams that succeed treat fine-tuning as one piece of AI product development, an engineering loop rather than a one-off experiment: data, training, evaluation against a held-out set, deployment constraints, monitoring, then back to data. 

The traps that sink projects are systems problems rather than modelling ones. Data leakage makes validation look great while the model fails on real data. A latency budget gets ignored until the demo ships and the product feels slow.
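One cheap guard against the leakage trap is to check whether any held-out input also appears in the training set. The sketch below only catches duplicates that match after lowercasing and whitespace normalisation, so paraphrased or near-duplicate leaks would need a fuzzier comparison. The function names and sample strings are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def leaked_examples(train: list[str], held_out: list[str]) -> list[str]:
    """Held-out inputs whose normalised text also appears in the training set."""
    seen = {fingerprint(t) for t in train}
    return [h for h in held_out if fingerprint(h) in seen]


# Made-up data for illustration only.
train = ["Summarise the contract clause.", "Extract the invoice total."]
held_out = ["summarise   the contract clause.", "Classify the support ticket."]

leaks = leaked_examples(train, held_out)
print(f"{len(leaks)} of {len(held_out)} held-out examples also appear in training")
```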

Used well, PEFT gives the best cost-to-capability tradeoff available right now: real behavioural gains for a fraction of the compute, packaged as adapters you can deploy and swap. 

It pays off when you need consistent behaviour or a narrow specialisation and can show that behaviour in a clean dataset. It disappoints when the real gap is knowledge rather than behaviour or when the data is too thin and the success metric too vague to learn anything from.

Getting that judgment right is most of the work, and it is rarely a solo effort. The dataset work and the evaluation loop behind a reliable result are where SkyBiometry’s applied AI work shines, turning a generic model into a dependable specialist for one job.
