We were wrong about fine-tuning.

Anastasia Karavdina
2 hours ago

Not completely wrong, perhaps, but wrong in the way people are often wrong when they look at an early technology and extrapolate its future too directly from its first limitations.

A few years ago, when LLMs started moving from research demos into real business conversations, many of us had a fairly clear idea of what would happen next. The generic models were impressive, sometimes shockingly so, but they were also obviously not enough. They did not know our internal processes, they did not understand our company-specific language, they did not have access to the latest policies, they did not write exactly in our preferred tone, and they could not reliably distinguish between what sounded plausible and what was actually true in our business context.

So the conclusion felt almost obvious: sooner or later, every serious company would need its own model, or at least its own fine-tuned version of a foundation model.

It was an attractive idea, because it made enterprise AI sound like something you could own. A bank would have a banking model, an energy company would have an energy model, a retailer would have a retail model, and each organization would slowly move away from generic AI toward something trained on its documents, its language, its processes, its customers, and its way of working. Fine-tuning felt like the natural step from experimentation to maturity, from public demo to internal product, from “we use AI” to “we have our own AI.”

And for a while, this thinking shaped many conversations.

But then the reality of building GenAI systems inside companies turned out to be more complicated than the early “we just need to fine-tune the model” narrative suggested.

At first, it was easy to assume that the model itself was the problem, because when a prototype sounded too generic, missed internal nuance, or failed to follow company-specific formats, the natural conclusion was that the model needed to be adapted more deeply to the business.

But once these systems moved beyond demos, it became clear that many failures were not caused by the model being too generic. They were caused by the system around it being too weak: the right context was missing, the relevant information was scattered across documents and systems, business processes had too many exceptions, and prompts were expected to compensate for missing architecture, unclear requirements, and absent evaluations.

This is where RAG turned out to be more powerful than fine-tuning for many enterprise use cases.

Instead of forcing the model to memorize internal policies, product documentation, legal wording, customer rules, or operational procedures, RAG allows the model to retrieve the relevant knowledge from the current source of truth and use it at the moment of answering.

That difference matters, because fine-tuning mainly changes how the model behaves, while RAG changes what the model can access when it needs to answer.

And in companies, knowledge is never static: policies change, products change, processes change, regulations change, and yesterday’s correct answer can quickly become outdated.

If this knowledge is baked into a fine-tuned model, every business change becomes a model maintenance problem.

With RAG, the knowledge stays where it belongs: in systems that can be updated, versioned, governed, searched, monitored, and improved, while the model uses the right context when it needs it.
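To make the difference concrete, here is a minimal sketch of the retrieval idea in Python. Everything in it is an assumption for illustration: `PolicyStore`, `build_prompt`, and the sample policies are hypothetical names, the "search" is a naive word-overlap score rather than a real vector index, and no actual model is called. The point is only that updating the store changes what the model would see, with no retraining involved.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str


def tokenize(text: str) -> list[str]:
    # Lowercase and split into alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


class PolicyStore:
    """Hypothetical in-memory 'source of truth' standing in for a real document system."""

    def __init__(self) -> None:
        self._docs: dict[str, Document] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        # Updating knowledge is a data change, not a model change.
        self._docs[doc_id] = Document(doc_id, text)

    def retrieve(self, query: str, k: int = 1) -> list[Document]:
        query_counts = Counter(tokenize(query))

        def score(doc: Document) -> int:
            # Naive relevance: number of overlapping word occurrences.
            doc_counts = Counter(tokenize(doc.text))
            return sum(min(query_counts[t], doc_counts[t]) for t in query_counts)

        ranked = sorted(self._docs.values(), key=score, reverse=True)
        return [doc for doc in ranked[:k] if score(doc) > 0]


def build_prompt(store: PolicyStore, question: str, k: int = 1) -> str:
    # The retrieved text is placed in the prompt at answer time.
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in store.retrieve(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


store = PolicyStore()
store.upsert("returns", "Customers may return items within 30 days of purchase.")
store.upsert("shipping", "Standard shipping takes 5 business days.")

question = "How many days do customers have to return items?"
before = build_prompt(store, question)

# The policy changes: only the store is updated, the model stays untouched.
store.upsert("returns", "Customers may return items within 14 days of purchase.")
after = build_prompt(store, question)
```

In this sketch, `before` carries the 30-day policy and `after` carries the 14-day one. With a fine-tuned model, the same policy change would mean collecting new examples and training again.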

That is why RAG became such a powerful pattern in enterprise AI: not because it is perfect, but because it fits the reality of companies much better than the idea that every organization should encode its changing knowledge into a fine-tuned model.

At the same time, the models themselves kept improving faster than many of us expected. Context windows became longer, reasoning became stronger, structured outputs became easier to enforce, tool use became more reliable, and model efficiency improved across the market. The gap that fine-tuning was supposed to close did not disappear completely, but for many practical business applications, it became smaller and smaller before most companies had even reached the point where fine-tuning truly made operational sense.

So we never really arrived at the future where every company needed its own fine-tuned model.

Instead, we arrived at a more sober and probably healthier conclusion: most companies do not need “their own model” as much as they need better AI systems.

BloombergGPT is a good example of how the conversation looked in 2023. Bloomberg built a 50-billion-parameter language model for finance, trained on a mix of financial and general-purpose data, including hundreds of billions of tokens from Bloomberg’s own financial data sources. At that moment, it looked like a very plausible future: if finance is complex enough, regulated enough, and language-heavy enough, then surely a major financial information company would need its own domain-specific foundation model. And BloombergGPT did show strong results on financial NLP tasks, so the idea was not irrational.

But the interesting part is what happened next. Shortly after, newer generic models such as GPT-4 were reported to outperform BloombergGPT on many financial NLP benchmarks, which made the whole story much more nuanced. BloombergGPT was not a bad idea. It was a product of a moment when many people believed domain-specific models would become the default enterprise path. The next wave of progress showed that, for most companies, domain-specific knowledge is often better handled through RAG and system design than through training a new model from scratch.

Earlier this month, OpenAI announced that it will deprecate its self-serve fine-tuning functionality. To me, this is an important signal. Not because fine-tuning has no value anymore, but because there may simply not be enough broad business demand to keep it as a default, self-serve capability.

Fine-tuning will still make sense in some cases, especially when the task is narrow, stable, repetitive, measurable, and supported by high-quality examples. It can still be useful for specific output behavior, classification, structured extraction, style consistency, latency-sensitive flows, or situations where the remaining gap is clearly about model behavior rather than missing context.

But that is a much smaller niche than many of us imagined a few years ago.

We were wrong about fine-tuning.
