
Your Data as a Differentiator in the GenAI Era


photo by Maria Kostyleva (IG:masha_and_film)

With LLMs, we are approaching a point where virtually all public data has already been consumed for training. And if you’ve ever tried to use a public LLM to answer questions about your organization, you’ve probably noticed how limited its knowledge is: it often knows very little about your specific domain or internal processes. Frankly, that’s often a good thing; otherwise, it would be a bit scary!

There’s also another reason why more and more companies are choosing to invest in their own LLMs rather than just relying on public ones.

The authors of "AI Value Creators" nailed it pretty well:


"Imagine that we give you a glass of water (an LLM) and your intent is to add lemon juice and sugar (we’ll consider this your enterprise data) with the goal of making lemonade. If we gave you an opaque glass full of water (an LLM for which you know nothing about the data, and when you ask where did we get the water from, you’re not given any straight answers), would you feel comfortable using it with your fresh lemons and expensive organic cane sugar? Think about it: the glass is opaque, you can’t even see inside it! The water inside that glass could be pure spring water, but it could also be cloudy and murky puddle water, or even contaminated water! If you couldn’t see inside that glass, would you still drink what’s inside it after adding tons of high-quality sugar and lemon to it? Probably not, so why would you do this with one of your company’s most precious assets — your data?"



What are the ways to ensure you're using “pure spring water” with your precious data?


  1. Look for model suppliers who are transparent about the data used to train their LLMs. For example, IBM Granite has published a list of its data sources, and hopefully this will become standard practice in the future.


  2. Fine-tune an open-source model with your data. There are several ways to fine-tune language models. The most direct is full supervised fine-tuning (SFT), where all model parameters are updated. In contrast, parameter-efficient fine-tuning (PEFT) methods update only a small subset of parameters—making them more lightweight and modular. A well-known PEFT method is LoRA (Low-Rank Adaptation), which trains small, external modules that plug into the base model. These modules can be swapped in and out depending on the task.

    Ultimately, the fine-tuning approach you choose will come down to a trade-off between performance and cost. The more parameters you update, the better the model tends to perform—but also the more expensive it is to train and deploy. Fine-tuning is powerful when you want to adapt a model using proprietary data, but it comes with a drawback known as catastrophic forgetting.

    Once a model is fine-tuned on a specific task, it becomes highly specialized: great at that one thing, but it may lose some of its general-purpose capabilities. In practice, this means you’ll need to manage separate fine-tuned versions for each use case or, if you're using something like LoRA, maintain a separate adapter per task.
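Conceptually, LoRA freezes the pretrained weight matrix and trains only a small low-rank update on top of it. The numpy sketch below is a toy illustration of that idea (not the actual implementation from the LoRA paper or any library; the dimensions are made up for the example):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    # Frozen base weight W (d_out x d_in) plus a low-rank update B @ A.
    # A: (r x d_in), B: (d_out x r); only A and B would be trained.
    scaling = alpha / r
    return x @ (W + scaling * (B @ A)).T

d_in, d_out, r = 64, 64, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # pretrained weight, stays frozen
A = rng.standard_normal((r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))                   # trainable; zero-init => no change at start

x = rng.standard_normal((1, d_in))
# With B zero-initialized, the adapted model starts out identical to the base model.
assert np.allclose(lora_forward(x, W, A, B, r=r), x @ W.T)

base_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / base_params:.3%}")  # -> 12.500%
```

Because the adapter is just the pair (A, B), swapping tasks means swapping one small pair of matrices while the frozen base model is shared — which is exactly why maintaining one adapter per use case is cheap.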


  3. InstructLab. InstructLab is an open-source framework developed by IBM and Red Hat to simplify and democratize the fine-tuning of large language models (LLMs). It is built upon the LAB (Large-scale Alignment for chatBots) method, which facilitates efficient and collaborative model enhancement through the following key components:

    • Taxonomy-Driven Data Curation: Contributors define new skills or knowledge areas using a structured YAML format. These definitions are organized within a hierarchical taxonomy, allowing for easy identification of knowledge gaps and systematic expansion of the model's capabilities.

    • Synthetic Data Generation: Based on the curated taxonomy entries, InstructLab generates high-quality synthetic data to train the model. This approach reduces the reliance on large volumes of human-generated data, making the fine-tuning process more efficient and cost-effective.

    • Phased Training and Alignment: The model undergoes a multi-phase training process that integrates the new synthetic data. This method ensures that the model acquires the new skills without overwriting existing knowledge, thereby avoiding catastrophic forgetting.

    • Community-Driven Contributions: InstructLab operates on an open-source model where contributors can submit their enhancements via pull requests. These contributions are reviewed and, upon approval, integrated into the community model, which is periodically updated and shared on platforms like Hugging Face.
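To make the taxonomy idea concrete: a skill contribution is a small qna.yaml file placed at the appropriate node of the taxonomy tree. The sketch below is illustrative only; the field values are invented, and the exact schema (version number, required fields) is defined in the InstructLab taxonomy repository and may differ from this:

```yaml
version: 2
task_description: Answer questions about the company travel expense policy
created_by: your-github-username
seed_examples:
  - question: What is the daily meal allowance for domestic trips?
    answer: The daily meal allowance for domestic trips is 50 EUR.
  - question: Do taxi receipts need to be submitted?
    answer: Yes, receipts are required for all taxi rides above 25 EUR.
```

From a handful of seed examples like these, the synthetic data generation step expands the entry into a much larger training set.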


  4. RAG (Retrieval-Augmented Generation)

RAG is one of the most widely used patterns in enterprise LLM deployments today. It’s a practical way to inject enterprise-specific knowledge into a model without touching its weights.

The core idea is simple: when a user submits a query, it’s used to retrieve relevant information, usually via semantic search over a vector database (though traditional or hybrid setups work too). This retrieved context is then appended to the original query and sent to the LLM. The model answers using both its pre-trained knowledge and the provided context.

RAG is great for scenarios where up-to-date information matters: you can update your data store without retraining the model. But it’s not without trade-offs. RAG setups are systems, not just models: they involve pipelines, data stores, and orchestration. And because the model doesn’t actually learn the new information, the same context has to be re-sent every time, which can drive up inference costs.
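The retrieve-then-augment flow can be sketched end to end in a few lines. In this toy version the vector database is an in-memory list and the embedding model is a bag-of-words similarity, purely to show the mechanics; the documents and query are invented:

```python
import math
import re
from collections import Counter

# Toy in-memory document store; a real deployment would use a vector
# database and a proper embedding model.
DOCS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The VPN must be used when accessing internal dashboards remotely.",
    "Quarterly reports are published on the first Monday of each quarter.",
]

def embed(text):
    # Stand-in for an embedding model: a bag-of-words term-count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    q = embed(query)
    return sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, k=1):
    # The retrieved context is glued onto the query on every single call:
    # the model never learns it, which is RAG's inference-cost trade-off.
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."

print(build_prompt("How many days do customers have to request a refund?"))
```

Updating the knowledge is then just editing the document store — no retraining — which is the property that makes RAG attractive for fast-changing enterprise data.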



As you can see, there’s no one-size-fits-all solution for integrating organizational knowledge into LLMs. Each method — whether it's RAG, fine-tuning, or adapter-based approaches — has its strengths and limitations. The companies that will succeed are the ones combining these techniques thoughtfully, depending on the use case. So yes, start with RAG if it makes sense, but don’t stop there. It’s time to think beyond RAG.


This post was inspired by Chapter 8 of "AI Value Creators". If you liked this post, you will enjoy the book!

 

 
 
 




© 2024 Anastasia Karavdina
All rights reserved.
