RAG Systems Explained: Build Knowledge Assistants That Don't Hallucinate

A practical deep-dive into Retrieval-Augmented Generation — how it works, when to use it, and the engineering decisions that separate production RAG from demo RAG.

Yash Garg
Senior AI Engineer & AI Automation Consultant

Retrieval-Augmented Generation (RAG) is, in my experience, the most practical LLM architecture for enterprise use cases. It targets the two biggest problems with raw language models: hallucination and stale knowledge.

After building RAG systems for hospitals, law firms, and SaaS companies, I've learned what separates a demo that impresses in a pitch from a system that your team actually trusts at 2 AM.

What RAG Is (and Isn't)

RAG is not fine-tuning. Fine-tuning bakes knowledge into the model's weights — expensive, slow to update, and still prone to hallucination. RAG keeps knowledge external, retrieves relevant chunks at query time, and grounds the model's response in your actual documents.

The simplified flow:

  1. User asks a question
  2. The question is converted to a vector (embedding)
  3. A similarity search finds the most relevant document chunks
  4. Those chunks are injected into the LLM prompt as context
  5. The LLM generates a response grounded in that context

The magic is in step 3. When it works well, the model says "I found this in document X, section Y." When it fails, the model confabulates confidently. Everything interesting is in making step 3 work reliably.
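To make the five steps concrete, here is a minimal, dependency-free sketch. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the sample chunks are made up for illustration; only the shape of the pipeline carries over to a real system.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Steps 2 and 3: embed the question, rank chunks by similarity, keep the top k."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context: list[str]) -> str:
    """Step 4: inject the retrieved chunks into the prompt as context."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {question}"

# Made-up sample corpus, for illustration only.
chunks = [
    "The maximum dosage for adults is 40 mg per day.",
    "Visiting hours are 9 AM to 5 PM on weekdays.",
    "Invoices are issued on the first day of each month.",
]
question = "What is the maximum dosage?"           # step 1
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)                # step 5 would send this to an LLM
```

Swapping `embed` for a real embedding model and the in-memory list for a vector store changes the quality of step 3, not the structure of the flow.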

The Components You Actually Need

Document Ingestion Pipeline

Before you can retrieve, you need to ingest. This is more complex than it looks:

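# Requires: langchain-text-splitters, langchain-community, pymupdf, tiktoken
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF; by default the loader returns one Document per page.
loader = PyMuPDFLoader("clinical_protocol.pdf")
documents = loader.load()

# The plain constructor measures chunk_size in characters. To size chunks in
# tokens, use the tiktoken-based constructor. "cl100k_base" is an assumed
# encoding; pick the one that matches your embedding model.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,     # tokens per chunk
    chunk_overlap=64,   # 12.5% overlap between neighbouring chunks
    separators=["\n\n", "\n", ".", " "],  # paragraph, line, sentence, word
)
chunks = splitter.split_documents(documents)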

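The key parameters are chunk_size and chunk_overlap. Too small and you lose context; too large and you retrieve too much noise. For most professional documents I use 400–600 tokens with 10–15% overlap. Note that RecursiveCharacterTextSplitter counts characters by default, so a token budget only holds if the splitter measures length in tokens (for example via its from_tiktoken_encoder constructor).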
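To see how the two parameters interact, here is a simplified sliding-window sketch. It is not what RecursiveCharacterTextSplitter does internally (that splitter also respects separators); the function name and the toy token list are assumptions for illustration.

```python
def sliding_chunks(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

tokens = [f"t{i}" for i in range(10)]
windows = sliding_chunks(tokens, chunk_size=4, chunk_overlap=1)
# windows[0] ends with "t3" and windows[1] starts with "t3": that shared token is the overlap
```

With chunk_size=512 and chunk_overlap=64, each new chunk advances 448 tokens, so roughly one token in eight is stored twice. That is the storage and embedding cost of keeping sentences at chunk boundaries intact.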

The Vector Database

Your vector store is the retrieval engine. The main options:

Database    Best for                       Hosting
Pinecone    Simple, managed, reliable      Managed
Weaviate    Self-hosted, rich filtering    Managed or self-hosted
pgvector    Already using Postgres         Self-hosted
Chroma      Local dev, prototyping         Local

For production systems I default to Pinecone for its simplicity, or pgvector if the client already has Postgres — the operational overhead of a new database is often not worth it.

The Retrieval Step

Basic retrieval: cosine similarity search, return top-K chunks. This works for simple FAQs. For production, you need more:

Hybrid search combines vector similarity (semantic) with keyword search (BM25). Semantic search handles paraphrase; keyword search handles exact terminology. Most enterprise search requires both.
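One common way to merge the two result lists is reciprocal rank fusion (RRF). The article does not say which fusion method to use, so treat this as one reasonable option rather than the method; the chunk IDs are placeholders and k=60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs; each list adds 1 / (k + rank) per chunk."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda c: scores[c], reverse=True)

vector_hits = ["c3", "c1", "c7"]    # placeholder IDs from semantic search
keyword_hits = ["c1", "c9", "c3"]   # placeholder IDs from BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["c1", "c3", "c9", "c7"]: chunks found by both searches rise to the top
```

RRF only needs ranks, not scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales.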

Metadata filtering lets you restrict retrieval to specific document types, dates, or departments before the vector search. For a hospital system, you'd filter by department first: {"department": "cardiology"}.
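A rough sketch of filter-then-rank, assuming each chunk carries a metadata dict and a precomputed similarity score. In a real vector store the filter is pushed into the query itself; the ScoredChunk class and sample data here are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ScoredChunk:
    text: str
    score: float                                   # similarity to the query, assumed precomputed
    metadata: dict = field(default_factory=dict)

def filter_then_rank(chunks: list[ScoredChunk], where: dict, k: int = 3) -> list[ScoredChunk]:
    """Keep only chunks whose metadata matches every key in `where`, then take the top k."""
    matching = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in where.items())
    ]
    return sorted(matching, key=lambda c: c.score, reverse=True)[:k]

candidates = [
    ScoredChunk("Cardiology discharge checklist", 0.71, {"department": "cardiology"}),
    ScoredChunk("Oncology discharge checklist", 0.88, {"department": "oncology"}),
    ScoredChunk("Cardiology intake form", 0.64, {"department": "cardiology"}),
]
hits = filter_then_rank(candidates, {"department": "cardiology"}, k=2)
```

Note that the oncology chunk has the highest similarity score and is still excluded. That is the point: the filter removes plausible-looking but out-of-scope matches before ranking.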

Re-ranking passes retrieved candidates through a cross-encoder that scores query-document relevance more precisely than the initial vector similarity. It adds latency, because every candidate needs its own model pass, but it usually improves precision substantially.
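The re-ranking step can be sketched with the scoring model injected as a function. The toy word-overlap scorer below only exists to keep the example runnable; in a real system `score_pairs` would wrap a cross-encoder (for instance the predict method of a sentence-transformers CrossEncoder), and that wiring is an assumption rather than something specified here.

```python
import re
from typing import Callable, Sequence

Pair = tuple[str, str]

def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[list[Pair]], Sequence[float]],
    top_n: int = 3,
) -> list[str]:
    """Score every (query, candidate) pair and return the top_n candidates by score."""
    pairs = [(query, c) for c in candidates]
    scores = score_pairs(pairs)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]

def toy_scorer(pairs: list[Pair]) -> list[float]:
    """Placeholder scorer: counts shared words. Stands in for a cross-encoder."""
    def words(s: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", s.lower()))
    return [float(len(words(q) & words(d))) for q, d in pairs]

retrieved = [
    "Dosage guidance overview",
    "Maximum dosage for elderly patients is reduced",
    "Patients parking information",
]
best = rerank("maximum dosage for elderly patients", retrieved, toy_scorer, top_n=2)
```

A typical setup retrieves a generous candidate set with the cheap vector search, then re-ranks down to the handful of chunks that go into the prompt.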

The Chunk Size Problem

This is where most RAG demos fail in production.

Small chunks (128 tokens) are precise for retrieval but lose surrounding context. The retriever might return the sentence that answers "what's the maximum dosage?" but miss the sentence before it that says "for patients over 65..."

Large chunks (1024 tokens) preserve context but retrieve too much irrelevant text, degrading generation quality and wasting tokens.

The solution I use in production: hierarchical chunking. Store both summary-level and detail-level chunks. Use the summary for initial retrieval, then fetch the associated detail chunk for generation. More complex, but it solves the context problem at scale.
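A minimal sketch of the two-level idea, assuming each section is stored with both a short summary and its full detail text. The Section class, the word-overlap scoring (a toy stand-in for vector similarity), and the placeholder texts are all hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass
class Section:
    section_id: str
    summary: str   # short text that is embedded and searched
    detail: str    # full text that is handed to the LLM

def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve_detail(question: str, sections: list[Section], k: int = 1) -> list[str]:
    """Rank sections by their summary, then return the linked detail text."""
    ranked = sorted(
        sections,
        key=lambda s: len(_words(question) & _words(s.summary)),  # toy similarity
        reverse=True,
    )
    return [s.detail for s in ranked[:k]]

sections = [
    Section("dosing", "Dosage limits by patient age group", "<full text of the dosing section>"),
    Section("visits", "Visiting hours and visitor policy", "<full text of the visitor section>"),
]
context = retrieve_detail("What are the dosage limits for older patients?", sections)
```

The search runs over short, focused summaries, while generation sees the complete section, so the qualifying sentence that a 128-token chunk would have cut off stays in the context.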

Evaluating Your RAG System

"It seems to be working" is not a production standard. You need systematic evaluation before you ship.

The minimal evaluation set you should build:

  • 50 representative questions that real users will ask
  • The expected answer for each (the ground truth)
  • The source document and section that contains the answer

Then evaluate three things:

  1. Retrieval recall — does the correct chunk appear in your top-K results? If it doesn't appear, the LLM can't answer correctly regardless of how good it is.
  2. Answer faithfulness — does the generated answer only use information from the retrieved context? Hallucination detection.
  3. Answer relevance — does the generated answer actually address the question asked?

Tools like RAGAS automate this evaluation. Use them.
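Retrieval recall is simple enough to compute by hand before reaching for a framework. A minimal sketch, assuming you have recorded the ranked chunk IDs returned for each test question and the ID of the chunk that contains the answer (the IDs below are placeholders):

```python
def recall_at_k(results: dict[str, list[str]], gold: dict[str, str], k: int) -> float:
    """Fraction of questions whose gold chunk appears in the top-k retrieved chunks."""
    if not gold:
        return 0.0
    hits = sum(
        1 for question, gold_id in gold.items()
        if gold_id in results.get(question, [])[:k]
    )
    return hits / len(gold)

# Placeholder evaluation data: question -> gold chunk ID, question -> ranked results.
gold = {"q1": "c1", "q2": "c5", "q3": "c9", "q4": "c2"}
results = {
    "q1": ["c1", "c2"],
    "q2": ["c3", "c5"],
    "q3": ["c4", "c6"],
    "q4": ["c7", "c8", "c2"],
}
recall_2 = recall_at_k(results, gold, k=2)   # 0.5: q1 and q2 hit, q3 and q4 miss
recall_3 = recall_at_k(results, gold, k=3)   # 0.75: q4's gold chunk sits at rank 3
```

Tracking this number across several values of K also shows whether a miss is a ranking problem (the chunk is found, but too low) or a recall problem (the chunk is never found).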

What I Do Differently in Production

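Confidence thresholds: If the similarity score of the top retrieved chunk is below 0.75, don't answer; say "I don't have information about this." The right cutoff depends on the embedding model and similarity metric, so calibrate it on your evaluation set. Users place more trust in a system that admits when it doesn't know than in one that confidently makes things up.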
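The gate itself is a few lines. A sketch under the assumption that retrieval returns (text, score) pairs sorted best-first and that `generate` is whatever function calls your LLM; both names are hypothetical.

```python
from typing import Callable

NO_ANSWER = "I don't have information about this."

def answer_or_abstain(
    hits: list[tuple[str, float]],
    generate: Callable[[list[str]], str],
    min_score: float = 0.75,
) -> str:
    """Call the generator only if the best hit clears the similarity threshold."""
    if not hits or hits[0][1] < min_score:
        return NO_ANSWER                      # abstain without calling the LLM at all
    return generate([text for text, _ in hits])

# Stub generator so the example runs without an LLM.
def fake_generate(context: list[str]) -> str:
    return f"Answer based on {len(context)} chunks"

confident = answer_or_abstain([("Dose guidance text", 0.82), ("Other text", 0.60)], fake_generate)
abstained = answer_or_abstain([("Unrelated text", 0.41)], fake_generate)
```

Because the check happens before generation, an abstention also costs no LLM tokens.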

Source citations: Every response includes the document name, section, and page number. This lets users verify answers and builds trust over time.

Query expansion: Before retrieval, generate 2–3 alternative phrasings of the question and retrieve for all of them. Users don't always phrase questions the way documents are written.
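A sketch of the merge step, assuming a `paraphrase` function (in practice an LLM call) and a `search` function that returns (chunk_id, score) pairs. Both are stubbed with made-up data here, and keeping the best score per chunk is one possible merge rule among several.

```python
from typing import Callable

def retrieve_expanded(
    question: str,
    paraphrase: Callable[[str], list[str]],
    search: Callable[[str], list[tuple[str, float]]],
    k: int = 5,
) -> list[str]:
    """Search with the original question plus its paraphrases; keep each chunk's best score."""
    best: dict[str, float] = {}
    for query in [question, *paraphrase(question)]:
        for chunk_id, score in search(query):
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    return sorted(best, key=lambda c: best[c], reverse=True)[:k]

# Stubs with placeholder data, standing in for an LLM and a vector store.
fake_index = {
    "max dose": [("c1", 0.62)],
    "maximum dosage": [("c1", 0.81), ("c4", 0.70)],
    "upper dosing limit": [("c8", 0.77)],
}
merged = retrieve_expanded(
    "max dose",
    paraphrase=lambda q: ["maximum dosage", "upper dosing limit"],
    search=lambda q: fake_index.get(q, []),
)
# merged == ["c1", "c8", "c4"]: c8 is only reachable through a paraphrase
```

Each extra phrasing costs one more retrieval call, so the 2–3 alternatives suggested above keep the added latency bounded.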

Caching: Identical or near-identical queries are common in enterprise settings. Cache both the retrieved chunks and the final response to reduce latency and cost.
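A minimal in-memory sketch of a response cache. It treats queries as identical after lowercasing and stripping punctuation and extra whitespace, which is a deliberately simple assumption about what "near-identical" means; the class and function names are hypothetical, and the cache would need to be cleared whenever the corpus is re-indexed.

```python
import re
from typing import Callable

def normalize(query: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace to build the cache key."""
    no_punct = re.sub(r"[^\w\s]", "", query.lower())
    return re.sub(r"\s+", " ", no_punct).strip()

class AnswerCache:
    def __init__(self, answer_fn: Callable[[str], str]):
        self._answer_fn = answer_fn          # the full retrieve-and-generate pipeline
        self._store: dict[str, str] = {}
        self.misses = 0

    def get(self, query: str) -> str:
        key = normalize(query)
        if key not in self._store:
            self.misses += 1                 # only cache misses pay for the pipeline
            self._store[key] = self._answer_fn(query)
        return self._store[key]

    def clear(self) -> None:
        """Call after re-indexing so stale answers are not served."""
        self._store.clear()

cache = AnswerCache(lambda q: f"answer to: {normalize(q)}")
first = cache.get("What is the max dose?")
second = cache.get("  what is the MAX dose ")   # served from the cache
```

Matching paraphrased queries would need a semantic cache keyed on embeddings, which is a larger design than this sketch covers.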

When RAG Is the Wrong Choice

RAG is not the solution to every knowledge problem.

If your documents change faster than you can re-index, you'll serve stale results. If your documents are poorly structured or duplicate-heavy, retrieval quality will suffer. If the question requires complex multi-document reasoning across hundreds of chunks, a single RAG query won't get you there — you need agentic retrieval.

RAG works best when: your corpus is stable and well-organised, questions can be answered from a single document section, and you have the engineering capacity to evaluate retrieval quality continuously.

A Note on Prompt Design

The retrieval is only half the problem. The prompt that wraps your context chunks matters enormously:

You are a clinical knowledge assistant. Answer only using the provided context.
If the answer is not in the context, say "I don't have information about this."
Always cite the source document and section.

Context:
{retrieved_chunks}

Question: {user_question}

The instruction "answer only using the provided context" is not magic — models still hallucinate if the context is ambiguous or incomplete. But it significantly reduces hallucination compared to open-ended prompting.
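Filling the template is straightforward, and tagging each chunk with its source is what makes the "always cite" instruction answerable. A sketch assuming each retrieved chunk is a dict with document, section, page, and text keys; those field names and the sample values are hypothetical.

```python
PROMPT_TEMPLATE = """You are a clinical knowledge assistant. Answer only using the provided context.
If the answer is not in the context, say "I don't have information about this."
Always cite the source document and section.

Context:
{retrieved_chunks}

Question: {user_question}"""

def format_chunk(chunk: dict) -> str:
    """Prefix the chunk text with a source tag the model can copy into its citation."""
    return f"[{chunk['document']}, {chunk['section']}, p. {chunk['page']}]\n{chunk['text']}"

def build_prompt(chunks: list[dict], user_question: str) -> str:
    retrieved = "\n\n".join(format_chunk(c) for c in chunks)
    return PROMPT_TEMPLATE.format(retrieved_chunks=retrieved, user_question=user_question)

# Placeholder chunk, for illustration only.
example_chunks = [
    {
        "document": "clinical_protocol.pdf",
        "section": "Section 4.2",
        "page": 12,
        "text": "<text of the retrieved chunk>",
    }
]
prompt = build_prompt(example_chunks, "What is the maximum dosage?")
```

Keeping the source tag next to the text it describes means a citation in the answer can be checked against the exact chunk that was in the prompt.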

Getting Started

If you're evaluating RAG for your business, the fastest path to a useful prototype is:

  1. Pick 20–30 representative documents from your corpus
  2. Build a basic ingestion pipeline with LangChain and Chroma (local, no cost)
  3. Create your 50-question evaluation set
  4. Measure retrieval recall before you do anything else

If retrieval recall is below 80%, no amount of prompt engineering will fix it. Fix retrieval first.

If you want to skip the learning curve and get a production-ready system built in weeks, let's talk.