RAG in Production: Building Retrieval-Augmented Generation Systems That Actually Work

A production engineering guide to RAG — covering chunking strategies, embedding models, vector stores, retrieval quality metrics, and the architecture decisions that separate reliable RAG systems from hallucinating ones.

Rishabh Bhartiya · 10 min read

Retrieval-Augmented Generation (RAG) is the most practical technique for grounding LLMs in domain-specific knowledge without fine-tuning. But the gap between a toy RAG demo and a production RAG system is enormous.

This post covers what I learned building RAG pipelines at Edza.ai for educational content retrieval — including the chunking strategies, retrieval quality metrics, and failure modes that tutorials never mention.

RAG Architecture: The Full Pipeline

  1. Document Processing — extract, clean, structure raw content
  2. Chunking — split documents into retrievable units
  3. Embedding — convert chunks to vector representations
  4. Indexing — store in a vector database with metadata
  5. Retrieval — given a query, find the most relevant chunks
  6. Augmentation — build a context-rich prompt with retrieved chunks
  7. Generation — LLM generates grounded response
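
Steps 1-4 run offline at index time; steps 5-7 run per query. Here is a minimal sketch of the online path, assuming hypothetical retriever.search() and llm.generate() interfaces:

def answer(query: str, retriever, llm, top_k: int = 5) -> str:
    # Retrieval: find the most relevant chunks for the query
    chunks = retriever.search(query, top_k=top_k)

    # Augmentation: build a context-rich prompt from the retrieved chunks
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    # Generation: the LLM answers grounded in the retrieved context
    return llm.generate(prompt)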

Chunking: The Most Underrated Decision

How you chunk documents determines retrieval quality more than any other factor. Three strategies, each with different tradeoffs:


from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    TokenTextSplitter
)

# Strategy 1: Fixed-size with overlap (baseline)
def fixed_chunker(text: str, chunk_size=512, overlap=50) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["

", "
", ". ", " "]
    )
    return splitter.split_text(text)

# Strategy 2: Semantic chunking (split at meaning boundaries)
def semantic_chunker(text: str) -> list[str]:
    """Split at paragraph/section boundaries, not arbitrary character counts."""
    # Split on double newlines (paragraph breaks) first
    paragraphs = text.split("

")
    chunks = []
    current_chunk = ""

    for para in paragraphs:
        if len(current_chunk) + len(para) < 1000:
            current_chunk += para + "

"
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = para + "

"

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

# Strategy 3: Token-based (for models with strict context windows)
def token_chunker(text: str, tokens_per_chunk=256) -> list[str]:
    splitter = TokenTextSplitter(chunk_size=tokens_per_chunk, chunk_overlap=20)
    return splitter.split_text(text)

Embedding Model Selection

Not all embedding models are equal for retrieval. The MTEB leaderboard is the reference, but domain-specific performance varies:

  • all-mpnet-base-v2 — best general-purpose, 768 dims, ~420MB
  • all-MiniLM-L6-v2 — fastest, 384 dims, ~80MB — good for real-time
  • text-embedding-3-large (OpenAI) — best absolute quality, API cost
  • instructor-xl — task-aware embeddings, best for specialized domains

from sentence_transformers import SentenceTransformer
import numpy as np

class EmbeddingEngine:
    def __init__(self, model_name="all-mpnet-base-v2"):
        self.model = SentenceTransformer(model_name)

    def embed_chunks(self, chunks: list[str]) -> np.ndarray:
        return self.model.encode(
            chunks,
            batch_size=64,
            show_progress_bar=True,
            normalize_embeddings=True  # Normalize for cosine similarity
        )

    def embed_query(self, query: str) -> np.ndarray:
        return self.model.encode([query], normalize_embeddings=True)[0]
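
A quick usage sketch, assuming chunks came from one of the chunkers above. Because the embeddings are normalized, cosine similarity reduces to a dot product:

engine = EmbeddingEngine()
doc_vectors = engine.embed_chunks(chunks)        # shape: (n_chunks, 768)
query_vector = engine.embed_query("what is spaced repetition?")

# With normalized vectors, cosine similarity is just a dot product
scores = doc_vectors @ query_vector
top_5_idx = scores.argsort()[::-1][:5]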

Retrieval Quality: The Metrics That Matter


import numpy as np

def evaluate_retrieval(queries: list[str],
                       expected_doc_ids: list[str],
                       retriever) -> dict:
    """
    Recall@K: of all relevant docs, what fraction did we retrieve?
    MRR: how high did the first relevant doc rank?
    """
    recall_scores = []
    mrr_scores = []

    for query, expected_id in zip(queries, expected_doc_ids):
        retrieved = retriever.search(query, top_k=5)
        retrieved_ids = [r["id"] for r in retrieved]

        # Recall@5
        recall = 1 if expected_id in retrieved_ids else 0
        recall_scores.append(recall)

        # MRR
        if expected_id in retrieved_ids:
            rank = retrieved_ids.index(expected_id) + 1
            mrr_scores.append(1.0 / rank)
        else:
            mrr_scores.append(0)

    return {
        "recall_at_5": np.mean(recall_scores),
        "mrr": np.mean(mrr_scores)
    }
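
Given a small labeled set of (query, expected document ID) pairs, usage looks like this — a sketch, assuming the retriever exposes the same search() interface as above, and with hypothetical IDs:

eval_queries = ["what is a derivative?", "explain photosynthesis"]
eval_expected_ids = ["calc-doc-017", "bio-doc-042"]  # hypothetical ground-truth IDs

metrics = evaluate_retrieval(eval_queries, eval_expected_ids, retriever)
print(metrics)  # {"recall_at_5": ..., "mrr": ...}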

The 3 RAG Failure Modes (and How to Fix Them)

1. Retrieval returns irrelevant chunks

Fix: Improve chunking (semantic over fixed-size), add metadata filtering, use hybrid search (dense + BM25)
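
For the hybrid search part, here is a minimal sketch that blends dense similarity with BM25, assuming the rank_bm25 package and a hypothetical dense_scores(query, chunks) helper that returns one cosine score per chunk:

from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_search(query: str, chunks: list[str], dense_scores, alpha=0.5, top_k=5):
    # Sparse scores: BM25 over whitespace-tokenized chunks
    bm25 = BM25Okapi([c.split() for c in chunks])
    sparse = np.array(bm25.get_scores(query.split()))

    # Dense scores: cosine similarities from the embedding model (hypothetical helper)
    dense = np.array(dense_scores(query, chunks))

    # Min-max normalize both score sets so the scales are comparable, then blend
    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)
    blended = alpha * norm(dense) + (1 - alpha) * norm(sparse)

    best = np.argsort(blended)[::-1][:top_k]
    return [chunks[i] for i in best]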

2. Correct chunk retrieved but answer still hallucinated

Fix: Add an explicit grounding instruction: "Answer ONLY based on the provided context. If the answer is not in the context, say so."
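
A prompt builder that bakes in that instruction, as a sketch (the exact wording can be tuned per model):

def build_grounded_prompt(query: str, chunks: list[str]) -> str:
    # Number each retrieved chunk so the model can cite its source
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer ONLY based on the provided context. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )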

3. Context window exceeded with too many chunks

Fix: Implement a re-ranking step — retrieve 20, re-rank with a cross-encoder, keep top 3


from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_k=3) -> list[dict]:
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]
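
Wiring the re-ranker into the pipeline, again assuming the hypothetical retriever.search() interface from earlier:

candidates = retriever.search(query, top_k=20)    # over-retrieve
top_chunks = rerank(query, candidates, top_k=3)   # keep only the best 3 for the prompt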
