RAG in Production: Building Retrieval-Augmented Generation Systems That Actually Work
A production engineering guide to RAG — covering chunking strategies, embedding models, vector stores, retrieval quality metrics, and the architecture decisions that separate reliable RAG systems from hallucinating ones.

Retrieval-Augmented Generation (RAG) is the most practical technique for grounding LLMs in domain-specific knowledge without fine-tuning. But the gap between a toy RAG demo and a production RAG system is enormous.
This post covers what I learned building RAG pipelines at Edza.ai for educational content retrieval — including the chunking strategies, retrieval quality metrics, and failure modes that tutorials never mention.
RAG Architecture: The Full Pipeline
- Document Processing — extract, clean, structure raw content
- Chunking — split documents into retrievable units
- Embedding — convert chunks to vector representations
- Indexing — store in a vector database with metadata
- Retrieval — given a query, find the most relevant chunks
- Augmentation — build a context-rich prompt with retrieved chunks
- Generation — LLM generates grounded response
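Stitched together, the stages look roughly like this. A minimal sketch, assuming an embedder with an embed_query method (defined later in this post) plus placeholder vector_store and llm objects standing in for whatever vector database client and LLM SDK you use:
def answer(query: str, embedder, vector_store, llm) -> str:
    """End-to-end RAG: retrieve, augment, generate."""
    # Retrieval: embed the query and pull candidate chunks from the index
    query_vec = embedder.embed_query(query)
    chunks = vector_store.search(query_vec, top_k=5)

    # Augmentation: build a context-rich prompt from the retrieved chunks
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = (
        "Answer ONLY based on the provided context.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # Generation: the LLM produces a grounded response (placeholder call)
    return llm.generate(prompt)
The rest of this post fills in each of these pieces.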
Chunking: The Most Underrated Decision
How you chunk documents determines retrieval quality more than any other factor. Three strategies, each with different tradeoffs:
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    TokenTextSplitter
)

# Strategy 1: Fixed-size with overlap (baseline)
def fixed_chunker(text: str, chunk_size=512, overlap=50) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " "]
    )
    return splitter.split_text(text)

# Strategy 2: Semantic chunking (split at meaning boundaries)
def semantic_chunker(text: str) -> list[str]:
    """Split at paragraph/section boundaries, not arbitrary character counts."""
    # Split on double newlines (paragraph breaks) first
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = ""
    for para in paragraphs:
        if len(current_chunk) + len(para) < 1000:
            current_chunk += para + "\n\n"
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = para + "\n\n"
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

# Strategy 3: Token-based (for models with strict context windows)
def token_chunker(text: str, tokens_per_chunk=256) -> list[str]:
    splitter = TokenTextSplitter(chunk_size=tokens_per_chunk, chunk_overlap=20)
    return splitter.split_text(text)
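A quick sanity check is to run all three on the same document and compare chunk counts and sizes before committing to one; the file name below is just illustrative:
with open("lesson_01.md") as f:  # illustrative sample document
    doc = f.read()

for name, chunks in [("fixed", fixed_chunker(doc)),
                     ("semantic", semantic_chunker(doc)),
                     ("token", token_chunker(doc))]:
    avg = sum(len(c) for c in chunks) / max(len(chunks), 1)
    print(f"{name}: {len(chunks)} chunks, avg {avg:.0f} chars")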
Embedding Model Selection
Not all embedding models are equal for retrieval. The MTEB leaderboard is the standard reference point, but performance on your specific domain can vary:
- all-mpnet-base-v2 — best general-purpose, 768 dims, ~420MB
- all-MiniLM-L6-v2 — fastest, 384 dims, ~80MB — good for real-time
- text-embedding-3-large (OpenAI) — best absolute quality, API cost
- instructor-xl — task-aware embeddings, best for specialized domains
from sentence_transformers import SentenceTransformer
import numpy as np

class EmbeddingEngine:
    def __init__(self, model_name="all-mpnet-base-v2"):
        self.model = SentenceTransformer(model_name)

    def embed_chunks(self, chunks: list[str]) -> np.ndarray:
        return self.model.encode(
            chunks,
            batch_size=64,
            show_progress_bar=True,
            normalize_embeddings=True  # Normalize for cosine similarity
        )

    def embed_query(self, query: str) -> np.ndarray:
        return self.model.encode([query], normalize_embeddings=True)[0]
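The indexing step is the one stage not shown above. A minimal sketch using FAISS as the vector store (any vector database with metadata support works); the FaissStore wrapper and its interface are illustrative, with metadata kept in a parallel list so search hits map back to their chunks:
import faiss
import numpy as np

class FaissStore:
    """Minimal in-memory index; production setups add persistence and metadata filtering."""

    def __init__(self, dim: int):
        # Inner product over normalized embeddings == cosine similarity
        self.index = faiss.IndexFlatIP(dim)
        self.metadata: list[dict] = []

    def add(self, embeddings: np.ndarray, metadatas: list[dict]) -> None:
        self.index.add(embeddings.astype(np.float32))
        self.metadata.extend(metadatas)

    def search(self, query_vec: np.ndarray, top_k: int = 5) -> list[dict]:
        scores, ids = self.index.search(
            query_vec.reshape(1, -1).astype(np.float32), top_k
        )
        return [{**self.metadata[i], "score": float(s)}
                for i, s in zip(ids[0], scores[0]) if i != -1]
Wiring it up: store = FaissStore(dim=768) for all-mpnet-base-v2, then store.add(engine.embed_chunks(chunks), metadatas), where each metadata dict carries at least an id and the chunk text.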
Retrieval Quality: The Metrics That Matter
def evaluate_retrieval(queries: list[str],
                       expected_doc_ids: list[str],
                       retriever) -> dict:
    """
    Recall@K: of all relevant docs, what fraction did we retrieve?
    MRR: how high did the first relevant doc rank?
    """
    recall_scores = []
    mrr_scores = []
    for query, expected_id in zip(queries, expected_doc_ids):
        retrieved = retriever.search(query, top_k=5)
        retrieved_ids = [r["id"] for r in retrieved]
        # Recall@5
        recall = 1 if expected_id in retrieved_ids else 0
        recall_scores.append(recall)
        # MRR
        if expected_id in retrieved_ids:
            rank = retrieved_ids.index(expected_id) + 1
            mrr_scores.append(1.0 / rank)
        else:
            mrr_scores.append(0)
    return {
        "recall_at_5": np.mean(recall_scores),
        "mrr": np.mean(mrr_scores)
    }
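Usage, assuming a small hand-labeled set of query-to-chunk-ID pairs and a retriever that wraps the embedder and vector store (the IDs below are illustrative); note the function assumes exactly one relevant chunk per query:
golden_set = [
    ("What is the quadratic formula?", "algebra_ch2_chunk_14"),  # illustrative IDs
    ("Define photosynthesis", "bio_ch1_chunk_03"),
]
queries = [q for q, _ in golden_set]
expected_ids = [doc_id for _, doc_id in golden_set]

metrics = evaluate_retrieval(queries, expected_ids, retriever)
print(metrics)  # {"recall_at_5": ..., "mrr": ...}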
The 3 RAG Failure Modes (and How to Fix Them)
1. Retrieval returns irrelevant chunks
Fix: Improve chunking (semantic over fixed-size), add metadata filtering, and use hybrid search (dense + BM25); a hybrid-search sketch follows at the end of this section
2. Correct chunk retrieved but answer still hallucinated
Fix: Add an explicit grounding instruction: "Answer ONLY based on the provided context. If the answer is not in the context, say so." A prompt-assembly sketch follows at the end of this section
3. Context window exceeded with too many chunks
Fix: Implement a re-ranking step — retrieve 20, re-rank with a cross-encoder, keep top 3
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_k=3) -> list[dict]:
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]
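For failure mode 1, a minimal hybrid-search sketch: fuse dense scores from the vector store with BM25 lexical scores. The rank_bm25 package, the whitespace tokenization, and the 50/50 weighting are all assumptions to tune on your own data:
from rank_bm25 import BM25Okapi

def hybrid_search(query: str, embedder, store, corpus: list[dict],
                  top_k=5, alpha=0.5) -> list[dict]:
    # Dense scores: cosine similarity from the vector store
    dense = {c["id"]: c["score"]
             for c in store.search(embedder.embed_query(query), top_k=50)}

    # Sparse scores: BM25 over the same corpus (whitespace tokenization is a simplification)
    bm25 = BM25Okapi([c["text"].lower().split() for c in corpus])
    sparse = bm25.get_scores(query.lower().split())
    sparse = sparse / (sparse.max() + 1e-9)  # scale to roughly [0, 1]

    # Weighted fusion of both signals
    fused = sorted(
        ((alpha * dense.get(c["id"], 0.0) + (1 - alpha) * float(sparse[i]), c)
         for i, c in enumerate(corpus)),
        key=lambda x: x[0], reverse=True
    )
    return [c for _, c in fused[:top_k]]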
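And for failure mode 2, the grounding instruction in prompt-assembly form; the exact wording is a starting point rather than anything proven, so tune it for your model:
def build_grounded_prompt(query: str, chunks: list[dict]) -> str:
    # Number the chunks so the model can point to the one it used
    context = "\n\n".join(f"[{i + 1}] {c['text']}" for i, c in enumerate(chunks))
    return (
        "Answer ONLY based on the provided context. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )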


