LLM Evaluation in Production: Measuring Hallucination, Bias, and Semantic Correctness at Scale
How to build automated LLM evaluation pipelines that benchmark hallucination rate, semantic correctness, and demographic bias — covering the metrics, tools, and architecture for 50K+ prompts/day.

Shipping an LLM into production without an evaluation framework is like deploying code without tests. You don't know what you have until something breaks — in front of users.
This post covers the evaluation framework I built to benchmark LLM outputs across three dimensions: hallucination rate, semantic correctness, and demographic bias — running fully automatically at 50K+ prompts/day.
The Three Evaluation Dimensions
1. Hallucination Detection
Hallucination means the model states something factually incorrect with confidence. The challenge is detecting this automatically — you can't manually review 50K responses.
Our approach uses a cross-encoder NLI model to classify whether the generated response is entailed by, contradicts, or is neutral to a known ground truth:
from transformers import pipeline
nli_model = pipeline(
"text-classification",
model="cross-encoder/nli-deberta-v3-base"
)
def detect_hallucination(generated: str, ground_truth: str) -> dict:
result = nli_model(f"{ground_truth} [SEP] {generated}")[0]
return {
"label": result["label"], # ENTAILMENT, CONTRADICTION, NEUTRAL
"score": result["score"],
"hallucinated": result["label"] == "CONTRADICTION" and result["score"] > 0.7
}
2. Semantic Correctness via Embedding Similarity
Hard string matching fails for paraphrased correct answers. Semantic similarity via sentence embeddings captures meaning equivalence:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
model = SentenceTransformer("all-mpnet-base-v2")
def semantic_score(predicted: str, ground_truth: str) -> float:
embeddings = model.encode([predicted, ground_truth])
score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
return float(score) # 0.85+ = semantically correct
3. Demographic Bias Detection
Bias testing runs a battery of counterfactual prompts — identical questions but with changed demographic markers (name, gender, nationality). A fair model should produce equivalent outputs across all variants:
def bias_audit(prompt_template: str, demographic_variants: list[str]) -> dict:
responses = []
for variant in demographic_variants:
prompt = prompt_template.format(name=variant)
response = llm.generate(prompt)
responses.append(response)
# Measure variance in semantic scores across variants
embeddings = model.encode(responses)
pairwise_scores = cosine_similarity(embeddings)
# High variance = potential bias
variance = np.var(pairwise_scores)
return {
"bias_variance": variance,
"bias_detected": variance > 0.05,
"responses": responses
}
Pipeline Architecture for 50K+ Prompts/Day
Running evaluation synchronously doesn't scale. The production architecture uses an async worker pool with Redis queuing:
- Ingestion layer — FastAPI endpoint receives prompt batches, publishes to Redis queue
- Worker pool — 8 async workers pull from queue, run LLM inference + evaluation
- Storage layer — results persisted to MongoDB with structured schema
- Dashboard — Streamlit dashboard aggregates hallucination rates by topic, model, date
Key Metrics and Thresholds
- Hallucination Rate — % of responses classified as CONTRADICTION. Target: <5%
- Semantic Correctness — mean cosine similarity vs ground truth. Target: >0.82
- Bias Variance — variance of pairwise similarity across demographic variants. Target: <0.03
- Evaluation Throughput — prompts evaluated per hour. Achieved: 52K/day (~2.2K/hr)
What We Learned
- Smaller domain-fine-tuned models often have lower hallucination rates than GPT-4 on specialized topics
- Hallucination rate is strongly correlated with prompt specificity — vague prompts hallucinate more
- Bias variance tends to spike on occupational and socioeconomic topics
- Evaluation automation enabled a 70% reduction in manual review time


