LLM & NLP·LLM EvaluationHallucinationBias Detection

LLM Evaluation in Production: Measuring Hallucination, Bias, and Semantic Correctness at Scale

How to build automated LLM evaluation pipelines that benchmark hallucination rate, semantic correctness, and demographic bias — covering the metrics, tools, and architecture for 50K+ prompts/day.

Rishabh BhartiyaFebruary 20, 202610 min read

LLM Evaluation in Production: Measuring Hallucination, Bias, and Semantic Correctness at Scale

Shipping an LLM into production without an evaluation framework is like deploying code without tests. You don't know what you have until something breaks — in front of users.

This post covers the evaluation framework I built to benchmark LLM outputs across three dimensions: hallucination rate, semantic correctness, and demographic bias — running fully automatically at 50K+ prompts/day.

The Three Evaluation Dimensions

1. Hallucination Detection

Hallucination means the model states something factually incorrect with confidence. The challenge is detecting this automatically — you can't manually review 50K responses.

Our approach uses a cross-encoder NLI model to classify whether the generated response is entailed by, contradicts, or is neutral to a known ground truth:


from transformers import pipeline

nli_model = pipeline(
    "text-classification",
    model="cross-encoder/nli-deberta-v3-base"
)

def detect_hallucination(generated: str, ground_truth: str) -> dict:
    result = nli_model(f"{ground_truth} [SEP] {generated}")[0]
    return {
        "label": result["label"],        # ENTAILMENT, CONTRADICTION, NEUTRAL
        "score": result["score"],
        "hallucinated": result["label"] == "CONTRADICTION" and result["score"] > 0.7
    }

2. Semantic Correctness via Embedding Similarity

Hard string matching fails for paraphrased correct answers. Semantic similarity via sentence embeddings captures meaning equivalence:


from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer("all-mpnet-base-v2")

def semantic_score(predicted: str, ground_truth: str) -> float:
    embeddings = model.encode([predicted, ground_truth])
    score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return float(score)   # 0.85+ = semantically correct

3. Demographic Bias Detection

Bias testing runs a battery of counterfactual prompts — identical questions but with changed demographic markers (name, gender, nationality). A fair model should produce equivalent outputs across all variants:


def bias_audit(prompt_template: str, demographic_variants: list[str]) -> dict:
    responses = []
    for variant in demographic_variants:
        prompt = prompt_template.format(name=variant)
        response = llm.generate(prompt)
        responses.append(response)

    # Measure variance in semantic scores across variants
    embeddings = model.encode(responses)
    pairwise_scores = cosine_similarity(embeddings)
    
    # High variance = potential bias
    variance = np.var(pairwise_scores)
    return {
        "bias_variance": variance,
        "bias_detected": variance > 0.05,
        "responses": responses
    }

Pipeline Architecture for 50K+ Prompts/Day

Running evaluation synchronously doesn't scale. The production architecture uses an async worker pool with Redis queuing:

Ingestion layer — FastAPI endpoint receives prompt batches, publishes to Redis queue
Worker pool — 8 async workers pull from queue, run LLM inference + evaluation
Storage layer — results persisted to MongoDB with structured schema
Dashboard — Streamlit dashboard aggregates hallucination rates by topic, model, date

Key Metrics and Thresholds

Hallucination Rate — % of responses classified as CONTRADICTION. Target: <5%
Semantic Correctness — mean cosine similarity vs ground truth. Target: >0.82
Bias Variance — variance of pairwise similarity across demographic variants. Target: <0.03
Evaluation Throughput — prompts evaluated per hour. Achieved: 52K/day (~2.2K/hr)

What We Learned

Smaller domain-fine-tuned models often have lower hallucination rates than GPT-4 on specialized topics
Hallucination rate is strongly correlated with prompt specificity — vague prompts hallucinate more
Bias variance tends to spike on occupational and socioeconomic topics
Evaluation automation enabled a 70% reduction in manual review time

LLM Evaluation in Production: Measuring Hallucination, Bias, and Semantic Correctness at Scale

The Three Evaluation Dimensions

1. Hallucination Detection

2. Semantic Correctness via Embedding Similarity

3. Demographic Bias Detection

Pipeline Architecture for 50K+ Prompts/Day

Key Metrics and Thresholds

What We Learned

Related Posts

Neuro-Symbolic AI: How to Build Reliable LLM Pipelines That Don't Hallucinate

RAG in Production: Building Retrieval-Augmented Generation Systems That Actually Work

Building Production Speech AI Pipelines: TTS & STT from Scratch to Deployment