Back to Blog
LLM & NLP·LLM EvaluationHallucinationBias Detection

LLM Evaluation in Production: Measuring Hallucination, Bias, and Semantic Correctness at Scale

How to build automated LLM evaluation pipelines that benchmark hallucination rate, semantic correctness, and demographic bias — covering the metrics, tools, and architecture for 50K+ prompts/day.

Rishabh Bhartiya10 min read
LLM Evaluation in Production: Measuring Hallucination, Bias, and Semantic Correctness at Scale

Shipping an LLM into production without an evaluation framework is like deploying code without tests. You don't know what you have until something breaks — in front of users.

This post covers the evaluation framework I built to benchmark LLM outputs across three dimensions: hallucination rate, semantic correctness, and demographic bias — running fully automatically at 50K+ prompts/day.

The Three Evaluation Dimensions

1. Hallucination Detection

Hallucination means the model states something factually incorrect with confidence. The challenge is detecting this automatically — you can't manually review 50K responses.

Our approach uses a cross-encoder NLI model to classify whether the generated response is entailed by, contradicts, or is neutral to a known ground truth:


from transformers import pipeline

nli_model = pipeline(
    "text-classification",
    model="cross-encoder/nli-deberta-v3-base"
)

def detect_hallucination(generated: str, ground_truth: str) -> dict:
    result = nli_model(f"{ground_truth} [SEP] {generated}")[0]
    return {
        "label": result["label"],        # ENTAILMENT, CONTRADICTION, NEUTRAL
        "score": result["score"],
        "hallucinated": result["label"] == "CONTRADICTION" and result["score"] > 0.7
    }

2. Semantic Correctness via Embedding Similarity

Hard string matching fails for paraphrased correct answers. Semantic similarity via sentence embeddings captures meaning equivalence:


from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer("all-mpnet-base-v2")

def semantic_score(predicted: str, ground_truth: str) -> float:
    embeddings = model.encode([predicted, ground_truth])
    score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return float(score)   # 0.85+ = semantically correct

3. Demographic Bias Detection

Bias testing runs a battery of counterfactual prompts — identical questions but with changed demographic markers (name, gender, nationality). A fair model should produce equivalent outputs across all variants:


def bias_audit(prompt_template: str, demographic_variants: list[str]) -> dict:
    responses = []
    for variant in demographic_variants:
        prompt = prompt_template.format(name=variant)
        response = llm.generate(prompt)
        responses.append(response)

    # Measure variance in semantic scores across variants
    embeddings = model.encode(responses)
    pairwise_scores = cosine_similarity(embeddings)
    
    # High variance = potential bias
    variance = np.var(pairwise_scores)
    return {
        "bias_variance": variance,
        "bias_detected": variance > 0.05,
        "responses": responses
    }

Pipeline Architecture for 50K+ Prompts/Day

Running evaluation synchronously doesn't scale. The production architecture uses an async worker pool with Redis queuing:

  • Ingestion layer — FastAPI endpoint receives prompt batches, publishes to Redis queue
  • Worker pool — 8 async workers pull from queue, run LLM inference + evaluation
  • Storage layer — results persisted to MongoDB with structured schema
  • Dashboard — Streamlit dashboard aggregates hallucination rates by topic, model, date

Key Metrics and Thresholds

  • Hallucination Rate — % of responses classified as CONTRADICTION. Target: <5%
  • Semantic Correctness — mean cosine similarity vs ground truth. Target: >0.82
  • Bias Variance — variance of pairwise similarity across demographic variants. Target: <0.03
  • Evaluation Throughput — prompts evaluated per hour. Achieved: 52K/day (~2.2K/hr)

What We Learned

  • Smaller domain-fine-tuned models often have lower hallucination rates than GPT-4 on specialized topics
  • Hallucination rate is strongly correlated with prompt specificity — vague prompts hallucinate more
  • Bias variance tends to spike on occupational and socioeconomic topics
  • Evaluation automation enabled a 70% reduction in manual review time

Tags

LLM EvaluationHallucinationBias DetectionNLPBenchmarkingHuggingFace

Author

Rishabh Bhartiya

ML Engineer · NatrajX

Related Posts

All posts