Audio AI · TTS · STT · Speech AI

Building Production Speech AI Pipelines: TTS & STT from Scratch to Deployment

A complete engineering guide to building production-grade Text-to-Speech and Speech-to-Text pipelines — covering model selection, MFCC evaluation, latency optimization, and FastAPI deployment.

Rishabh Bhartiya · 11 min read

Speech AI is no longer research-only. TTS and STT systems are now core infrastructure in EdTech, accessibility tools, voice assistants, and content generation. But most tutorials stop at "run the model." Production deployment is an entirely different engineering challenge.

This post covers what I learned building production Speech AI pipelines at Edza.ai and HacktivSpace — covering architecture, evaluation, and the latency traps that kill real-time systems.

Architecture Overview: The 5-Stage Pipeline

Every production Speech AI pipeline has five distinct stages that must be independently optimized:

  1. Preprocessing — audio normalization, noise reduction, sample rate standardization (sketched after this list)
  2. Feature Extraction — MFCC, mel spectrograms, or raw waveform (model-dependent)
  3. Model Inference — TTS synthesis or STT transcription
  4. Post-processing — alignment, punctuation restoration, confidence filtering
  5. Delivery — streaming vs batch, format conversion, caching
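
To make stage 1 concrete, here is a minimal preprocessing sketch using librosa. The target sample rate, trim threshold, and peak normalization are illustrative choices, not the only valid ones:

import librosa
import numpy as np

def preprocess(path: str, target_sr: int = 22050) -> np.ndarray:
    """Load, resample, trim silence, and peak-normalize audio."""
    audio, _ = librosa.load(path, sr=target_sr, mono=True)  # resamples on load
    audio, _ = librosa.effects.trim(audio, top_db=30)       # strip leading/trailing silence
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio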

TTS: Choosing the Right Model for Production

The three dominant paradigms for production TTS are:

  • Concatenative / unit-selection synthesis (Festival, MaryTTS) — fast, robotic-sounding, no GPU needed
  • Two-stage neural (Tacotron 2 + WaveGlow) — natural-sounding, requires a GPU, higher latency
  • End-to-end neural (VITS, Bark, XTTS) — highest quality, most complex deployment

For educational content at Edza.ai, we used VITS with a custom fine-tuned voice on domain-specific vocabulary (physics equations, chemical names). The key insight: a smaller domain-tuned model outperforms a large general model for specialized vocabulary.
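
As an illustration (not necessarily the exact production setup), loading a pretrained VITS voice through the open-source Coqui TTS library looks roughly like this; the model name is one of Coqui's published checkpoints, and domain fine-tuning would start from such a base:

from TTS.api import TTS

# Pretrained single-speaker English VITS checkpoint from Coqui's model zoo
tts = TTS(model_name="tts_models/en/ljspeech/vits")
tts.tts_to_file(text="The enthalpy of vaporization of water is 40.7 kJ/mol.",
                file_path="out.wav")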

MFCC-Based Evaluation: Measuring Voice Quality Objectively

MOS (Mean Opinion Score) requires human raters — expensive and slow for CI/CD pipelines. We built an automated MFCC-based evaluation pipeline that runs on every deployment:


import librosa
import numpy as np
from scipy.spatial.distance import cosine

def evaluate_tts_quality(reference_audio: np.ndarray,
                         generated_audio: np.ndarray,
                         sr: int = 22050) -> dict:
    """
    Compare TTS output against reference using MFCC cosine similarity.
    Score > 0.85 is production-acceptable.
    """
    ref_mfcc = librosa.feature.mfcc(y=reference_audio, sr=sr, n_mfcc=40)
    gen_mfcc = librosa.feature.mfcc(y=generated_audio, sr=sr, n_mfcc=40)

    # Mean MFCC vectors for global comparison
    ref_mean = np.mean(ref_mfcc, axis=1)
    gen_mean = np.mean(gen_mfcc, axis=1)

    similarity = 1 - cosine(ref_mean, gen_mean)
    return {
        "mfcc_similarity": similarity,
        "production_ready": similarity > 0.85
    }
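
One caveat: mean-pooling the MFCC frames discards timing, so this score catches timbre drift but not pacing or alignment problems. If pacing matters, a natural extension is a DTW distance over the full MFCC sequences (librosa.sequence.dtw).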

Latency Optimization: The 3 Biggest Traps

Trap 1: Loading the Model on Every Request

Never load a model inside a request handler. Load once at startup and share the instance. With FastAPI, use the lifespan context manager:


from contextlib import asynccontextmanager
from fastapi import FastAPI

ml_models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load once at startup; load_tts_model stands in for your
    # framework's checkpoint loader
    ml_models["tts"] = load_tts_model("path/to/checkpoint")
    yield
    ml_models.clear()  # release references on shutdown

app = FastAPI(lifespan=lifespan)
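
A useful companion to this pattern is running one warm-up synthesis inside lifespan before the yield, so the first real request doesn't pay for lazy initialization (CUDA context creation, kernel compilation).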

Trap 2: Synchronous Inference Blocking the Event Loop

TTS inference is CPU/GPU-bound. Running it synchronously in an async FastAPI route blocks all other requests. Use run_in_executor:


import asyncio
from fastapi.responses import StreamingResponse

@app.post("/synthesize")
async def synthesize(text: str):
    loop = asyncio.get_running_loop()
    # Off-load blocking inference to a thread pool so the event loop
    # stays free to accept other requests
    audio = await loop.run_in_executor(
        None,
        lambda: ml_models["tts"].synthesize(text)
    )
    # audio_to_bytes stands in for your WAV-encoding helper
    return StreamingResponse(audio_to_bytes(audio), media_type="audio/wav")
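
Note that run_in_executor(None, ...) uses the default thread pool. For a GPU model that isn't safe to call concurrently, a dedicated ThreadPoolExecutor(max_workers=1) serializes access without blocking the loop.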

Trap 3: Re-synthesizing Cached Content

In educational content, the same sentences are requested thousands of times. Cache synthesis results with a hash of the input text + voice parameters:


import hashlib, redis

cache = redis.Redis()

def get_cache_key(text: str, voice_id: str, speed: float) -> str:
    payload = f"{text}:{voice_id}:{speed}"
    return hashlib.sha256(payload.encode()).hexdigest()
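
Putting the key to work, a minimal read-through cache around synthesis might look like this (the synthesize and audio_to_bytes calls are the placeholders from earlier, and the one-week TTL is illustrative):

def synthesize_cached(text: str, voice_id: str, speed: float) -> bytes:
    key = get_cache_key(text, voice_id, speed)
    cached = cache.get(key)
    if cached is not None:
        return cached  # WAV bytes from a previous synthesis
    # A real synthesizer would also take voice_id and speed
    audio_bytes = audio_to_bytes(ml_models["tts"].synthesize(text))
    cache.set(key, audio_bytes, ex=7 * 24 * 3600)  # expire after one week
    return audio_bytes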

Real-Time STT: Streaming Transcription Architecture

For real-time STT, the architecture shifts from batch to streaming. We use a sliding window approach with Whisper or wav2vec 2.0:

  • Chunk audio into 1-second windows with 0.5s overlap (sketched after this list)
  • Run inference on each chunk asynchronously
  • Use a language model to merge and correct chunk-boundary transcription errors
  • Return partial results via WebSocket for real-time display
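
As a sketch of the chunking step, assuming 16 kHz mono float32 input (window and overlap match the bullets above):

import numpy as np

def sliding_windows(audio: np.ndarray, sr: int = 16000,
                    window_s: float = 1.0, overlap_s: float = 0.5):
    """Yield overlapping audio chunks for streaming STT inference."""
    window = int(window_s * sr)
    hop = int((window_s - overlap_s) * sr)
    for start in range(0, max(len(audio) - window + 1, 1), hop):
        yield audio[start:start + window]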

Production Metrics to Track

  • Word Error Rate (WER) — primary STT quality metric, target <5% for clear speech (computed in the sketch after this list)
  • MFCC Similarity — automated TTS quality, target >0.85
  • Time-to-First-Audio (TTFA) — user-perceived latency, target <300ms
  • Real-Time Factor (RTF) — inference time / audio duration, target <0.3 for real-time
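
Both STT metrics are cheap to compute automatically. A minimal sketch, assuming the third-party jiwer package for WER:

import jiwer

def stt_metrics(reference: str, hypothesis: str,
                inference_time_s: float, audio_duration_s: float) -> dict:
    return {
        "wer": jiwer.wer(reference, hypothesis),     # target < 0.05
        "rtf": inference_time_s / audio_duration_s,  # target < 0.3
    }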

Tags

TTS · STT · Speech AI · Audio Processing · FastAPI · Librosa

Author

Rishabh Bhartiya

ML Engineer · NatrajX
