Microservices for ML Products: Building Fault-Tolerant AI Systems at Scale
How to architect ML-powered products as fault-isolated microservices — covering domain-driven design, circuit breakers, async event streaming with Kafka, and the tradeoffs that matter at 100K+ users.

Monolithic ML systems fail in predictable ways: a latency spike in one component cascades to everything else. The recommendation engine slows down, which backs up the feed service, which times out the frontend, which makes the entire product appear down — even though only one component is struggling.
This post covers the distributed microservices architecture behind E-News 2.0 — a media platform serving 100K+ concurrent users with Live TV, algorithmic personalization, and AI-generated content.
Domain-Driven Service Decomposition
The first principle: decompose by domain, not by layer. Don't create "frontend service," "database service," "ML service." Create services that own entire vertical slices of the business domain. The sketch after the service list below shows how the gateway maps routes onto these domains.
For E-News 2.0, the domains are:
- Feed Service (Golang) — article storage, retrieval, trending computation
- Streaming Service — HLS manifest generation, CDN token management
- Personalization Engine (Python) — user vectors, ranking algorithm, A/B testing
- Breaking News Service — WebSocket connections, Kafka consumer, real-time push
- API Gateway — authentication, rate limiting, routing, circuit breaking
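To make the boundaries concrete, here is a minimal sketch of a gateway route table; the upstream hostnames, ports, and path prefixes are illustrative assumptions, not the production configuration. Each prefix is owned by exactly one domain service, so a failure stays inside that service's boundary:

# Hypothetical route table for the API Gateway (names and ports are examples)
DOMAIN_ROUTES = {
    "/feed":     "http://feed-service:8080",        # Feed Service (Golang)
    "/stream":   "http://streaming-service:8081",   # Streaming Service
    "/rank":     "http://personalization:8090",     # Personalization Engine (Python)
    "/breaking": "http://breaking-news:8095",       # Breaking News Service
}

def resolve_upstream(path: str) -> str | None:
    """Return the domain service that owns this route, or None (404 at the edge)."""
    for prefix, upstream in DOMAIN_ROUTES.items():
        if path.startswith(prefix):
            return upstream
    return None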
The Circuit Breaker Pattern
Without circuit breakers, a slow Personalization Engine causes the entire Feed Service to queue up requests waiting for rankings — cascading failure. The circuit breaker short-circuits this:
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing — reject requests
    HALF_OPEN = "half_open"  # Testing recovery

class CircuitBreaker:
    def __init__(self, max_failures=5, timeout=60, success_threshold=2):
        self.max_failures = max_failures
        self.timeout = timeout
        self.success_threshold = success_threshold
        self.failures = 0
        self.successes = 0
        self.state = CircuitState.CLOSED
        self.last_failure_time = None

    def call(self, func, *args, fallback=None, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                # Timeout elapsed: let trial requests through
                self.state = CircuitState.HALF_OPEN
                self.successes = 0
            else:
                # Circuit is open — return fallback immediately
                return fallback() if fallback else None
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            return fallback() if fallback else None

    def _on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failures = 0
        else:
            # A success while closed clears accumulated failures
            self.failures = 0

    def _on_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        # Any failure while half-open, or too many while closed, opens the circuit
        if self.state == CircuitState.HALF_OPEN or self.failures >= self.max_failures:
            self.state = CircuitState.OPEN

# Usage: personalization with graceful degradation
personalization_cb = CircuitBreaker(max_failures=5, timeout=30)

def get_personalized_feed(user_id: str) -> list:
    return personalization_cb.call(
        personalization_engine.rank,
        user_id,
        fallback=lambda: get_trending_feed()  # Graceful degradation
    )
Kafka for Real-Time Breaking News
When breaking news hits, thousands of users need to be notified simultaneously. Direct WebSocket broadcasts from the news service would create tight coupling. Kafka decouples the event producer from consumers:
from kafka import KafkaProducer, KafkaConsumer
import json

# Producer (News Ingestion Service)
producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode()
)

def publish_breaking_news(article: dict):
    producer.send("breaking-news", {
        "article_id": article["id"],
        "headline": article["headline"],
        "category": article["category"],
        "timestamp": article["published_at"]
    })

# Consumer (WebSocket Push Service)
consumer = KafkaConsumer(
    "breaking-news",
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda m: json.loads(m.decode()),
    group_id="websocket-push-service"
)

async def broadcast_breaking_news():
    # Note: kafka-python's consumer iterator is blocking; in an asyncio service
    # you'd typically use aiokafka or run this loop in a thread executor.
    for message in consumer:
        article = message.value
        # Broadcast to all connected WebSocket clients subscribed to this category
        await websocket_manager.broadcast_to_category(
            category=article["category"],
            payload=article
        )
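The websocket_manager referenced above is never defined in this post. As a rough sketch of what a per-category broadcast could look like, assuming connection objects with an async send() method (the class and method names here are illustrative, not the production code):

import asyncio
import json
from collections import defaultdict

class WebSocketManager:
    """Tracks live connections per news category and fans out pushes."""

    def __init__(self):
        self.connections = defaultdict(set)  # category -> set of websockets

    def subscribe(self, category: str, ws):
        self.connections[category].add(ws)

    def unsubscribe(self, category: str, ws):
        self.connections[category].discard(ws)

    async def broadcast_to_category(self, category: str, payload: dict):
        message = json.dumps(payload)
        sockets = list(self.connections[category])
        # Push concurrently; a dead client shouldn't block the others
        results = await asyncio.gather(
            *(ws.send(message) for ws in sockets),
            return_exceptions=True,
        )
        for ws, result in zip(sockets, results):
            if isinstance(result, Exception):
                self.unsubscribe(category, ws)

websocket_manager = WebSocketManager()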
Deterministic Personalization: Explainability Over Black-Box ML
For a news platform, opaque ML ranking creates echo chambers and erodes trust. We use a deterministic, explainable ranking formula:
from datetime import datetime, timezone

def rank_articles(articles: list[dict], user_prefs: dict) -> list[dict]:
    """
    Weighted ranking: 40% category affinity + 30% bias alignment + 30% recency.
    Fully transparent — users can understand why they see what they see.
    """
    now = datetime.now(timezone.utc)
    for article in articles:
        # Category affinity
        cat_score = user_prefs["category_weights"].get(article["category"], 0)

        # Political bias alignment (0 = far left, 1 = far right, 0.5 = center)
        bias_delta = abs(article["bias_score"] - user_prefs["bias_preference"])
        bias_score = 1.0 - bias_delta

        # Recency decay
        hours_old = (now - article["published_at"]).total_seconds() / 3600
        recency_score = 1.0 / (1.0 + 0.1 * hours_old)

        article["rank_score"] = (
            0.40 * cat_score +
            0.30 * bias_score +
            0.30 * recency_score
        )
    return sorted(articles, key=lambda x: x["rank_score"], reverse=True)
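As a quick sanity check of how the weights interact, here is a worked example with made-up articles and preferences (all numbers below are illustrative): a six-hour-old article in the user's favorite category still outranks a one-hour-old article in a category they rarely read.

from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
user_prefs = {
    "category_weights": {"politics": 0.9, "sports": 0.2},
    "bias_preference": 0.5,  # prefers centrist coverage
}
articles = [
    {"id": "a1", "category": "politics", "bias_score": 0.5,
     "published_at": now - timedelta(hours=6)},   # older, strong category match
    {"id": "a2", "category": "sports", "bias_score": 0.5,
     "published_at": now - timedelta(hours=1)},   # fresh, weak category match
]

ranked = rank_articles(articles, user_prefs)
# a1: 0.40*0.9 + 0.30*1.0 + 0.30*(1/1.6) ≈ 0.85
# a2: 0.40*0.2 + 0.30*1.0 + 0.30*(1/1.1) ≈ 0.65
print([a["id"] for a in ranked])  # ['a1', 'a2']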
Production Outcomes
- 100K+ concurrent users without cascade failures
- Feed service P95 latency: 47ms (99% of requests served from Redis)
- Live TV streams unaffected during personalization engine slowdowns
- Breaking news push latency: <800ms from publication to WebSocket delivery


