Microservices for ML Products: Building Fault-Tolerant AI Systems at Scale
How to architect ML-powered products as fault-isolated microservices — covering domain-driven design, circuit breakers, async event streaming with Kafka, and the tradeoffs that matter at 100K+ users.

Monolithic ML systems fail in predictable ways: a latency spike in one component cascades to everything else. The recommendation engine slows down, which backs up the feed service, which times out the frontend, which makes the entire product appear down — even though only one component is struggling.
This post covers the distributed microservices architecture behind E-News 2.0 — a media platform serving 100K+ concurrent users with Live TV, algorithmic personalization, and AI-generated content.
Domain-Driven Service Decomposition
The first principle: decompose by domain, not by layer. Don't create "frontend service," "database service," "ML service." Create services that own entire vertical slices of the business domain. The sketch after the service list below shows how the gateway maps routes onto these domains.
For E-News 2.0, the domains are:
- Feed Service (Golang) — article storage, retrieval, trending computation
- Streaming Service — HLS manifest generation, CDN token management
- Personalization Engine (Python) — user vectors, ranking algorithm, A/B testing
- Breaking News Service — WebSocket connections, Kafka consumer, real-time push
- API Gateway — authentication, rate limiting, routing, circuit breaking
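To make the boundaries concrete, here is a minimal sketch of a gateway route table; the upstream hostnames, ports, and path prefixes are illustrative assumptions, not the production configuration. Each prefix is owned by exactly one domain service, so a failure stays inside that service's boundary:

# Hypothetical route table for the API Gateway (names and ports are examples)
DOMAIN_ROUTES = {
    "/feed":     "http://feed-service:8080",        # Feed Service (Golang)
    "/stream":   "http://streaming-service:8081",   # Streaming Service
    "/rank":     "http://personalization:8090",     # Personalization Engine (Python)
    "/breaking": "http://breaking-news:8095",       # Breaking News Service
}

def resolve_upstream(path: str) -> str | None:
    """Return the domain service that owns this route, or None (404 at the edge)."""
    for prefix, upstream in DOMAIN_ROUTES.items():
        if path.startswith(prefix):
            return upstream
    return None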
The Circuit Breaker Pattern
Without circuit breakers, a slow Personalization Engine causes the entire Feed Service to queue up requests waiting for rankings — cascading failure. The circuit breaker short-circuits this:
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing — reject requests
    HALF_OPEN = "half_open"  # Testing recovery

class CircuitBreaker:
    def __init__(self, max_failures=5, timeout=60, success_threshold=2):
        self.max_failures = max_failures
        self.timeout = timeout
        self.success_threshold = success_threshold
        self.failures = 0
        self.successes = 0
        self.state = CircuitState.CLOSED
        self.last_failure_time = None

    def call(self, func, *args, fallback=None, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                # Timeout elapsed: let trial requests through
                self.state = CircuitState.HALF_OPEN
                self.successes = 0
            else:
                # Circuit is open — return fallback immediately
                return fallback() if fallback else None
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            return fallback() if fallback else None

    def _on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failures = 0
        else:
            # A success while closed clears accumulated failures
            self.failures = 0

    def _on_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        # Any failure while half-open, or too many while closed, opens the circuit
        if self.state == CircuitState.HALF_OPEN or self.failures >= self.max_failures:
            self.state = CircuitState.OPEN

# Usage: personalization with graceful degradation
personalization_cb = CircuitBreaker(max_failures=5, timeout=30)

def get_personalized_feed(user_id: str) -> list:
    return personalization_cb.call(
        personalization_engine.rank,
        user_id,
        fallback=lambda: get_trending_feed()  # Graceful degradation
    )
Kafka for Real-Time Breaking News
When breaking news hits, thousands of users need to be notified simultaneously. Direct WebSocket broadcasts from the news service would create tight coupling. Kafka decouples the event producer from consumers:
from kafka import KafkaProducer, KafkaConsumer
import json

# Producer (News Ingestion Service)
producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode()
)

def publish_breaking_news(article: dict):
    producer.send("breaking-news", {
        "article_id": article["id"],
        "headline": article["headline"],
        "category": article["category"],
        "timestamp": article["published_at"]
    })

# Consumer (WebSocket Push Service)
consumer = KafkaConsumer(
    "breaking-news",
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda m: json.loads(m.decode()),
    group_id="websocket-push-service"
)

async def broadcast_breaking_news():
    # Note: kafka-python's consumer iterator is blocking; in an asyncio service
    # you'd typically use aiokafka or run this loop in a thread executor.
    for message in consumer:
        article = message.value
        # Broadcast to all connected WebSocket clients subscribed to this category
        await websocket_manager.broadcast_to_category(
            category=article["category"],
            payload=article
        )
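The websocket_manager referenced above is never defined in this post. As a rough sketch of what a per-category broadcast could look like, assuming connection objects with an async send() method (the class and method names here are illustrative, not the production code):

import asyncio
import json
from collections import defaultdict

class WebSocketManager:
    """Tracks live connections per news category and fans out pushes."""

    def __init__(self):
        self.connections = defaultdict(set)  # category -> set of websockets

    def subscribe(self, category: str, ws):
        self.connections[category].add(ws)

    def unsubscribe(self, category: str, ws):
        self.connections[category].discard(ws)

    async def broadcast_to_category(self, category: str, payload: dict):
        message = json.dumps(payload)
        sockets = list(self.connections[category])
        # Push concurrently; a dead client shouldn't block the others
        results = await asyncio.gather(
            *(ws.send(message) for ws in sockets),
            return_exceptions=True,
        )
        for ws, result in zip(sockets, results):
            if isinstance(result, Exception):
                self.unsubscribe(category, ws)

websocket_manager = WebSocketManager()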
Deterministic Personalization: Explainability Over Black-Box ML
For a news platform, opaque ML ranking creates echo chambers and erodes trust. We use a deterministic, explainable ranking formula:
from datetime import datetime, timezone

def rank_articles(articles: list[dict], user_prefs: dict) -> list[dict]:
    """
    Weighted ranking: 40% category affinity + 30% bias alignment + 30% recency.
    Fully transparent — users can understand why they see what they see.
    """
    now = datetime.now(timezone.utc)
    for article in articles:
        # Category affinity
        cat_score = user_prefs["category_weights"].get(article["category"], 0)

        # Political bias alignment (0 = far left, 1 = far right, 0.5 = center)
        bias_delta = abs(article["bias_score"] - user_prefs["bias_preference"])
        bias_score = 1.0 - bias_delta

        # Recency decay
        hours_old = (now - article["published_at"]).total_seconds() / 3600
        recency_score = 1.0 / (1.0 + 0.1 * hours_old)

        article["rank_score"] = (
            0.40 * cat_score +
            0.30 * bias_score +
            0.30 * recency_score
        )
    return sorted(articles, key=lambda x: x["rank_score"], reverse=True)
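As a quick sanity check of how the weights interact, here is a worked example with made-up articles and preferences (all numbers below are illustrative): a six-hour-old article in the user's favorite category still outranks a one-hour-old article in a category they rarely read.

from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
user_prefs = {
    "category_weights": {"politics": 0.9, "sports": 0.2},
    "bias_preference": 0.5,  # prefers centrist coverage
}
articles = [
    {"id": "a1", "category": "politics", "bias_score": 0.5,
     "published_at": now - timedelta(hours=6)},   # older, strong category match
    {"id": "a2", "category": "sports", "bias_score": 0.5,
     "published_at": now - timedelta(hours=1)},   # fresh, weak category match
]

ranked = rank_articles(articles, user_prefs)
# a1: 0.40*0.9 + 0.30*1.0 + 0.30*(1/1.6) ≈ 0.85
# a2: 0.40*0.2 + 0.30*1.0 + 0.30*(1/1.1) ≈ 0.65
print([a["id"] for a in ranked])  # ['a1', 'a2']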
Production Outcomes
- 100K+ concurrent users without cascade failures
- Feed service P95 latency: 47ms (99% of requests served from Redis)
- Live TV streams unaffected during personalization engine slowdowns
- Breaking news push latency: <800ms from publication to WebSocket delivery


