Distributed Systems · March 11, 2026 · 10 min read

Lessons from Operating Observability at Scale

Reflections from building and operating observability infrastructure at a cloud-native monitoring platform, covering what we got right, what surprised us, and the patterns that hold up across any monitoring system.

I spent two years working on a cloud-native observability platform that monitored thousands of services across some of the world's largest enterprises. The experience fundamentally changed how I think about building systems. Here are the lessons that stuck.

Your Monitoring System Is a Distributed System

This sounds obvious, but it has deep implications. The system you use to detect failures is itself subject to all the same failure modes: network partitions, clock skew, data loss, cascading failures. We had incidents where our alerting pipeline went down at the exact moment a customer's system was failing, which is precisely when they needed us most.

The solution was aggressive redundancy and independent failure domains. Our metric ingestion pipeline ran across multiple availability zones with independent processing paths. If one zone went dark, the others continued ingesting and alerting. We treated our own SLOs with the same rigor our customers expected us to provide for theirs.
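The availability-first ingestion described above can be sketched as a fan-out write: every zone's pipeline receives the sample independently, and ingestion succeeds as long as at least one zone acknowledges. The zone pipeline interface here is illustrative, not the platform's real API.

```javascript
// Fan out a sample to independent per-zone pipelines. A single zone
// outage cannot block ingestion or alerting, because success only
// requires one zone to acknowledge the write.
async function ingest(sample, zonePipelines) {
  const results = await Promise.allSettled(
    zonePipelines.map((zone) => zone.write(sample))
  );
  const acked = results.filter((r) => r.status === "fulfilled").length;
  if (acked === 0) {
    // Every zone failed: surface the error so the client can retry.
    throw new Error("ingestion failed in all zones");
  }
  return { acked, total: zonePipelines.length };
}
```

The trade-off is eventual reconciliation between zones rather than synchronous quorum: for monitoring data, staying available during a partition matters more than perfect consistency.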

Cardinality Is the Silent Killer

Every observability system eventually hits a cardinality wall. A customer adds a user ID label to their metrics, and suddenly you're indexing millions of unique time series. Your storage costs spike, query latency degrades, and your aggregation layer starts getting OOM-killed.

// This looks innocent but creates O(users) time series
counter.labels({ endpoint: "/api/data", userId: req.userId }).inc();

// This is what you actually want
counter.labels({ endpoint: "/api/data", status: res.statusCode }).inc();

We built automatic cardinality detection that flagged label combinations exceeding thresholds before they brought down the pipeline. The hardest part wasn't the technical solution; it was convincing teams that fewer labels often means better observability. High-cardinality labels belong in traces, not metrics.
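The core of a cardinality guard like the one described above is just counting distinct label sets per metric and flagging any metric that crosses a budget before it reaches storage. This is a minimal sketch; the threshold and API shape are illustrative, not our production system.

```javascript
// Track the number of distinct label sets seen per metric and flag
// metrics that exceed a cardinality budget.
class CardinalityGuard {
  constructor(limit = 10000) {
    this.limit = limit;
    this.seen = new Map(); // metric name -> Set of serialized label sets
  }

  // Returns true if the sample is within budget, false if the metric
  // has blown past its limit and should be flagged or dropped.
  observe(metric, labels) {
    // Sort keys so {a, b} and {b, a} serialize identically.
    const key = Object.keys(labels)
      .sort()
      .map((k) => `${k}=${labels[k]}`)
      .join(",");
    let sets = this.seen.get(metric);
    if (!sets) {
      sets = new Set();
      this.seen.set(metric, sets);
    }
    sets.add(key);
    return sets.size <= this.limit;
  }
}
```

A real implementation would use a probabilistic counter such as HyperLogLog instead of exact sets, since the guard itself must not become the memory problem it exists to prevent.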

The Three Pillars Are Actually One Pillar

The industry talks about "three pillars" (metrics, logs, and traces) as if they're separate concerns. In practice, the most effective observability comes from correlating all three. When a P99 latency spike shows up in metrics, you need to click through to the traces that contributed to that percentile, then see the logs from those specific requests. Any gap in that correlation chain means slower incident response.

On our platform, we invested heavily in automatic context propagation — making sure trace IDs flowed through every log line and every metric sample. This wasn't trivial in polyglot environments where services mixed Java, Node.js, Go, and Python.

Alerting Is a UX Problem

Most monitoring tools get alerting wrong because they treat it as a math problem. Set a threshold, fire when exceeded. But the real challenge is alert quality. If your on-call engineer gets paged 50 times a night, they start ignoring alerts. That's worse than having no alerts at all.

We moved toward anomaly-based alerting with aggressive deduplication and correlation. Instead of firing separate alerts for CPU, memory, and error rate, we grouped correlated symptoms into a single incident with a root cause hypothesis. The goal was one page per actual incident, not one page per symptom. Getting this right reduced alert fatigue by 70% for our largest customers.
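The grouping step can be sketched as a time-windowed merge, assuming alerts are tagged with a service name and a timestamp: symptoms for the same service within a short window collapse into one incident, so CPU, memory, and error-rate alerts page once instead of three times. The window size and alert shape are illustrative.

```javascript
// Collapse correlated symptom alerts into incidents: alerts for the
// same service within windowMs of the incident's latest alert are
// merged rather than paged separately.
function groupIntoIncidents(alerts, windowMs = 5 * 60 * 1000) {
  const incidents = [];
  const sorted = [...alerts].sort((a, b) => a.ts - b.ts);
  for (const alert of sorted) {
    // Reuse an open incident for the same service within the window.
    const open = incidents.find(
      (i) => i.service === alert.service && alert.ts - i.lastTs <= windowMs
    );
    if (open) {
      open.symptoms.push(alert.symptom);
      open.lastTs = alert.ts;
    } else {
      incidents.push({
        service: alert.service,
        symptoms: [alert.symptom],
        firstTs: alert.ts,
        lastTs: alert.ts,
      });
    }
  }
  return incidents;
}
```

Production systems correlate on much richer signals than service name (topology, deploy events, shared dependencies), but the principle is the same: one page per incident, not one page per symptom.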

Observability · Distributed Systems · Monitoring