Observability

The measure of how well internal states of a system can be inferred from knowledge of its external outputs.

"Why is it broken?"

Observability (o11y) is the ability to answer new questions about your system without shipping new code.

Monitoring vs. Observability

Monitoring: Tells you when something is wrong. ("Alert: CPU is 99%").
Observability: Tells you why it is wrong. ("It's because User 123 sent a malformed JSON payload that triggered a regex loop in the Search Service").

The Three Pillars

Metrics: Aggregates. "Are we slow?" (Cheap, fast).
Logs: Events. "What happened?" (Detailed, expensive).
Traces: Context. "Where did the time go?" (Connects the dots).
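
Because metrics are aggregates, they can answer "Are we slow?" while hiding which requests are slow. The following is a minimal sketch with made-up numbers: 1,000 requests, of which 1% take 5 s and the rest take 100 ms. Both figures and the helper `nearest_rank_percentile` are assumptions chosen for illustration.

```python
import math

def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile (0 < pct <= 100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    rank = min(len(ordered), max(1, rank))  # clamp to a valid 1-based rank
    return ordered[rank - 1]

# Hypothetical sample: 990 fast requests (100 ms) and 10 slow ones (5000 ms).
latencies_ms = [100] * 990 + [5000] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = nearest_rank_percentile(latencies_ms, 50)
p999_ms = nearest_rank_percentile(latencies_ms, 99.9)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99.9={p999_ms} ms")
```

With these numbers the mean rises to 149 ms while the median stays at 100 ms, so the average looks mildly elevated rather than showing that a small group of requests takes 5 s. Tail percentiles surface the outliers, but they still do not say why those requests are slow.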

Example: The Mystery Latency

A checkout API was intermittently slow (5 s response times) for 1% of requests. Metrics showed "High Average Latency" but not why.

Impact

Lost revenue from frustrated users.

Resolution

Using distributed tracing, the team saw that every slow request was hitting a specific legacy fraud-check service that was timing out. Monitoring did not cover that legacy service, but the traces exposed it immediately.
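
The kind of question tracing answers here can be sketched in a few lines. This is an illustration only: the `Span` shape, the service names, and the durations are hypothetical, and it assumes the spans in a trace run one after another so their durations can be summed. Real tracing backends model parent/child spans and overlap.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    service: str
    duration_ms: int

def slowest_service_per_trace(spans, slow_threshold_ms):
    """For each trace at or above the threshold, name the service that took longest."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span.trace_id].append(span)

    culprits = {}
    for trace_id, trace_spans in by_trace.items():
        total_ms = sum(s.duration_ms for s in trace_spans)
        if total_ms >= slow_threshold_ms:
            culprits[trace_id] = max(trace_spans, key=lambda s: s.duration_ms).service
    return culprits

# Hypothetical spans: t1 is a fast checkout, t2 and t3 are slow ones.
spans = [
    Span("t1", "cart", 20), Span("t1", "payment", 60),
    Span("t2", "cart", 25), Span("t2", "fraud-check-legacy", 4900), Span("t2", "payment", 70),
    Span("t3", "cart", 18), Span("t3", "fraud-check-legacy", 4950), Span("t3", "payment", 65),
]

slow = slowest_service_per_trace(spans, slow_threshold_ms=4000)
print(slow)
```

Every slow trace points at the same service, which is the pattern the team found. An average over all requests cannot show this because it discards the per-request breakdown.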

Why Observability Matters

Observability goes beyond monitoring. It helps you ask questions you didn't know to ask.

Good observability shortens mean time to detect (MTTD) because you can see what is happening in your systems in real time.
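
MTTD is commonly computed as the average gap between when an incident began and when it was detected. A minimal sketch of that arithmetic, using invented incident timestamps:

```python
from datetime import datetime, timedelta

def mean_time_to_detect(incidents):
    """Average the gap between each incident's start and its detection."""
    gaps = [detected - started for started, detected in incidents]
    return sum(gaps, timedelta()) / len(gaps)

# Hypothetical (started, detected) pairs.
incidents = [
    (datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 6, 9, 12)),
    (datetime(2025, 2, 14, 22, 30), datetime(2025, 2, 14, 22, 38)),
    (datetime(2025, 3, 3, 14, 5), datetime(2025, 3, 3, 14, 15)),
]

mttd = mean_time_to_detect(incidents)
print(mttd)
```

The three gaps are 12, 8, and 10 minutes, so the MTTD for this sample is 10 minutes.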

Common Pitfalls

Hoarding Data

Storing terabytes of logs that no one ever reads. Observability is about *queryability*, not storage.
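
Queryability means being able to slice events by any field after the fact. As a rough sketch, assuming logs are stored as one JSON object per line (the field names and the `count_by` helper below are hypothetical):

```python
import json
from collections import Counter

# Hypothetical structured log lines, one JSON object each.
RAW_LOG_LINES = [
    '{"level": "error", "service": "search", "user_id": "123", "msg": "regex timeout"}',
    '{"level": "error", "service": "search", "user_id": "123", "msg": "regex timeout"}',
    '{"level": "info", "service": "search", "user_id": "456", "msg": "query ok"}',
    '{"level": "error", "service": "checkout", "user_id": "789", "msg": "card declined"}',
]

def count_by(lines, field, **filters):
    """Count events grouped by `field`, keeping only events that match every filter."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        if all(event.get(key) == value for key, value in filters.items()):
            counts[event.get(field)] += 1
    return counts

errors_by_user = count_by(RAW_LOG_LINES, "user_id", level="error")
print(errors_by_user.most_common())
```

The same helper can group by `service` or filter on `user_id` without any change to the code that produced the logs, which is the property that matters more than raw volume.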

How to Use Observability

📊 Logs: Structured logs with trace IDs.
📈 Metrics: Time-series data for trends.
🔗 Traces: End-to-end request flow visualization.
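
A structured log with a trace ID can be produced with Python's standard `logging` module alone. The sketch below is one possible setup, not a prescribed one: the `JsonFormatter` class, the `checkout` logger name, and the randomly generated trace ID are assumptions for illustration. In practice the trace ID would usually come from your tracing library rather than `uuid`.

```python
import io
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id arrives via `extra=`; None if the caller omitted it.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)

def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger

stream = io.StringIO()  # stand-in for stdout or a log shipper
logger = build_logger(stream)

trace_id = uuid.uuid4().hex  # placeholder for an ID supplied by a tracer
logger.info("fraud check started", extra={"trace_id": trace_id})
logger.info("fraud check timed out", extra={"trace_id": trace_id})

events = [json.loads(line) for line in stream.getvalue().splitlines()]
print(events)
```

Because both events carry the same `trace_id`, a search for that ID returns every log line for the request and links them to the corresponding trace.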

Related Terms

Monitoring, MTTD, MTTR, SLI

Frequently Asked Questions

Do I need a fancy tool?

Tools like Honeycomb, Datadog, or Jaeger help, but observability is a practice, not a tool. Start with structured logs.
