Mean Time To Detect

The average time from when an issue occurs to when an alert fires.

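MTTD = Σ Detection Times / n, where each detection time is the gap between an issue starting and its alert firing, and n is the number of incidents.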
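As a rough illustration of the formula, the sketch below averages detection times over a handful of incidents. The function name `mean_time_to_detect` and the sample timestamps are made up for this example; it assumes you can record both when each issue started and when its alert fired.

```python
from datetime import datetime, timedelta

def mean_time_to_detect(incidents):
    """Average of (alert_fired - issue_started) across incidents.

    `incidents` is a list of (issue_started, alert_fired) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTD")
    total = sum((fired - started for started, fired in incidents), timedelta())
    return total / len(incidents)

# Hypothetical incident log: detection times of 2, 10, and 3 minutes.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 2)),
    (datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 14, 10)),
    (datetime(2024, 1, 9, 8, 30), datetime(2024, 1, 9, 8, 33)),
]

mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:05:00
```

In practice the hard part is usually the start timestamp, which often has to be reconstructed after the fact from logs or dashboards.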

The "Eyes" of the System

MTTD (Mean Time To Detect) measures the gap between "It broke" and "We know it broke."

The "Scream Test"

If your MTTD is high (hours or days), you likely rely on customers to report bugs via support tickets or Twitter. This is called the "Scream Test," and it is the worst way to monitor a system.

How to Improve MTTD

  1. Observability: You cannot detect what you cannot see. Instrument your code.
  2. Synthetics: Have a bot attempt to "Login" and "Checkout" every minute. If it fails, alert immediately.
  3. Anomaly Detection: Use AI/ML to detect weird patterns (e.g., "Traffic dropped by 50%").
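To make the "Traffic dropped by 50%" example concrete, here is a minimal sketch that compares the current request rate with the median of recent samples. It is a plain threshold check standing in for a real AI/ML detector, and the function name, the sample data, and the 50% default are assumptions for illustration only.

```python
from statistics import median

def traffic_dropped(history, current, drop_threshold=0.5):
    """Return True if `current` is at least `drop_threshold` below the
    median of `history` (e.g. 0.5 means a drop of 50% or more)."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(history)
    if baseline <= 0:
        return False  # avoid dividing by a zero or negative baseline
    return (baseline - current) / baseline >= drop_threshold

# Hypothetical requests-per-minute samples; the median is 1000.
recent_rpm = [980, 1010, 1000, 995, 1020]

print(traffic_dropped(recent_rpm, 480))  # True: 52% below the baseline
print(traffic_dropped(recent_rpm, 900))  # False: only 10% below
```

A median baseline is used here because a single spike or dip in the history should not shift it much; a production detector would also need to account for daily and weekly seasonality.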

Example: The Silent Cache Failure

"A caching layer failed, slowing the site down by 500ms. No errors were thrown, so no alerts fired."

Impact: The issue persisted for 3 days until a user complained about slowness.
Resolution: Team added Latency SLOs (alert if p95 latency > 200ms). MTTD for slowness dropped to <1 minute.
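A latency check of this kind could look roughly like the sketch below. It assumes latency samples are already collected in milliseconds and uses a nearest-rank p95; the function names and sample values are hypothetical, and real monitoring systems typically compute percentiles from histograms rather than raw lists.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a list of latency samples."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

def latency_slo_breached(samples_ms, threshold_ms=200):
    """True if p95 latency exceeds the threshold (200 ms in the example)."""
    return p95(samples_ms) > threshold_ms

# Hypothetical samples: a healthy window, then the same window +500 ms,
# mimicking the silent cache failure described above.
healthy = [120] * 95 + [180] * 5
degraded = [ms + 500 for ms in healthy]

print(latency_slo_breached(healthy))   # False: p95 is 120 ms
print(latency_slo_breached(degraded))  # True: p95 is 620 ms
```

The point of the example is that no error is needed to trigger it: the alert depends only on how slow requests are.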

Why MTTD Matters

A high MTTD is the hidden killer of uptime: nobody can start fixing an issue until it has been detected. The faster you detect issues, the faster you can resolve them.

Many teams discover outages from customers first. Proactive detection prevents reputation damage.

MTTD vs. Other Metrics

MTTD: Issue → Alert fires
MTTA: Alert → Acknowledged
MTTR: Alert → Resolved
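The three intervals can be derived from four timestamps per incident, as in the sketch below. It follows the definitions above, measuring resolution from the alert; some teams measure time to resolve from the start of the issue instead. The function name, dictionary keys, and timestamps are illustrative assumptions.

```python
from datetime import datetime, timedelta

def incident_intervals(started, alerted, acknowledged, resolved):
    """Split one incident into the intervals behind MTTD, MTTA, and MTTR."""
    return {
        "time_to_detect": alerted - started,           # Issue -> Alert fires
        "time_to_acknowledge": acknowledged - alerted,  # Alert -> Acknowledged
        "time_to_resolve": resolved - alerted,          # Alert -> Resolved
    }

# Hypothetical incident timeline.
intervals = incident_intervals(
    started=datetime(2024, 3, 1, 9, 0),
    alerted=datetime(2024, 3, 1, 9, 4),
    acknowledged=datetime(2024, 3, 1, 9, 9),
    resolved=datetime(2024, 3, 1, 9, 49),
)

for name, value in intervals.items():
    print(name, value)
```

Averaging each interval across incidents gives the corresponding "mean time" metric.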

Common Pitfalls

Monitoring Only Uptime: A server can be "up" (responding 200 OK) but serving blank white pages. Monitor functionality, not just headers.
Missing Third-Party Failures: If Stripe goes down, your payments fail. Monitor your dependencies.
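One way to check functionality rather than just the status code is to assert on the page content, as in this minimal sketch. The function `page_is_healthy` and the "Add to cart" marker are hypothetical; in a real probe the status code and body would come from an HTTP client, and the same check can be pointed at a third-party dependency's endpoint.

```python
def page_is_healthy(status_code, body, required_marker):
    """A 200 alone is not enough: the body must also contain content
    that only a correctly rendered page would include."""
    return status_code == 200 and required_marker in body

# A blank page served with 200 OK passes an uptime check but fails this one.
blank_page = "<html><body></body></html>"
real_page = "<html><body><button>Add to cart</button></body></html>"

print(page_is_healthy(200, blank_page, "Add to cart"))  # False
print(page_is_healthy(200, real_page, "Add to cart"))   # True
print(page_is_healthy(503, real_page, "Add to cart"))   # False
```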

How to Use MTTD

📊 Comprehensive Monitoring: Implement the "Four Golden Signals" (Latency, Traffic, Errors, Saturation).
🎯 SLOs: Alert when the error budget burns too fast.
🔍 Synthetic Monitoring: Simulate user traffic to catch issues 24/7.
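"Burns too fast" is usually expressed as a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below assumes a 99.9% availability SLO and a paging threshold of 14.4, a commonly cited value that corresponds to spending 2% of a 30-day budget in one hour; both numbers and the function names are assumptions to adjust for your own service.

```python
def burn_rate(errors, requests, slo_target):
    """Observed error rate divided by the error budget (1 - SLO target).

    A burn rate of 1.0 spends the budget exactly over the SLO period;
    higher values spend it proportionally faster.
    """
    if requests == 0:
        return 0.0  # no traffic in the window, nothing burned
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget

def should_page(errors, requests, slo_target=0.999, max_burn_rate=14.4):
    """True if the window's burn rate is at or above the paging threshold."""
    return burn_rate(errors, requests, slo_target) >= max_burn_rate

# Hypothetical one-hour windows with 10,000 requests each.
print(should_page(errors=200, requests=10_000))  # True: burn rate is about 20
print(should_page(errors=5, requests=10_000))    # False: burn rate is about 0.5
```

Alerting on burn rate ties the alert to user-visible errors, so a brief blip that barely touches the budget does not page anyone.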

Industry Benchmarks

Excellent (Top 5%): < 1 min
Good (Top 15%): 1-5 min
Average (Top 40%): 5-15 min
Struggling (Below Avg): > 15 min

Frequently Asked Questions

Is MTTD zero if I have instantaneous alerts?
Technically yes, but practically there is always a lag (e.g., 30s polling interval). < 1 minute is the gold standard.
Should I alert on everything?
No. Alerting on every minor jitter causes Alert Fatigue. Focus on symptoms (User Pain) rather than causes (High CPU).
