
Mean Time Between Failures

The average time between system failures or incidents.

MTBF = Total Uptime / Number of Failures
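As a minimal sketch of the formula above, the snippet below computes MTBF from an observation window. The function name `mtbf_hours` and the sample figures (a 30-day window, 6 hours of downtime, 3 failures) are illustrative assumptions, not values from a real system.

```python
def mtbf_hours(total_uptime_hours: float, failure_count: int) -> float:
    """MTBF = Total Uptime / Number of Failures."""
    if failure_count <= 0:
        # With no failures in the window there is nothing to average over.
        raise ValueError("MTBF is undefined without at least one failure")
    return total_uptime_hours / failure_count


# Hypothetical window: 30 days observed, 6 hours of it spent down, 3 failures.
observation_hours = 30 * 24
downtime_hours = 6
print(mtbf_hours(observation_hours - downtime_hours, 3))  # 238.0
```

Note that the numerator is uptime, not wall-clock time, so downtime is subtracted from the window before dividing.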

The Reliability Metric

MTBF (Mean Time Between Failures) is a classic reliability metric, borrowed from hardware engineering, that measures how long a system runs on average between failures.

MTBF vs. MTTR

  • MTBF asks: "How robust is the system?" (Reliability)
  • MTTR asks: "How fast can we fix it?" (Resilience)

The Availability Equation

Availability (e.g., 99.9%) is mathematically derived from these two numbers:

Availability = MTBF / (MTBF + MTTR)

To improve availability, you can either fail less often (increase MTBF) or recover faster (decrease MTTR). Modern SRE teams tend to focus more on MTTR: complex systems will eventually fail, so recovering quickly is more sustainable than trying to prevent every failure.
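The two levers can be compared numerically. The sketch below applies the availability equation to a hypothetical baseline (MTBF of 100 hours, MTTR of 1 hour; both numbers are assumptions chosen for illustration) and then either doubles MTBF or halves MTTR.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: fails every 100 hours, takes 1 hour to recover.
baseline = availability(100.0, 1.0)
doubled_mtbf = availability(200.0, 1.0)  # crash half as often
halved_mttr = availability(100.0, 0.5)   # recover twice as fast

print(f"{baseline:.4f}")      # 0.9901
print(f"{doubled_mtbf:.4f}")  # 0.9950
print(f"{halved_mttr:.4f}")   # 0.9950
```

Availability depends only on the ratio of MTTR to MTBF, so halving MTTR buys the same availability as doubling MTBF. Which lever is cheaper to pull depends on the system.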

Example: The Memory Leak

"A server process had a slow memory leak that caused it to crash every 48 hours (MTBF = 48h)."

Impact
Frequent, predictable outages annoyed users.
Resolution
Engineers fixed the leak. Now the server runs 90 days without crashing (MTBF = 2160h).
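The arithmetic behind this example can be checked in a few lines. The variable names are illustrative, and the crash count is a rough estimate that ignores the time spent restarting.

```python
HOURS_PER_DAY = 24

mtbf_before_h = 48                  # crash every 48 hours
mtbf_after_h = 90 * HOURS_PER_DAY   # 90 days between crashes

# How many times longer the server now runs between failures.
improvement = mtbf_after_h / mtbf_before_h

# Approximate crashes in a 90-day period before the fix (restart time ignored).
crashes_per_90_days_before = (90 * HOURS_PER_DAY) / mtbf_before_h

print(mtbf_after_h)                # 2160
print(improvement)                 # 45.0
print(crashes_per_90_days_before)  # 45.0
```

In other words, the fix turns roughly 45 crashes per 90 days into about one.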

Why MTBF Matters

MTBF measures system reliability. Higher MTBF means more stable infrastructure.

It is used alongside MTTR to calculate availability: Availability = MTBF / (MTBF + MTTR).

MTBF vs. Other Metrics

  • MTBF: Uptime between failures
  • MTTR: Downtime duration
  • Availability: MTBF / (MTBF + MTTR)

Common Pitfalls

Tracking MTBF for Software
Hardware tends to fail from wear-out, while software tends to fail when something changes (for example, a deployment). That makes MTBF less useful for software than for hardware.
Ignoring Recovery
Optimizing for high MTBF (never failing) is expensive. Optimizing for low MTTR (fast recovery) is often better.

How to Use MTBF

  • Preventive Maintenance: Automate patch management and dependency updates.
  • Testing: Catch bugs in staging with integration and E2E tests.
  • Capacity Planning: Auto-scale before you hit resource limits.

Industry Benchmarks

  • Excellent (top 5%): > 720 hours
  • Good (top 15%): 168-720 hours
  • Average (top 40%): 72-168 hours
  • Struggling (below average): < 72 hours
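The tiers above can be expressed as a small lookup. This is a sketch only: the function name `benchmark_tier` is hypothetical, and because the listed ranges share their endpoints (168 and 720 each appear in two tiers), the code assumes that a value of exactly 168 counts as Good, exactly 72 as Average, and exactly 720 as Good (since Excellent is strictly above 720).

```python
def benchmark_tier(mtbf_hours: float) -> str:
    """Map an MTBF in hours to the benchmark tier listed above."""
    if mtbf_hours > 720:
        return "Excellent"
    if mtbf_hours >= 168:   # assumption: boundary value goes to the higher tier
        return "Good"
    if mtbf_hours >= 72:    # assumption: boundary value goes to the higher tier
        return "Average"
    return "Struggling"


print(benchmark_tier(48))    # Struggling
print(benchmark_tier(2160))  # Excellent
```

The two sample values are the before and after MTBF figures from the memory leak example.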

Frequently Asked Questions

Does scheduled maintenance count?
Usually no. MTBF measures *failures* (unexpected outages). Scheduled downtime is planned.
Is high MTBF always good?
Yes, but not at the cost of velocity. If you never ship code, your MTBF will be infinite, but your product will die. Balance innovation with reliability.
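To make the scheduled-maintenance answer concrete, here is one possible way to keep planned downtime out of the failure count. The `Outage` record and its `planned` flag are hypothetical, and the choice to still subtract planned downtime from uptime (because the system was not up during it) is an assumption; some teams define the window differently.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    duration_hours: float
    planned: bool  # True for scheduled maintenance windows


def mtbf_excluding_planned(window_hours: float, outages: list[Outage]) -> float:
    """MTBF where only unplanned outages count as failures."""
    failures = [o for o in outages if not o.planned]
    if not failures:
        raise ValueError("MTBF is undefined without at least one unplanned failure")
    # Assumption: planned downtime is not uptime, so it is still subtracted.
    uptime_hours = window_hours - sum(o.duration_hours for o in outages)
    return uptime_hours / len(failures)


# Hypothetical 30-day window (720 h): two crashes and one maintenance window.
outages = [
    Outage(duration_hours=2.0, planned=False),
    Outage(duration_hours=4.0, planned=True),
    Outage(duration_hours=1.0, planned=False),
]
print(mtbf_excluding_planned(720.0, outages))  # 356.5
```

Counting the maintenance window as a third failure would instead give 713 / 3, roughly 237.7 hours, which understates reliability.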
