Mean Time Between Failures
The average time between system failures or incidents.
The Reliability Metric
MTBF (Mean Time Between Failures) is a classic engineering metric, borrowed from hardware, that measures how often things break: the average time a system runs between one failure and the next.
MTBF vs. MTTR
- MTBF asks: "How robust is the system?" (Reliability)
- MTTR asks: "How fast can we fix it?" (Resilience)
The Availability Equation
Availability (e.g., 99.9%) is mathematically derived from these two numbers:
Availability = MTBF / (MTBF + MTTR)
To improve availability, you can either fail less often (increase MTBF) or recover faster (decrease MTTR). Modern SRE teams focus more on MTTR because complex systems will eventually fail, so being able to recover fast is more sustainable than trying to prevent every failure.
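As a minimal sketch of the availability equation above, the two levers can be compared numerically. The function name `availability` and the figures (1,000 hours between failures, 1 hour to repair) are illustrative assumptions, not measurements from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), returned as a fraction of 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: a failure every 1000 hours, 1 hour to recover.
baseline = availability(1000, 1)

# Lever 1: fail half as often (double MTBF).
doubled_mtbf = availability(2000, 1)

# Lever 2: recover twice as fast (halve MTTR).
halved_mttr = availability(1000, 0.5)

print(f"baseline:     {baseline:.4%}")
print(f"doubled MTBF: {doubled_mtbf:.4%}")
print(f"halved MTTR:  {halved_mttr:.4%}")
```

Under this formula, doubling MTBF and halving MTTR give the same availability, since only the ratio of the two numbers matters. The choice between them comes down to which change is cheaper to achieve in practice.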
Example: The Memory Leak
"A server process had a slow memory leak that caused it to crash every 48 hours (MTBF = 48h)."
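One way to arrive at a figure like the 48 hours in this example is to average the gaps between recorded crashes. The sketch below assumes a hypothetical crash log (the timestamps are invented for illustration) and, as a simplification, ignores repair time, so each gap is measured from one crash to the next.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(failure_times):
    """Average gap between consecutive failures, returned as a timedelta."""
    if len(failure_times) < 2:
        raise ValueError("need at least two failures to measure a gap")
    ordered = sorted(failure_times)
    # Pair each failure with the next one and take the difference.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical crash log: the leaking process dies every 48 hours.
first_crash = datetime(2024, 1, 1)
crashes = [first_crash + timedelta(hours=48 * i) for i in range(4)]

mtbf = mean_time_between_failures(crashes)
mtbf_hours = mtbf.total_seconds() / 3600
print(f"MTBF = {mtbf_hours:.0f}h")  # prints "MTBF = 48h"
```

With real incident data the gaps will vary, and a stricter calculation would subtract downtime from each gap so that only operating time counts toward MTBF.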
Why MTBF Matters
MTBF measures system reliability. Higher MTBF means more stable infrastructure.
It is used alongside MTTR to calculate availability: Availability = MTBF / (MTBF + MTTR).