
Reliability

The probability that a system will function correctly under stated conditions for a specified period.

Trust Validation

Reliability is broader than uptime. It asks: "Does the system do what the user expects it to do?"

Reliability vs. Availability

  • Availability: "The site loads."
  • Reliability: "The site loads and I can add items to my cart."
    A site that returns a 200 OK status code but serves a blank white page is Available but Unreliable.

Principles of Reliable Systems

  1. Redundancy: No single point of failure. Run at least one more instance than the load requires (N+1).
  2. Graceful degradation: If the search bar breaks, the rest of the site should still work.
  3. Simplicity: Boring is better. Complex systems break in complex ways.
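
As a rough illustration of graceful degradation, the sketch below isolates a failing search call so the rest of the page still renders. Everything here is assumed for the example: `search_products`, `render_page`, and the page fields are hypothetical names, and the search backend is hard-coded to fail.

```python
def search_products(query: str) -> list:
    """Hypothetical search backend that is currently failing."""
    raise ConnectionError("search service unreachable")


def render_page(query: str) -> dict:
    """Build the page; a search failure degrades one section, not the whole page."""
    page = {"catalog": ["movie-a", "movie-b"], "search_results": [], "search_ok": True}
    try:
        page["search_results"] = search_products(query)
    except ConnectionError:
        # Degrade: keep the catalog usable and flag search as unavailable.
        page["search_ok"] = False
    return page


page = render_page("space")
print(page["search_ok"], page["catalog"])  # False ['movie-a', 'movie-b']
```

The design choice is to catch the failure at the boundary of the optional feature, so the error never propagates to the parts of the page that do not depend on it.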

Example: The "Ghost" Site

A video streaming site was "up" (users could browse movies), but video playback failed for 10% of users.

Impact
Monitoring showed 99.99% availability (homepage loaded), but support tickets spiked.
Resolution
The team expanded its definition of reliability to include "Successful Video Start Rate".
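
The gap between the two numbers in this example can be shown with a small calculation. The counts below are illustrative only, chosen to match the percentages in the story, and `success_rate` is a hypothetical helper.

```python
def success_rate(successes: int, attempts: int) -> float:
    """Fraction of attempts that succeeded; 1.0 when there were no attempts."""
    return successes / attempts if attempts else 1.0


# Illustrative counts only, chosen to match the percentages in the example.
homepage_loads_ok, homepage_loads = 99_990, 100_000
video_starts_ok, video_starts = 9_000, 10_000

availability = success_rate(homepage_loads_ok, homepage_loads)
video_start_rate = success_rate(video_starts_ok, video_starts)

print(f"{availability:.2%}")      # 99.99%
print(f"{video_start_rate:.2%}")  # 90.00%
```

Both figures come from the same formula; what changed is which user action is counted as an attempt.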

Why Reliability Matters

Reliability is the foundation of user trust. Unreliable systems lose customers and damage reputation.

SRE is fundamentally about engineering reliability into systems from the start.

Common Pitfalls

Monitoring Only Servers
Users don't care if your server CPU is low. They care if their video plays.
Reliability Theater
Having a status page that always stays green even when Twitter is exploding with complaints.

How to Use Reliability

🎯 Set SLOs: Define what reliability means for your service.
📊 Measure SLIs: Track the metrics that matter to users.
🧪 Test Regularly: Load test and practice incidents.
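
One way to connect an SLO to a measured SLI is an error budget: the number of bad events the target tolerates. The sketch below assumes a simple event-count model; the `SLO` class, its method names, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    name: str
    target: float  # e.g. 0.999 means 99.9% of events must be good

    def error_budget(self, total_events: int) -> float:
        """Number of bad events the target tolerates over `total_events`."""
        return (1 - self.target) * total_events

    def budget_remaining(self, good_events: int, total_events: int) -> float:
        """Allowed bad events minus observed bad events (negative means the SLO is breached)."""
        bad_events = total_events - good_events
        return self.error_budget(total_events) - bad_events


slo = SLO(name="video-start-success", target=0.999)
remaining = slo.budget_remaining(good_events=999_500, total_events=1_000_000)
print(round(remaining))  # 500
```

With a 99.9% target over one million events, 1,000 failures are tolerated; 500 observed failures leave half the budget unspent.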

Frequently Asked Questions

How do I improve reliability?
Start by measuring it (SLIs) against explicit targets (SLOs). Then rigorously test your failure modes (chaos engineering).
Can I be too reliable?
Yes. If users are happy with 99.9%, spending millions to reach 99.99% is wasteful. Feature velocity matters too.
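
To see what the extra "nine" buys, the targets can be converted into permitted downtime. This is a back-of-the-envelope sketch that assumes a 30-day window and treats reliability as time-based availability; `allowed_downtime_minutes` is a hypothetical helper.

```python
def allowed_downtime_minutes(availability_target: float, period_days: int = 30) -> float:
    """Minutes of downtime a target permits over `period_days`."""
    return (1 - availability_target) * period_days * 24 * 60


for target in (0.999, 0.9999):
    print(f"{target:.2%}: {allowed_downtime_minutes(target):.1f} min per 30 days")
# 99.90%: 43.2 min per 30 days
# 99.99%: 4.3 min per 30 days
```

Moving from 99.9% to 99.99% removes roughly 39 minutes of permitted downtime per 30 days; whether that is worth the cost depends on whether users would notice the difference.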
