Root Cause Analysis

A systematic method for identifying the underlying causes of problems or incidents.

"Digging Deeper"

Root Cause Analysis (RCA) is the detective work of SRE. It moves past "The server crashed" to ask "Why did the server crash?"

The "5 Whys" Technique

Ask "Why?" five times to get to the root.

  1. Why? The database locked up.
  2. Why? It ran out of connections.
  3. Why? The new "Recommendations" service leaked connections.
  4. Why? The connection pool library was outdated.
  5. Why? (Root Cause): We don't have automated dependency scanning to catch outdated libraries.
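
The chain above can be captured as plain data so it can be stored with an incident record. The following is a minimal sketch, not a standard tool: the `Why` class, the `root_cause` and `render` helpers, and the problem statement "The server crashed" are illustrative assumptions. It treats the last answer in the chain as the root cause, which is the convention the list above follows.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One step in a 5 Whys chain: the answer to the previous 'Why?'."""
    answer: str


def root_cause(chain: list[Why]) -> str:
    """Return the last answer in the chain, treated here as the root cause."""
    if not chain:
        raise ValueError("a 5 Whys chain needs at least one answer")
    return chain[-1].answer


def render(problem: str, chain: list[Why]) -> str:
    """Format the chain as a numbered list, marking the final step."""
    lines = [f"Problem: {problem}"]
    for i, step in enumerate(chain, start=1):
        label = " (root cause)" if i == len(chain) else ""
        lines.append(f"{i}. Why?{label} {step.answer}")
    return "\n".join(lines)


# The chain from the example above; the problem statement is assumed.
chain = [
    Why("The database locked up."),
    Why("It ran out of connections."),
    Why('The new "Recommendations" service leaked connections.'),
    Why("The connection pool library was outdated."),
    Why("We don't have automated dependency scanning to catch outdated libraries."),
]

print(render("The server crashed", chain))
```

Five is a guideline, not a rule: the chain can be shorter or longer, as long as the last step is something you can fix.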

The Goal Is Prevention

If you fix the connection leak (the symptom) but don't add dependency scanning (the root cause), another outdated library will break something next month.

Example: The Jefferson Memorial

"The stone of the Jefferson Memorial was eroding. Why? They washed it too often. Why? Too many birds pooped on it. Why? Birds ate spiders there. Why? Spiders ate midges (bugs). Why? Midges swarmed the lights at dusk."

Impact: Complex chain of causality.

Resolution: The root cause was that the lights turned on 1 hour too early. The solution was to turn the lights on 1 hour later. The midges (and birds) left.

Why Root Cause Analysis Matters

Treating symptoms without finding root causes guarantees the problem will recur.

Good RCA prevents incidents from happening again and improves system reliability over time.

Common Pitfalls

Stopping at Human Error
If your RCA ends with "Engineer made a typo," you failed. Ask why the system allowed a typo to take down prod.
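
One way to catch this pitfall during review is a simple check on the stated conclusion. The sketch below is a rough heuristic under stated assumptions: the `HUMAN_ERROR_HINTS` keyword list and the `stops_at_human_error` function are hypothetical, and keyword matching will miss some phrasings and flag others wrongly. It is a prompt to ask another "Why?", not a verdict.

```python
# Hypothetical keyword list; tune it to the wording your team actually uses.
HUMAN_ERROR_HINTS = ("typo", "human error", "forgot", "mistake", "careless")


def stops_at_human_error(conclusion: str) -> bool:
    """Return True if the stated root cause looks like it blames a person."""
    text = conclusion.lower()
    return any(hint in text for hint in HUMAN_ERROR_HINTS)


for conclusion in (
    "Engineer made a typo",
    "We don't have automated dependency scanning to catch outdated libraries.",
):
    if stops_at_human_error(conclusion):
        print(f"Keep asking why: {conclusion!r}")
    else:
        print(f"Looks systemic: {conclusion!r}")
```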

How to Use Root Cause Analysis

Ask 5 Whys: Dig deep to find the real cause.
Fishbone Diagram: Visualize possible causes.
Document Everything: Keep a record for future reference.
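
A fishbone diagram groups candidate causes under categories that branch off the problem. Before drawing one, it can help to collect the causes as data. The following is a minimal sketch: the `render_fishbone` function and the category names (`Process`, `Technology`) are assumptions chosen for illustration, and the causes are taken from the 5 Whys example earlier in this article.

```python
def render_fishbone(problem: str, causes: dict[str, list[str]]) -> str:
    """Render categories and their candidate causes as an indented outline."""
    lines = [f"Problem: {problem}"]
    for category, items in causes.items():
        lines.append(f"  {category}")
        for item in items:
            lines.append(f"    - {item}")
    return "\n".join(lines)


# Category names are illustrative; use whatever grouping fits your system.
causes = {
    "Process": ["No automated dependency scanning"],
    "Technology": [
        "Connection pool library was outdated",
        "Recommendations service leaked connections",
    ],
}

print(render_fishbone("The database locked up", causes))
```

Keeping the outline in a text form like this also covers the last step above: it can be saved alongside the incident record for future reference.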

Frequently Asked Questions

Is there always one root cause?
Rarely. In complex systems, it is usually a combination of factors ("Swiss Cheese Model"). But "Root Cause" is a useful shorthand for "The thing we can fix to prevent this."
