Root Cause Analysis
A systematic method for identifying the underlying causes of problems or incidents.
"Digging Deeper"
Root Cause Analysis (RCA) is the detective work of SRE. It moves past "The server crashed" to ask "Why did the server crash?"
The "5 Whys" Technique
Ask "Why?" repeatedly, typically five times, until you reach the root cause. Starting from "The server crashed":
- Why? The database locked up.
- Why? It ran out of connections.
- Why? The new "Recommendations" service leaked connections.
- Why? The connection pool library was outdated.
- Why? (Root Cause): We don't have automated dependency scanning to catch outdated libraries.
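The chain above can be written down as data, which makes it easy to attach to an incident report. This is a minimal sketch, not a standard RCA tool: the names `PROBLEM`, `ANSWERS`, `root_cause`, and `render_chain` are hypothetical, and the starting problem is assumed to be the server crash mentioned in the introduction.

```python
from typing import Sequence

# Hypothetical incident record: the starting problem is an assumption taken
# from the introduction; the answers are the five from the list above.
PROBLEM = "The server crashed"
ANSWERS = [
    "The database locked up.",
    "It ran out of connections.",
    'The new "Recommendations" service leaked connections.',
    "The connection pool library was outdated.",
    "We don't have automated dependency scanning to catch outdated libraries.",
]


def root_cause(answers: Sequence[str]) -> str:
    """Treat the answer to the final "Why?" as the root cause."""
    if not answers:
        raise ValueError("need at least one answer")
    return answers[-1]


def render_chain(problem: str, answers: Sequence[str]) -> str:
    """Format the chain as indented text, marking the last answer as the root cause."""
    lines = [f"Problem: {problem}"]
    last = len(answers)
    for depth, answer in enumerate(answers, start=1):
        label = "Root cause" if depth == last else f"Why #{depth}"
        lines.append(f"{'  ' * depth}{label}: {answer}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_chain(PROBLEM, ANSWERS))
```

Treating the last answer as the root cause is a convention of the technique, not a proof. If the final answer still describes a symptom, the chain needs another "Why?".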
The Goal Is Prevention
If you fix the connection leak (symptom) but don't fix the dependency scanning (root cause), another library will break next month.
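To make "dependency scanning" concrete, here is a rough sketch of the kind of check that would have flagged the outdated connection pool library. It assumes plain dotted numeric versions such as `1.9.3`, and the package names, versions, and function names are all made up for illustration. A real scanner would read a lockfile and consult a vulnerability or release database.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "2.10.1" into (2, 10, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_outdated(installed: Dict[str, str], minimum: Dict[str, str]) -> List[str]:
    """Return sorted names of packages installed below their minimum version."""
    outdated = []
    for name, required in minimum.items():
        current = installed.get(name)
        # Packages that are not installed at all are ignored in this sketch.
        if current is not None and parse_version(current) < parse_version(required):
            outdated.append(name)
    return sorted(outdated)


# Hypothetical data: these package names and versions are invented.
INSTALLED = {"conn-pool": "1.9.3", "http-client": "4.2.0"}
MINIMUM = {"conn-pool": "1.10.0", "http-client": "4.0.0"}

if __name__ == "__main__":
    print(find_outdated(INSTALLED, MINIMUM))
```

Comparing versions as tuples of integers matters here: as strings, "1.9.3" would sort after "1.10.0" and the outdated library would slip through.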
Example: The Jefferson Memorial
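"The stone of the Jefferson Memorial was eroding. Why? They washed it too often. Why? Too many birds pooped on it. Why? Birds ate spiders there. Why? Spiders ate midges (bugs). Why? Midges swarmed the lights at dusk."
As the story is usually told, the fix was to turn the lights on later in the evening, which addressed the root cause instead of the washing schedule.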
Why Root Cause Analysis Matters
Treating symptoms without finding root causes guarantees the problem will recur.
Good RCA prevents incidents from happening again and improves system reliability over time.