PROCESSES
Runbook
A step-by-step guide for handling specific operational tasks or incidents.
The "Checklist"
A Runbook is a recipe. It assumes the reader is smart but stressed. It focuses on Action.
Elements of a Good Runbook
- Triggers: "Use this when Alert X fires."
- Impact: "This issue causes 500 errors on checkout."
- Steps:
  - Check Dashboard Y.
  - If CPU > 90%, run command Z.
  - If not, escalate to Database Team.
- Verification: "How do I know it's fixed?"
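The elements above can be captured as structured data so that every runbook has the same shape. What follows is a minimal Python sketch, not a prescribed format: the `Runbook` class, its field names, and the `next_action` helper are hypothetical, and "Dashboard Y", "command Z", and the 90% threshold are the placeholders from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runbook:
    """Hypothetical container for the four elements of a runbook."""
    trigger: str
    impact: str
    steps: tuple[str, ...]
    verification: str


CPU_THRESHOLD_PERCENT = 90.0  # threshold taken from the example steps


def next_action(cpu_percent: float) -> str:
    """Return the branch of the example steps that applies to a CPU reading."""
    if cpu_percent > CPU_THRESHOLD_PERCENT:
        return "Run command Z."
    return "Escalate to Database Team."


checkout_runbook = Runbook(
    trigger="Use this when Alert X fires.",
    impact="This issue causes 500 errors on checkout.",
    steps=(
        "Check Dashboard Y.",
        "If CPU > 90%, run command Z.",
        "If not, escalate to Database Team.",
    ),
    verification="How do I know it's fixed?",
)
```

Writing the decision as an explicit branch leaves no room for interpretation at 3 AM: a reading of exactly 90% does not exceed the threshold, so it escalates.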
Runbook vs. Documentation
- Docs: "Here is how the system works." (Read this on Tuesday morning).
- Runbook: "Here is how to fix the system." (Read this at 3 AM on Saturday).
Example: The "Restart" Runbook
A complex microservice required a specific restart order (DB -> Cache -> App).
Impact
Engineers often guessed the order, corrupting data.
Resolution
A simple checklist runbook was created: "Step 1: Stop App. Step 2: Flush Cache. Step 3: Restart DB." Engineers stopped guessing the order, and restarts became routine.
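One way to enforce an ordered checklist like this is to run the steps from a fixed list and stop at the first failure. The Python sketch below is illustrative only: `run_checklist` and the three placeholder functions are hypothetical, and a real runbook would replace them with calls to the service's own tooling.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_checklist(steps: List[Step]) -> List[str]:
    """Run steps strictly in order; stop at the first one that reports failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break  # never continue past a failed step
        completed.append(name)
    return completed


# Placeholder actions (assumptions): each returns True to signal success.
def stop_app() -> bool:
    return True


def flush_cache() -> bool:
    return True


def restart_db() -> bool:
    return True


# Step names copied from the checklist quoted above.
RESTART_STEPS: List[Step] = [
    ("Stop App", stop_app),
    ("Flush Cache", flush_cache),
    ("Restart DB", restart_db),
]

if __name__ == "__main__":
    for done in run_checklist(RESTART_STEPS):
        print(f"done: {done}")
```

Because the order lives in one list, nobody has to remember it under pressure, and a failed step halts the sequence instead of letting later steps run against a broken dependency.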
Why Runbooks Matter
Runbooks reduce cognitive load during incidents. Follow the steps instead of figuring it out live.
Good runbooks enable on-call success and faster incident resolution.
Common Pitfalls
Outdated Info
Runbooks must be "living" documents. If a runbook fails, update it immediately.
Assuming Knowledge
Give exact commands. Don't write "Connect to the server". Write "ssh user@10.0.0.1".
How to Use a Runbook
- Keep It Simple: Checklists work better than essays.
- Update Often: Stale runbooks are worse than none.
- Test During Game Days: Verify runbooks actually work.
Frequently Asked Questions
How long should a runbook be?
Short. If it is longer than 1 page, no one will read it during an outage.
What if we can automate the runbook?
Do it! An executable script is the ultimate runbook. A runbook whose only step is "Run `fix_db.sh`" is the easiest to follow and to keep up to date.
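As a rough illustration, a thin wrapper can make an automated runbook safer to rehearse. The Python sketch below assumes `fix_db.sh` sits in the current directory. The `run_fix` function and its `dry_run` flag are hypothetical additions, not part of any existing tool.

```python
import subprocess
from typing import List


def run_fix(script: str = "./fix_db.sh", dry_run: bool = True) -> List[str]:
    """Build the command for the fix script and optionally execute it.

    With dry_run=True nothing is executed; the command is only returned.
    """
    command = [script]
    if not dry_run:
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    print("Would run:", " ".join(run_fix(dry_run=True)))
```

Defaulting to a dry run lets on-call engineers rehearse the runbook during a game day without touching production.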