Incident Response
The process of detecting, responding to, and resolving system incidents or outages.
"Put the Fire Out"
Incident Response is the tactical, immediate action taken when things go wrong. It is the firefighter kicking down the door.
The Lifecycle of Response
- Detect (T0): Monitoring alerts the team.
- Acknowledge: On-call engineer responds.
- Mobilize: Incident Commander assigned. Channel created.
- Triage: Assess severity (SEV level).
- Mitigate: Stop the bleeding (rollback, scaling).
- Resolve: Restore full service.
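The six stages above can be read as a strictly ordered sequence, which a small state machine makes explicit. The following is a minimal sketch, not a reference implementation: the names `Stage`, `Incident`, and `advance` are hypothetical and chosen for illustration, and real incident tooling usually allows more flexible transitions (for example, re-triaging after a failed mitigation).

```python
from enum import IntEnum


class Stage(IntEnum):
    """Lifecycle stages in the order listed above (names are illustrative)."""
    DETECT = 0
    ACKNOWLEDGE = 1
    MOBILIZE = 2
    TRIAGE = 3
    MITIGATE = 4
    RESOLVE = 5


class Incident:
    """Hypothetical incident record that only moves forward one stage at a time."""

    def __init__(self, title):
        self.title = title
        self.stage = Stage.DETECT      # T0: monitoring has alerted the team
        self.history = [Stage.DETECT]  # stages reached so far, in order

    def advance(self):
        """Move to the next stage; refuse to go past RESOLVE."""
        if self.stage == Stage.RESOLVE:
            raise ValueError("incident is already resolved")
        self.stage = Stage(self.stage + 1)
        self.history.append(self.stage)
        return self.stage
```

Calling `advance()` five times on a new `Incident` walks it from `DETECT` to `RESOLVE`; a sixth call raises `ValueError`, which mirrors the idea that Resolve is the end of the response and the investigation moves to the review.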
Speed vs. Accuracy
The goal of incident response is mitigation, not necessarily fixing the root cause. If a server is crashing, reboot it to get users back online. Work out why it crashed later, in the Post-Incident Review.
Example: The Black Friday Crash
"Traffic spiked 10x on Black Friday, crashing the checkout service."
Impact
The company was losing $100k per minute.
Resolution
The team didn't try to optimize the code live. They simply provisioned 500 extra servers (costing $5k) to handle the load. They investigated the code inefficiency the next week.
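The arithmetic behind that decision is worth spelling out. The sketch below uses the two figures from the example ($100k lost per minute, $5k for the extra servers). The function names are hypothetical, and the ten minutes of downtime avoided is an assumed value for illustration only, since the example does not say how long a live fix would have taken.

```python
LOSS_PER_MINUTE = 100_000  # dollars lost per minute of outage (from the example)
MITIGATION_COST = 5_000    # dollars spent on extra servers (from the example)


def break_even_seconds(loss_per_minute, mitigation_cost):
    """Seconds of avoided downtime needed for the mitigation to pay for itself."""
    return mitigation_cost * 60 / loss_per_minute


def net_saving(loss_per_minute, mitigation_cost, minutes_saved):
    """Dollars saved if the mitigation shortens the outage by `minutes_saved`."""
    return loss_per_minute * minutes_saved - mitigation_cost


seconds = break_even_seconds(LOSS_PER_MINUTE, MITIGATION_COST)
# ASSUMPTION: 10 minutes saved is a made-up figure, not part of the example.
saving = net_saving(LOSS_PER_MINUTE, MITIGATION_COST, minutes_saved=10)
print(f"break-even after {seconds:.0f} s; net saving for 10 min: ${saving:,}")
```

With these numbers the extra servers pay for themselves after three seconds of avoided downtime, which is why spending money on a blunt mitigation beats optimizing code during the outage.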
Why Incident Response Matters
Good incident response minimizes downtime, customer impact, and team stress.
Every organization will face incidents. The difference is how well you respond.
Common Pitfalls
Hero Mode
One person trying to do everything (Command, Comms, fixing). Assign roles immediately.
Chaos
20 people talking at once. The IC must enforce radio discipline.
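One way to guard against Hero Mode is to check role assignments as soon as the incident channel opens. The following is a minimal sketch under stated assumptions: the role names `command`, `comms`, and `fix` come from the three duties listed under Hero Mode, while `check_roles` and the people in the usage note are hypothetical.

```python
# Role names taken from the duties listed above; the set itself is an assumption.
ROLES = ("command", "comms", "fix")


def check_roles(assignments):
    """Return a list of problems with a {role: person} mapping (empty if none).

    Flags any role that nobody holds and any person holding more than one role.
    """
    problems = []
    for role in ROLES:
        if role not in assignments:
            problems.append(f"unassigned role: {role}")

    roles_by_person = {}
    for role, person in assignments.items():
        roles_by_person.setdefault(person, []).append(role)

    for person, roles in roles_by_person.items():
        if len(roles) > 1:
            problems.append(
                f"{person} holds multiple roles: {', '.join(sorted(roles))}"
            )
    return problems
```

For instance, `check_roles({"command": "ana", "comms": "ana", "fix": "ana"})` reports that one person holds all three roles, while a mapping with three different people returns an empty list.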
How to Use Incident Response
- Have Runbooks: Pre-written response procedures.
- Practice Regularly: Game days build muscle memory.
- Review & Improve: Post-mortems drive continuous improvement.
Frequently Asked Questions
Who is in charge?
The Incident Commander (IC). Even if the CEO is on the call, the IC calls the shots.
Should we fix the bug during the incident?
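Only if there is no other way to mitigate. A rollback returns you to a version that was already running in production, so it is generally safer than rolling forward with a hotfix written under pressure.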