A VP of Engineering at a Series B startup said something that stuck:
"We're pretty good at incident response. Our MTTR is solid, people know what to do when things break. But incident management? That's a mess. We have the same postmortem discussion every month, nothing changes, and I can't tell you the last time we updated our runbook."
Definition: Incident response
One-time, time-bound work during an active incident: declare, coordinate, restore service, and communicate.
Definition: Incident management
Ongoing work across the incident lifecycle: preparedness, runbooks, training, postmortems, and trend analysis to reduce recurrence.
He was describing something that tends to show up as teams scale: confusing two very different things.
Many teams are fast at fixing things but slow at learning. The same database outage happens every quarter. The runbook is 8 months out of date. Nobody reviews incident trends.
This article explains the difference, why it matters, and how to fix the imbalance in your incident management process.
Contents:
- The Difference
- Why teams confuse them
- Failure modes
- How to build both
- What to focus on first
- FAQ
The Difference
| | Incident Response | Incident Management |
|---|---|---|
| What it is | Tactical execution during an incident | Strategic oversight of the entire incident lifecycle |
| Timeframe | Minutes to hours (while incident is active) | Ongoing, always (between incidents too) |
| Goal | Restore service fast | Reduce incident frequency and severity over time |
| Mindset | Urgent, reactive | Deliberate, proactive |
| Key activities | Declare, coordinate, fix, communicate | Postmortems, runbooks, on-call, training, trend analysis |
| Success metric | MTTR (Mean Time To Restore) | Incident frequency, repeat incident rate, MTTD (mean time to detect), action completion rate |
| Who owns it | Incident Lead (temporary role during incident) | Engineering team (ongoing responsibility) |
| Skills required | Debugging, communication, decisions under pressure | Process design, facilitation, data analysis, coaching |
Incident response is what you do during the outage. Incident management is what you do between outages.
Key takeaways:
- Incident response restores service; incident management prevents recurrence
- MTTR can improve while reliability worsens: if recurrence stays high, you're just getting faster at fixing the same problems
- Friction kills follow-through. Make posting updates, editing runbooks, and closing action items easy if you want them to actually happen
- The best teams treat incidents as a system to improve over time, not a series of one-off emergencies
If You Do Nothing Else This Week
Define severity (SEV0–SEV3) and response roles (Incident Lead, Comms, Assigned Engineer). Everyone should know what SEV0 means and who does what when it happens.
Set update cadence (every 15–30 minutes) and a single source of truth. Not DMs, not email threads. Just one place where everyone can see what's happening.
Require postmortems for SEV0/1 and "new failure modes." If you've seen this incident 10 times before, you don't need another postmortem; you need to finally execute on the previous one's action items. Track three metrics: repeat-incident rate, action-item closure rate, and mean time to detect (MTTD). MTTR matters, but repeat rate tells you whether you're actually improving.
Do a 30-minute monthly incident review with one owner. Someone looks at the data and asks "what patterns do we see?" That's it. No marathon session, no slides, just pattern recognition.
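To make those three metrics and the monthly pattern review concrete, here's a minimal sketch in Python. The record fields (`service`, `cause`, `started_at`, `detected_at`, the action-item counts) are assumptions about what your incident tracker can export, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class Incident:
    service: str                  # e.g. "payments-db"
    cause: str                    # e.g. "connection-pool-exhaustion"
    started_at: datetime          # when the issue actually began
    detected_at: datetime         # when an alert fired or a human noticed
    action_items_total: int = 0
    action_items_closed: int = 0

def repeat_incident_rate(incidents: List[Incident]) -> float:
    """Share of incidents whose (service, cause) pair has been seen before."""
    seen, repeats = set(), 0
    for inc in incidents:
        key = (inc.service, inc.cause)
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / len(incidents) if incidents else 0.0

def action_item_closure_rate(incidents: List[Incident]) -> float:
    """Closed action items as a share of all action items assigned."""
    total = sum(i.action_items_total for i in incidents)
    closed = sum(i.action_items_closed for i in incidents)
    return closed / total if total else 1.0

def mean_time_to_detect(incidents: List[Incident]) -> timedelta:
    """Average gap between an issue starting and someone noticing it."""
    gaps = [i.detected_at - i.started_at for i in incidents]
    return sum(gaps, timedelta()) / len(gaps) if gaps else timedelta()
```

If the repeat rate climbs while MTTR falls, you're getting faster at fighting the same fires, which is exactly the trap described below.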
Why Teams Keep Confusing Them
"Our MTTR is under an hour. We handle SEV0/1 incidents."
That was Sarah, an EM at a 60-person fintech company. Their MTTR was 42 minutes, solid. But underneath that, the runbook was last updated in March. They'd had the same connection pool exhaustion issue three times in six months. Postmortems were "whenever we get to it" (often never). No one looked at incident trends or patterns. On-call was "whoever's around."
They were confusing fast response with good management.
Then there's the friction problem.
Postmortems feel like homework because you're writing in a Google Doc, then copying to Confluence, then making a Jira ticket, then posting in Slack. Runbooks don't get updated because editing them is a pain. Trend analysis doesn't happen because you're exporting CSVs and making charts in spreadsheets.
As one team put it: "We have 40-page runbooks that no one has opened in 6 months. I can't blame them. Editing them is terrible."
They're not undisciplined. They're working against friction.
Both teams treat incident management as an extension of incident response. But they're different disciplines. Response is tactical, urgent, and short term: fix the problem, focus on execution, ask "how do we fix this?" Management is strategic, deliberate, and long term: fix the system, focus on design, ask "how do we prevent this?"
A 15-minute MTTR means nothing if the same outage happens every quarter.
What Happens When You Focus on Only One
Strong Response, Weak Management
Great MTTR but the same incidents keep happening. Postmortems are written but nothing changes. Runbooks exist but are outdated. No one knows if things are getting better. A 50-person B2B SaaS company had a database outage in January 2024, wrote a postmortem, then had the same outage in March, May, and again in December.
"I realized we'd never actually done anything the postmortem recommended. We just filed it away and waited for the next incident."
Fast at fixing, slow at learning. Stuck in reactive mode, never getting ahead of incidents. Great MTTR looks good on a dashboard, but if the same database outage happens every quarter, you're not actually improving. You're optimizing for speed while ignoring recurrence.
Strong Management, Weak Response
Detailed processes and runbooks nobody has read. Quarterly incident reviews but chaos during actual incidents. Great analysis culture but slow execution when things break. Roles unclear during incidents.
One Series A team shared their 40-page incident response handbook. It had been meticulously written by their former Head of Infrastructure. When asked who'd read it, the room went quiet. During their last SEV0, no one could find the escalation tree. The incident took 3 hours to resolve. It should have taken 45 minutes.
Great plans that fall apart in the heat of the moment. Great postmortems don't matter if customers wait hours for a fix that should take minutes. You're optimizing for learning while ignoring execution.
How to Build Both
Here's what good looks like, with specific examples.
Incident Response: Fast, Coordinated, Consistent
Good incident response isn't just fast fixing. It's coordinated fixing.
Bad response looks like: 15 people debugging the same thing, nobody coordinating, updates scattered across Slack DMs, no one sure who's working on what.
Good response looks like: One person declares. One Incident Lead coordinates. One Assigned Engineer fixes. Updates in one place. Everyone knows who's doing what.
Clear roles are essential. The Incident Lead coordinates while the Assigned Engineer fixes; split the work. Declare fast: say "This is SEV2" in 30 seconds instead of debating for 10 minutes. Keep updates in one place where everyone can see them, not scattered across DMs or email threads. If there's no response in 10 minutes, page the backup immediately. And stabilize first: rollback beats fix-forward when customers are waiting.
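As a rough sketch of that 10-minute escalation rule, assuming your paging tool can be wrapped in a simple `page()` callable; the function and argument names here are illustrative, not a specific vendor's API.

```python
from datetime import datetime, timedelta, timezone

ACK_TIMEOUT = timedelta(minutes=10)  # "no response in 10 minutes" threshold

def escalate_if_unacknowledged(paged_at: datetime, acknowledged: bool,
                               primary: str, backup: str, page) -> str:
    """Return who currently owns the page; escalate to backup on timeout.

    `page` is a placeholder callable wrapping whatever paging tool you use
    (PagerDuty, Opsgenie, a Slack webhook, ...).
    """
    if acknowledged:
        return primary
    if datetime.now(timezone.utc) - paged_at >= ACK_TIMEOUT:
        page(backup, reason=f"no ack from {primary} within 10 minutes")
        return backup
    return primary  # still inside the window; keep waiting
```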
This is tactical execution. It's what you do in the heat of the moment.
Incident Management: Continuous Improvement, Not Theater
Good incident management means reducing friction everywhere. When the right thing to do is also the easy thing to do, teams actually do it.
For postmortems, one team assigned action items IN the postmortem doc, not in a separate Jira ticket. Teams with separate tickets struggle to close them, while inline assignments get done. They set deadlines 2 weeks out, not "Q2." A vague timeline means it never happens.
For runbooks, update them when things change, not 8 months later. Make them easy to edit. One team updates runbooks inline during postmortems: the facilitator types changes directly into the doc while everyone reviews. No separate Google Doc, no copy-paste to Confluence later.
For on-call, clear rotations. Not "whoever's around." Make handoffs frictionless. One team used a simple Slack bot that auto-assigned the next person in rotation. When the person who wrote the Slack script left, the rotation broke. Build for sustainability.
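Here's a sketch of what a more durable version of that rotation might look like, assuming the roster lives in a JSON file checked into a shared repo rather than in one engineer's script. The file name and weekly cadence are assumptions.

```python
import json
from datetime import date
from pathlib import Path
from typing import Optional

ROSTER = Path("oncall_roster.json")  # lives in a shared repo, not on one laptop

def on_call_for(today: Optional[date] = None) -> dict:
    """Weekly rotation derived purely from the calendar and the roster file."""
    today = today or date.today()
    people = json.loads(ROSTER.read_text())   # e.g. ["alice", "bob", "carol"]
    week = today.isocalendar()[1]              # ISO week number
    return {
        "primary": people[week % len(people)],
        "backup": people[(week + 1) % len(people)],
    }
```

Because the schedule is a pure function of the date and a file anyone can edit, it keeps working after its author leaves.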
For trend analysis, someone reviews incident data monthly. Ask "what patterns do we see?" Make the data visible. One team set up an auto-generated CSV that posts to Slack every Monday. No manual exports, no spreadsheets.
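Here's one way that Monday post might be generated, using the same (service, cause) grouping as the metrics sketch above. The channel name, the bot-token environment variable, and the incident dict shape are assumptions; `chat_postMessage` is the standard Slack Web API method exposed by `slack_sdk`.

```python
import os
from collections import Counter
from slack_sdk import WebClient  # pip install slack_sdk

def post_weekly_trends(incidents: list, channel: str = "#incident-trends") -> None:
    """Post the top recurring (service, cause) patterns instead of hand-made CSVs.

    Each incident is assumed to be a dict with "service" and "cause" keys.
    """
    patterns = Counter((i["service"], i["cause"]) for i in incidents)
    lines = [f"{count}x {service}: {cause}"
             for (service, cause), count in patterns.most_common(5)]
    text = "Recurring incident patterns this period:\n" + "\n".join(lines)
    WebClient(token=os.environ["SLACK_BOT_TOKEN"]).chat_postMessage(
        channel=channel, text=text)
```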
For training, new engineers know the process before their first SEV0. Make learning accessible. One team does quarterly "game days" where they practice a simulated incident. No production stress, just learning.
The pattern: reduce friction everywhere. When postmortems are easy to write, runbooks are easy to update, and incident data is easy to see, teams actually do the work.
Which Should You Focus On First?
| Your situation | Focus on this first | Why |
|---|---|---|
| New team, first real incidents | Response | Don't even think about management until you've handled 10+ incidents. You can't design a system you haven't experienced. |
| MTTR solid but same fires recur | Management | Pick ONE recurring incident and fix it completely before building process. Process without a win feels like bureaucracy. |
| Incidents chaotic and slow | Response | Fix execution before you optimize for learning. Coordination breakdowns kill response speed. |
| Postmortems never lead to changes | Management | You have the response process. Now build the learning loop. Friction is the enemy. Make action items trackable in the postmortem doc itself. |
| On-call burnout high | Both | Response needs less chaos (coordination). Management needs better rotations (sustainability). |
Quick wins by situation:
- New team: Define SEV0/1, declare in Slack, assign one Incident Lead
- Same fires recurring: Close ONE recurring incident's action items completely
- Chaotic incidents: Use one Slack channel, one Incident Lead, updates every 15 min
- Postmortems don't lead to change: Assign action items IN the postmortem doc with 2-week deadlines
- On-call burnout: Set primary+backup rotation, use escalation rules
The Bottom Line
Both matter, and in practice teams hit a ceiling when they treat these as the same thing or invest in only one. Strong response, weak management means the same fires every month, reactive forever. Strong management, weak response means great plans that fall apart when things break.
The best teams are fast at fixing things AND systematic about learning.
Don't be the team with 40-page runbooks no one reads. Don't be the team fighting the same database outage every quarter. Build both.
FAQ
Our MTTR is great but we keep having the same outages. What are we missing?
What metrics matter besides MTTR?
What should a lightweight postmortem include?
When should we actually write a postmortem vs just fix and move on?
How do I convince my team to actually update runbooks?
What's the difference between Incident Lead and incident management?
Why do we keep fighting the same fires every month?
Mini glossary:
MTTR: Mean time to restore service
MTTD: Mean time to detect (the average time from when an issue occurs to when an alert fires)
PIR: Post-incident review or postmortem
Incident Lead: The person coordinating the response during an incident
SEV0–SEV3: Severity levels (define yours: SEV0 is critical, SEV3 is minor)
Related guides (if you want templates):
- Incident Response Playbook: Scripts, Roles & Templates - Tactical execution during incidents
- Post-Incident Review Templates - Strategic learning after incidents
- On-Call Rotation Guide - Building sustainable on-call
- Scaling Incident Management - How teams evolve as they grow