incident-response · incident-management · incident-lead

Incident Response Playbook: Scripts, Roles & Templates (Slack)

Slack incident response playbook: roles, severity levels, update cadence, escalation rules, and copy/paste templates for production outages. Based on real teams.

Runframe Team · Jan 7, 2026 · 13 min read


Your phone lights up at 2:47 AM. PagerDuty, Slack, maybe a phone call from your CEO. Something's broken. People are noticing. And you're the one who has to fix it.

We've talked to dozens of engineering teams about incidents. The thing that comes up over and over: the debugging isn't the hard part. It's the chaos that kills you.

Who's in charge? What do we tell customers? Why are 15 people asking for updates in DMs? Should we call a Zoom? Is this SEV1 or SEV2?

The outage is the easy part. The chaos is what makes incidents last 3 hours instead of 30 minutes.

What Is Incident Response?

Incident response isn't debugging. Debugging happens after.

Incident response is what happens the second after the alert fires:

  • Declaration: Announcing the incident and severity
  • Coordination: Assigning roles (Incident Lead, Assigned Engineer)
  • Investigation: Finding and fixing the root cause
  • Communication: Keeping stakeholders and customers informed
  • Resolution: Confirming the fix and documenting what happened

Goal: restore service fast, then prevent recurrence.

Most Teams Get This Wrong

We talked to a 40-person B2B SaaS company that got hit with a SEV0 at 3 AM. Database went down. Checkout completely broken.

Want to know what went wrong?

No one declared it. People started debugging in DMs. 45 minutes in, the CEO joined Slack and asked "is anyone working on this?"

The person debugging was also trying to coordinate. They were updating support, fielding questions from leadership, AND trying to debug. Both suffered.

They kept saying "fixed in 5 minutes" - repeated every 10 minutes for 2 hours. Trust evaporated.

The incident dragged on not because the engineering problem was hard, but because the coordination was broken.

Same team, next SEV0? They used a clear playbook. Resolved in 52 minutes. Same engineers, different process.

What Actually Works

In our conversations with engineering teams, the fast ones are consistent about seven things:

| Slow Teams Do | Fast Teams Do | Impact |
| --- | --- | --- |
| Debate severity for 10+ minutes | Declare in 30 seconds: "This is SEV2, adjusting if needed" | Cuts coordination delay |
| One person tries to coordinate + debug | Split roles: Lead coordinates, Engineer fixes | Lower MTTR |
| Updates via DM or "hop on a call" | Updates in channel, pinned, on the severity cadence | Stops "any update?" pings |
| "Should be fixed in 5 min" (repeated) | "ETA unknown, investigating", then actual ETA when known | Trust maintained |
| Escalate after 30 min of silence | Response timer by severity: no response → page backup → EM | Faster time to fix |
| Forget support team until postmortem | Notify support immediately: "Here's your script" | Support not overwhelmed |
| End with "cool, it's fixed" | Post resolution summary + assign postmortem owner | Learning captured |

Same engineers, different process.

The First 5 Minutes

Incidents live or die in the first 5 minutes. Declare fast, split roles, stabilize. The rest is details.

Step 1: Declare in 30 seconds

Post this in your incident channel:

🚨 Incident declared. Starting at SEV2 while we investigate.

Don't debate severity while production is burning.

An EM we interviewed put it bluntly: "We lost 15 minutes once arguing SEV1 vs SEV2. Meanwhile, customers couldn't check out. Just declare it. You can always downgrade later."

If anyone argues, say this:

"Let's start at SEV2. If it's worse, we escalate. If it's better, we downgrade. Arguing costs more time than fixing."

Step 2: Assign roles in 60 seconds

If no one steps up in 60 seconds, YOU do it.

πŸ‘€ I'm Incident Lead. @bob is Assigned Engineer.

Or if someone else should lead:

πŸ‘€ @alice is Incident Lead. I'll assist as needed.

Incident Lead coordinates. Assigned Engineer fixes. Split the work.

[!TIP]
If you don't pin the incident state immediately, you'll repeat yourself to every latecomer.

Step 3: Stabilize first, root cause later

Your goal is to restore service FIRST, understand SECOND. Every minute of downtime costs money and trust. Root cause analysis comes after customers are unblocked.

Use this priority list:

  1. Rollback - If you deployed recently, roll it back. Now.
  2. Failover - Switch to backup region, database, or cluster.
  3. Kill switch - Disable the failing feature. Stop the bleeding.
  4. Fix Forward - Only if rollback is riskier than a patch.

[!IMPORTANT]
Fix-forward is usually slower than rollback. If it's not trivial, prefer rollback.

Step 4: Set severity + start the response timer

Post this:

πŸ”₯ SEV2 - Checkout API errors, ~40% of transactions failing

| Severity | When to Use | Example | Page on-call? |
| --- | --- | --- | --- |
| SEV0 | All customers down, business not operating | Checkout completely broken, 0% transactions | YES, immediately |
| SEV1 | Major feature broken, significant impact | API down, 50%+ customers affected | YES, immediately |
| SEV2 | Partial outage, some customers affected | Degraded performance, ~20% affected | Yes, if ≥20% of requests failing for 10+ min or checkout/revenue impacted |
| SEV3 | Minor issues, limited impact | Single feature broken, <5% affected | No |

Escalation rules (by severity)

| Severity | Time to backup | Time to EM (if IC + backup unresponsive) |
| --- | --- | --- |
| SEV0/1 | 5 minutes | 10 minutes |
| SEV2 | 10 minutes | 30 minutes |
| SEV3+ | Handle async | Only if impact grows |

Use the timer. Don't hesitate.
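If you'd rather have tooling enforce this policy than memory, the whole thing fits in a small data structure. Here's a minimal sketch in Python; the thresholds mirror the tables above, and the names (`SEVERITY_POLICY`, `should_page_oncall`) are ours for illustration, not any standard API, so adapt them to your own alerting setup.

```python
# Severity policy as data: update cadence (minutes), whether to page on-call
# immediately, and the escalation timer (minutes to backup / minutes to EM).
# Values mirror the tables above; tune them for your team.
SEVERITY_POLICY = {
    "SEV0": {"update_every_min": 10, "page_oncall": True,  "backup_after_min": 5,    "em_after_min": 10},
    "SEV1": {"update_every_min": 15, "page_oncall": True,  "backup_after_min": 5,    "em_after_min": 10},
    "SEV2": {"update_every_min": 30, "page_oncall": False, "backup_after_min": 10,   "em_after_min": 30},
    "SEV3": {"update_every_min": 60, "page_oncall": False, "backup_after_min": None, "em_after_min": None},
}

def should_page_oncall(severity: str, pct_requests_failing: float = 0.0,
                       failing_minutes: int = 0, revenue_impacted: bool = False) -> bool:
    """Apply the paging rules above: SEV0/1 always page; SEV2 pages only if
    >=20% of requests have been failing for 10+ minutes or revenue is hit."""
    if SEVERITY_POLICY[severity]["page_oncall"]:
        return True
    if severity == "SEV2":
        return (pct_requests_failing >= 20 and failing_minutes >= 10) or revenue_impacted
    return False
```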

Step 5: Create the incident channel

One place for updates. If you jump on a call, paste a 2–3 line summary back into the channel.

Name it clearly: #inc-checkout-api-2026-01-07 or #incidents-123

Post this as your first message:

🚨 INCIDENT DECLARED

πŸ“Š Severity: SEV2
πŸ‘€ Incident Lead: @alice
πŸ”§ Assigned Engineer: @bob
πŸ“ Status: Investigating high error rate on checkout API
πŸ• Started: 2:47 AM

πŸ’¬ Updates: SEV0 10m Β· SEV1 15m Β· SEV2 15–30m Β· SEV3 30–60m
πŸ“Œ Latest update will be pinned here

Pin that message. Latecomers shouldn't have to scroll.
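If you'd rather script this step than copy-paste at 3 AM, the Slack Web API covers every piece: conversations.create, chat.postMessage, and pins.add. A minimal sketch with the slack_sdk Python package follows; the channel name, user IDs, and incident details are placeholders.

```python
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

# 1. Create the incident channel (name format: inc-<service>-<date>).
channel = client.conversations_create(name="inc-checkout-api-2026-01-07")
channel_id = channel["channel"]["id"]

# 2. Post the declaration message with severity, roles, status, and cadence.
declaration = (
    ":rotating_light: INCIDENT DECLARED\n\n"
    ":bar_chart: Severity: SEV2\n"
    ":bust_in_silhouette: Incident Lead: <@U_ALICE>\n"   # placeholder user IDs
    ":wrench: Assigned Engineer: <@U_BOB>\n"
    ":memo: Status: Investigating high error rate on checkout API\n\n"
    ":speech_balloon: Updates: SEV0 10m / SEV1 15m / SEV2 15-30m / SEV3 30-60m"
)
msg = client.chat_postMessage(channel=channel_id, text=declaration)

# 3. Pin it so latecomers see the current state without scrolling.
client.pins_add(channel=channel_id, timestamp=msg["ts"])
```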

Not running the incident? Stay out of the way.

Not Incident Lead or Assigned Engineer? Stay out of the way.

Don't:

  • DM the assigned engineer asking for updates
  • Hop on a call uninvited
  • Offer unsolicited advice

Do:

  • Check the pinned message
  • Post relevant info in the channel (logs, context, recent changes)
  • Let them work

The most helpful thing you can do is not add noise.

Roles: Who Does What

Clear roles stop two things: silence and duplicate work.

Incident Lead

Your job:

  • Keep updates flowing (SEV0: 10m, SEV1: 15m, SEV2: 15–30m, SEV3: 30–60m)
  • Ask "what do you need?" not "what's the fix?"
  • Make the call: rollback vs fix forward, escalate vs wait, add people vs stay focused
  • Run interference so the Assigned Engineer can work

Your job is NOT:

  • Debugging
  • Writing code
  • Fixing the problem

If you catch yourself debugging, say this:

"I'm Incident Lead, I shouldn't be debugging. @charlie, can you take over investigation? I'll coordinate."

Assigned Engineer

Your job:

  • Fix the problem
  • Post updates when you have them (Incident Lead will remind you)
  • Ask for what you need

Your job is NOT:

  • Explaining what you're doing every 3 minutes
  • Managing the channel
  • Coordinating other people

If people keep DMing you:

"I'm heads down fixing. Check the pinned message in #incidents-123. If you need something, ping @incident-lead."

Ops Lead (optional, SEV0/1 only)

Add if: 3+ services failing OR 2+ teams involved OR access/permissions blocking progress

Don't add if: Single service, single team incident with clear path forward

πŸ› οΈ Operations Lead here. Access issues? Permission problems? Coordination across teams? Ping me.

Comms Lead (optional, SEV0/1 only)

Add if: SEV0/SEV1 OR need public status page OR support team getting hammered

Don't add if: SEV3 or no customers impacted

πŸ“£ Comms Lead here. Working on support script + status page. Engineers: focus on fixing. I'll handle the "any ETA?" questions.

Scribe (optional)

Job: Capture the timeline and key decisions for the postmortem. In high-stakes incidents, the Incident Lead is too busy to take notes.

Why split roles?

One person trying to coordinate AND debug? Both suffer.

A 50-person fintech company told us: "Splitting roles was the single biggest improvement to our MTTR. We used to have one person doing everything - coordinating, debugging, talking to support. Both suffered. Now we split it and incidents are way shorter."

[!TIP]
If nobody owns communication, customers assume the worst.

Update Cadence

Post this cadence line at the top of the channel: ⏱️ UPDATE CADENCE: SEV0 10m · SEV1 15m · SEV2 15–30m · SEV3 30–60m

Then every update follows this template:

πŸ“ Current: [1 line, what users see]
πŸ”„ Next: [specific action you're taking]
⏱️ ETA: [time or "unknown"]
🚫 Blockers: [what's blocking, or "None"]

(Next update at: [time])

Every time you post an update, pin it. Remove the old pin.
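If a bot posts the cadence updates for you, the pin swap is two calls: pins.remove for the old update, pins.add for the new one. A rough sketch using slack_sdk; the helper name and arguments are illustrative, not a library API.

```python
from slack_sdk import WebClient

def post_update(client: WebClient, channel_id: str, update_text: str,
                previous_update_ts: str | None = None) -> str:
    """Post a status update, pin it, and unpin the previous update.
    Returns the new message timestamp to pass back in on the next update."""
    msg = client.chat_postMessage(channel=channel_id, text=update_text)
    if previous_update_ts:
        client.pins_remove(channel=channel_id, timestamp=previous_update_ts)
    client.pins_add(channel=channel_id, timestamp=msg["ts"])
    return msg["ts"]
```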

Escalation: Use the Timer

Use the timer. Don't hesitate.

For "no response from IC": SEV0/1 β†’ backup 5 min, EM 10 min. SEV2 β†’ backup 10 min, EM 30 min. SEV3 β†’ async.

For blocked decisions / multi-team: Page EM immediately.

If someone hesitates to escalate:

"This isn't about bothering people. It's about fixing the problem. If they're asleep and unresponsive, we need someone who isn't."

Example Timeline

  • 02:13 β€” PagerDuty: high error rate in checkout-api
  • 02:14 β€” SEV1 declared in #incidents
  • 02:15 β€” @alice takes Incident Lead, @bob is Assigned Engineer
  • 02:16 β€” #inc-checkout-api-2026-01-07 created, incident state pinned
  • 02:21 β€” Rollback decision (recent deploy noticed)
  • 02:28 β€” Customer update posted + support script sent
  • 02:35 β€” Rollback complete, errors dropping
  • 02:41 β€” Stabilized, monitoring
  • 03:05 β€” Resolved, postmortem owner assigned

52 minutes total. The key wasn't brilliant debugging. It was clear roles, regular updates, fast rollback.

Customer & Support Comms

Support messages need four things: the issue, customer impact, current status, and the next update time.

SUPPORT SCRIPT:

Issue: We're investigating an issue affecting [service/feature]
Impact: [who is affected + what they can't do]
Status: [investigating / identified / mitigating / monitoring]
Workaround: [if any, otherwise "None at this time"]
Next update: [time] (we'll post again even if ETA is unknown)
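If you generate this from structured fields instead of editing a doc under pressure, a tiny formatter keeps the required pieces from getting dropped. An illustrative sketch; the field names mirror the template above.

```python
def support_script(issue: str, impact: str, status: str,
                   next_update: str, workaround: str | None = None) -> str:
    """Render the support-script fields into a paste-ready message."""
    return (
        f"Issue: We're investigating an issue affecting {issue}\n"
        f"Impact: {impact}\n"
        f"Status: {status}\n"
        f"Workaround: {workaround or 'None at this time'}\n"
        f"Next update: {next_update} (we'll post again even if ETA is unknown)"
    )

# Example
print(support_script(
    issue="checkout",
    impact="~40% of checkout attempts fail with a 500 error",
    status="identified",
    next_update="03:15 local",
))
```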

Status page updates

| Severity | Post public status update? | What to say |
| --- | --- | --- |
| SEV0 | YES, immediately | "We're investigating an issue affecting [service]. More details soon." |
| SEV1 | YES | "We're investigating degraded performance on [feature]." |
| SEV2 | Probably | If enough customers are impacted, post an update |
| SEV3 | No | Minor issues don't need public posts |

Status page progression:

  1. "We're investigating" β†’ 2. "Identified the issue" β†’ 3. "Fixing" β†’ 4. "Resolved"

Internal stakeholders

Management will ask for updates. Give them a summary, don't let them micromanage.

Post this in #incidents-leadership or DM your EM:

πŸ‘” LEADERSHIP UPDATE

Incident: [Brief description]
Severity: [SEV0/1/2/3]
Status: [What's happening]
Who's fixing: @assigned-engineer
ETA: [If known]
Need anything: [What you need from leadership, or "Nothing, just keeping you informed"]

If leadership starts micromanaging:

"I understand this is stressful. The best thing you can do is let the team focus. I'll post an update in 15 minutes."

Closing the Incident

Without proper closure, you're just firefighting. With it, you have an actual incident process.

Resolution needs: what broke / why / what fixed it / how you'll prevent recurrence, plus a postmortem owner and due date.

βœ… RESOLUTION SUMMARY:

What broke: [system/component]
Customer impact: [who/what/how long]
Why it broke: [cause, or "unknown"]
What fixed it: [rollback/fix/flag/scale]
What we'll do to prevent it: [1–3 bullets]

πŸ“ Postmortem owner: @name
⏰ Postmortem due: [date, local time]
πŸ“Ž Links: [incident channel] [dashboards] [PRs] [status page]

Assign postmortem owner

NOT necessarily the Incident Lead. They're probably tired.

πŸ“ POSTMORTEM

@bob β€” you're up. Postmortem due by end of next business day (local time).
Focus on: What happened, why it happened, how to prevent it.
Incident timeline is in the pinned message.

If anyone pushes back:

"No deadline = no postmortem. Even a rough draft is better than nothing. End of next business day. If you need help, ask."

Close the incident

πŸ”š INCIDENT CLOSED

Thanks everyone. Clearing roles.
Channel will be archived in 24 hours (or per policy).
Postmortem discussion will happen in #postmortem-api-outage-2026-01-07

Anti-Patterns We Hear About

These patterns show up in almost every team we talk to.

Hero mode

One person trying to fix everything alone. "I've got this."

Problem: Burnout and slower resolution. One person at 3 AM after 4 hours misses things that two fresh people would catch.

If you see hero mode:

πŸ›‘ @hero-engineer β€” you've been at this for 3 hours. Take a break. @backup-1 is taking over investigation for the next hour.

Silent debugging

No updates for 45 minutes while people wonder what's happening.

Problem: Latecomers ask the same questions over and over. Stakeholders DM random engineers.

If you see silent debugging:

⏰ @assigned-engineer β€” haven't seen an update in 30 minutes. Can you post a status? Even "still investigating" helps.

Blame hunting

"Who deployed this?" "Who wrote this code?"

Problem: Kills psychological safety. People hide incidents next time. Problems get worse.

If you see blame hunting:

πŸ›‘ STOP.

We don't care who deployed this. We care about:
1. What broke
2. Why it broke
3. How to fix it
4. How to prevent it

Save the "who" for the postmortem, and even then focus on systems not people.

Meeting while it's burning

"Hop on a Zoom call" before you even know what's broken.

Problem: 10 people staring at each other while 1 person types. 9 people could be doing something useful.

A meeting during active mitigation is usually a coordination failure. Investigate first. Figure out what's broken. Only call a meeting if you need rapid, multi-person back-and-forth.

Optimism bias

"Should be fixed in 5 minutes" - repeated every 5 minutes for an hour.

Problem: Repeated missed ETAs destroy trust.

Say this instead:

⏱️ ETA: Unknown. Investigating.

Quick Reference

FIRST 5 MINUTES:

  • Declare it: "This is an incident, SEV2"
  • Name Incident Lead: "I'm taking Incident Lead" or "@alice is Incident Lead"
  • Name Assigned Engineer: "@bob is Assigned"
  • Pick severity (use cheat sheet)
  • Create channel: #incidents-name-date
  • Post template and pin it

DECISION TREE:

3+ services failing? β†’ Add Ops Lead
2+ teams involved? β†’ Add Ops Lead
SEV0/SEV1? β†’ Add Comms Lead, page immediately
SEV2? β†’ Updates every 15-30 min
SEV3? β†’ Updates every 30-60 min
Missed update interval (SEV0/1)? β†’ Page backup/EM
Missed update interval (SEV2)? β†’ Check in
Stuck? β†’ Say it early, page expert

UPDATE TEMPLATE:

πŸ“ Current: [1 line]
πŸ”„ Next: [specific action]
⏱️ ETA: [time or "unknown"]
🚫 Blockers: [what's blocking or "None"]

ESCALATION:

SEV0/1: 5 min β†’ "@backup β€” you're up"
SEV0/1: 10 min β†’ "@em β€” need escalation"
SEV2: 10 min β†’ "@backup β€” you're up"
SEV2: 30 min β†’ "@em β€” need escalation"

CLOSEOUT:

βœ… What broke, why, what fixed it, preventing recurrence
πŸ“ Postmortem owner + deadline
πŸ”š Close incident

The Bottom Line

After talking to dozens of teams about their incidents, the same pattern keeps showing up: the teams that are good at this keep it simple.

Running a good incident isn't about frameworks. It's about five things:

  1. Declare fast: 30 seconds, not 10 minutes. You can always downgrade.
  2. Name roles: Incident Lead coordinates, Assigned Engineer fixes. Split the work.
  3. Update regularly: On the severity cadence, pinned. No silent debugging.
  4. Escalate when stuck: Use the response timer. Don't hero alone.
  5. Close properly: Resolution summary, postmortem owner, done.

The best teams don't over-engineer. They don't have 50-page runbooks. They have a simple, repeatable playbook.

Keep it simple.

Why We're Building Runframe

We're building Runframe because incident response shouldn't live in Google Docs and scattered Slack threads. Runframe replaces manual copy-pasting with a dedicated Slack workflow: it pages the right on-call engineers, spins up a channel, and forces structured updatesβ€”all without leaving Slack.

Want to automate this playbook? Join the waitlist.

FAQ

How long should an incident last?
Timelines vary. If a SEV2 is running >4 hours, reassess severity, staffing, and rollback/failover options.

Do we need a call for every incident?
No. Most incidents are better handled async in Slack. Calls make sense when you need rapid back-and-forth (usually SEV0/SEV1 with multiple teams).

When should we page people?
SEV0/SEV1 always. SEV2 only if β‰₯20% requests failing for 10+ min or checkout/revenue impacted, or if you're stuck. Otherwise handle async.

What if we can't find the root cause?
Write "unknown" in the postmortem and make investigation an action item. Honesty beats guessing.

Do we need a postmortem for every incident?
No. SEV3s might just need a short note. SEV0/SEV1 should always get a proper postmortem. SEV2s are a judgment callβ€”did we learn anything?

