Tags: incident-response · incident-management · incident-lead

Slack Incident Response Playbook: Roles, Scripts & Templates (Copy-Paste)

Stop the 3 AM chaos. Copy our battle-tested Slack incident playbook: includes scripts, roles, escalation rules, and templates for production outages.

Runframe Team · Jan 7, 2026 · 13 min read

Your phone lights up at 2:47 AM. PagerDuty, Slack, maybe a phone call from your CEO. Something's broken. People are noticing. And you're the one who has to fix it.

We've talked to dozens of engineering teams about incidents. The thing that comes up over and over: the debugging isn't the hard part; the coordination is. For more on why coordination matters more than raw speed, see our incident coordination guide on reducing context switching across tools and improving MTTR.

Who's in charge? What do we tell customers? Why are 15 people asking for updates in DMs? Should we call a Zoom? Is this SEV1 or SEV2?

The outage is the easy part. The chaos is what makes incidents last 3 hours instead of 30 minutes.

What Is Incident Response?

Incident response isn't debugging. Debugging happens after.

Incident response is what happens the second after the alert fires:

  • Declaration: Announcing the incident and severity
  • Coordination: Assigning roles (Incident Lead, Assigned Engineer)
  • Investigation: Finding and fixing the root cause
  • Communication: Keeping stakeholders and customers informed
  • Resolution: Confirming the fix and documenting what happened

Goal: restore service fast, then prevent recurrence.

Most Teams Get This Wrong

We talked to a 40-person B2B SaaS company that got hit with a SEV0 at 3 AM. Database went down. Checkout completely broken.

Want to know what went wrong?

No one declared it. People started debugging in DMs. 45 minutes in, the CEO joined Slack and asked "is anyone working on this?"

The person debugging was also trying to coordinate. They were updating support, fielding questions from leadership, AND trying to debug. Both suffered.

They kept saying "fixed in 5 minutes" - repeated every 10 minutes for 2 hours. Trust evaporated.

The incident dragged on not because the engineering problem was hard, but because the coordination was broken.

Same team, next SEV0? They used a clear playbook. Resolved in 52 minutes. Same engineers, different process.

Incident Response Approaches Compared

| Approach | Speed | Coordination | Works For | Breaks When |
|---|---|---|---|---|
| No playbook | Slow | Chaotic | <10 people | Any serious incident |
| Ad-hoc responses | Variable | Inconsistent | <30 people | Multiple concurrent incidents |
| Clear playbook (this approach) | Fast | Structured | 20-200 people | Nobody follows it |
| Enterprise ITSM | Slow | Heavy process | 200+ people | Too much overhead for smaller teams |

In our conversations, teams with clear playbooks resolve incidents 40-60% faster than teams responding ad hoc.

What Actually Works

In our conversations with engineering teams, the fast ones are consistent about seven things:

| Slow Teams Do | Fast Teams Do | Impact |
|---|---|---|
| Debate severity for 10+ minutes | Declare in 30 seconds: "This is SEV2, adjusting if needed" | Cuts coordination delay |
| One person tries to coordinate + debug | Split roles: Lead coordinates, Engineer fixes | Lower MTTR |
| Updates via DM or "hop on a call" | Updates in channel, pinned, on the severity cadence | Stops "any update?" pings |
| "Should be fixed in 5 min" (repeated) | "ETA unknown, investigating," then an actual ETA when known | Trust maintained |
| Escalate after 30 min of silence | Response timer by severity: no response → page backup → EM | Faster time to fix |
| Forget support team until postmortem | Notify support immediately: "Here's your script" | Support not overwhelmed |
| End with "cool, it's fixed" | Post resolution summary + assign postmortem owner | Learning captured |

Same engineers, different process.

The First 5 Minutes

Incidents live or die in the first 5 minutes. Declare fast, split roles, stabilize. The rest is details.

Step 1: Declare in 30 seconds

Post this in your incident channel:

🚨 Incident declared. Starting at SEV2 while we investigate.

Don't debate severity while production is burning.

An EM we interviewed put it bluntly: "We lost 15 minutes once arguing SEV1 vs SEV2. Meanwhile, customers couldn't check out. Just declare it. You can always downgrade later."

If anyone argues, say this:

"Let's start at SEV2. If it's worse, we escalate. If it's better, we downgrade. Arguing costs more time than fixing."

Step 2: Assign roles in 60 seconds

If no one steps up in 60 seconds, YOU do it.

👤 I'm Incident Lead. @bob is Assigned Engineer.

Or if someone else should lead:

👤 @alice is Incident Lead. I'll assist as needed.

Incident Lead coordinates. Assigned Engineer fixes. Split the work.

[!TIP]
If you don't pin the incident state immediately, you'll repeat yourself to every latecomer.

Step 3: Stabilize first, root cause later

Your goal is to restore service FIRST, understand SECOND. Every minute of downtime costs money and trust. Root cause analysis comes after customers are unblocked.

Use this priority list:

  1. Rollback - If you deployed recently, roll it back. Now.
  2. Failover - Switch to backup region, database, or cluster.
  3. Kill switch - Disable the failing feature. Stop the bleeding.
  4. Fix Forward - Only if rollback is riskier than a patch.

[!IMPORTANT]
Fix-forward is usually slower than rollback. If it's not trivial, prefer rollback.
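Under the hood this priority list is just an ordered fallback. Here's a minimal sketch of it as a helper function; the function and parameter names are hypothetical, not part of any real tooling:

```python
def choose_mitigation(recent_deploy: bool, has_failover: bool,
                      has_kill_switch: bool, fix_is_trivial: bool) -> str:
    """Pick the fastest stabilization option, in the playbook's priority order:
    rollback, failover, kill switch, and fix-forward only as a last resort."""
    if recent_deploy:
        return "rollback"      # undo the change that likely caused it
    if has_failover:
        return "failover"      # switch to backup region, database, or cluster
    if has_kill_switch:
        return "kill switch"   # disable the failing feature, stop the bleeding
    if fix_is_trivial:
        return "fix forward"   # only when a patch is genuinely trivial
    return "escalate"          # no quick option left: pull in more help
```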

Step 4: Set severity + start the response timer

Post this:

🔥 SEV2 - Checkout API errors, ~40% of transactions failing

| Severity | When to Use | Example | Page on-call? |
|---|---|---|---|
| SEV0 | All customers down, business not operating | Checkout completely broken, 0% transactions | YES, immediately |
| SEV1 | Major feature broken, significant impact | API down, 50%+ customers affected | YES, immediately |
| SEV2 | Partial outage, some customers affected | Degraded performance, ~20% affected | Yes, if ≥20% of requests fail for 10+ min or checkout/revenue is impacted |
| SEV3 | Minor issues, limited impact | Single feature broken, <5% affected | No |
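If you want the cheat sheet encoded somewhere (a runbook script, a Slack bot), the thresholds translate to a few lines of code. This is an illustrative sketch only; the function names and the exact cutoffs between levels are assumptions drawn from the table above:

```python
def classify_severity(pct_affected: float, business_down: bool) -> str:
    """Map customer impact to a severity level using the cheat-sheet thresholds.
    When in doubt the playbook says to start at SEV2 and adjust later."""
    if business_down or pct_affected >= 100:
        return "SEV0"   # all customers down, business not operating
    if pct_affected >= 50:
        return "SEV1"   # major feature broken, significant impact
    if pct_affected >= 5:
        return "SEV2"   # partial outage, some customers affected
    return "SEV3"       # minor issue, limited impact

def should_page(severity: str, pct_requests_failing: float = 0.0,
                minutes_failing: int = 0, revenue_impacted: bool = False) -> bool:
    """SEV0/1 always page; SEV2 pages only past the table's thresholds."""
    if severity in ("SEV0", "SEV1"):
        return True
    if severity == "SEV2":
        return revenue_impacted or (pct_requests_failing >= 20 and minutes_failing >= 10)
    return False
```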

Escalation rules (by severity)

| Severity | Time to backup | Time to EM (if IC + backup unresponsive) |
|---|---|---|
| SEV0/1 | 5 minutes | 10 minutes |
| SEV2 | 10 minutes | 30 minutes |
| SEV3+ | Handle async | Only if impact grows |

Use the timer. Don't hesitate. For more on escalation paths, see our on-call rotation guide.
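The escalation table is simple enough to automate with a timer. A minimal sketch, assuming hypothetical role names and the minute thresholds from the table above:

```python
# Minutes until paging backup / EM, per the escalation table above.
ESCALATION = {
    "SEV0": (5, 10),
    "SEV1": (5, 10),
    "SEV2": (10, 30),
}

def who_to_page(severity: str, minutes_unresponsive: int) -> str:
    """Return who to page when the Incident Lead has gone quiet."""
    if severity not in ESCALATION:
        return "handle async"          # SEV3+: no paging unless impact grows
    to_backup, to_em = ESCALATION[severity]
    if minutes_unresponsive >= to_em:
        return "engineering manager"
    if minutes_unresponsive >= to_backup:
        return "backup on-call"
    return "wait"
```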

Step 5: Create the incident channel

One place for updates. If you jump on a call, paste a 2–3 line summary back here.

Name it clearly: #inc-checkout-api-2026-01-07 or #incidents-123

Post this as your first message:

🚨 INCIDENT DECLARED

📊 Severity: SEV2
👤 Incident Lead: @alice
🔧 Assigned Engineer: @bob
📝 Status: Investigating high error rate on checkout API
🕐 Started: 2:47 AM

💬 Updates: SEV0 10m · SEV1 15m · SEV2 15–30m · SEV3 30–60m
📌 Latest update will be pinned here

Pin that message. Latecomers shouldn't have to scroll.
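If you script incident kickoff, the channel name and the pinned message can be generated so nobody has to type them at 3 AM. A sketch under stated assumptions; the function names are made up for illustration:

```python
from datetime import date

def incident_channel_name(service: str, day: date) -> str:
    """Build a channel name like #inc-checkout-api-2026-01-07."""
    return f"#inc-{service}-{day.isoformat()}"

def declaration_message(severity: str, lead: str, engineer: str,
                        status: str, started: str) -> str:
    """Render the pinned declaration message from the template above."""
    return (
        "🚨 INCIDENT DECLARED\n\n"
        f"📊 Severity: {severity}\n"
        f"👤 Incident Lead: @{lead}\n"
        f"🔧 Assigned Engineer: @{engineer}\n"
        f"📝 Status: {status}\n"
        f"🕐 Started: {started}\n\n"
        "💬 Updates: SEV0 10m · SEV1 15m · SEV2 15–30m · SEV3 30–60m\n"
        "📌 Latest update will be pinned here"
    )
```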

Not running the incident? Stay out of the way.

If you're not the Incident Lead or the Assigned Engineer:

Don't:

  • DM the assigned engineer asking for updates
  • Hop on a call uninvited
  • Offer unsolicited advice

Do:

  • Check the pinned message
  • Post relevant info in the channel (logs, context, recent changes)
  • Let them work

The most helpful thing you can do is not add noise.

Incident Response Roles: Who Does What

Clear roles stop two things: silence and duplicate work.

Incident Lead (also called Incident Commander)

Your job:

  • Keep updates flowing (SEV0: 10m, SEV1: 15m, SEV2: 15–30m, SEV3: 30–60m)
  • Ask "what do you need?" not "what's the fix?"
  • Make the call: rollback vs fix forward, escalate vs wait, add people vs stay focused
  • Run interference so the Assigned Engineer can work

Your job is NOT:

  • Debugging
  • Writing code
  • Fixing the problem

If you catch yourself debugging, say this:

"I'm Incident Lead, I shouldn't be debugging. @charlie, can you take over investigation? I'll coordinate."

Assigned Engineer

Your job:

  • Fix the problem
  • Post updates when you have them (Incident Lead will remind you)
  • Ask for what you need

Your job is NOT:

  • Explaining what you're doing every 3 minutes
  • Managing the channel
  • Coordinating other people

If people keep DMing you:

"I'm heads down fixing. Check the pinned message in #incidents-123. If you need something, ping @incident-lead."

Ops Lead (optional, SEV0/1 only)

Add if: 3+ services failing OR 2+ teams involved OR access/permissions blocking progress

Don't add if: Single service, single team incident with clear path forward

🛠️ Operations Lead here. Access issues? Permission problems? Coordination across teams? Ping me.

Comms Lead (optional, SEV0/1 only)

Add if: SEV0/SEV1 OR need public status page OR support team getting hammered

Don't add if: SEV3 or no customers impacted

📣 Comms Lead here. Working on support script + status page. Engineers: focus on fixing. I'll handle the "any ETA?" questions.

Scribe (optional, SEV0/1 only)

Job: Capture the timeline and key decisions for the postmortem. In high-stakes incidents, the Incident Lead is too busy to take notes.

Why split roles?

One person trying to coordinate AND debug? Both suffer.

A 50-person fintech company told us: "Splitting roles was the single biggest improvement to our MTTR. We used to have one person doing everything - coordinating, debugging, talking to support. Both suffered. Now we split it and incidents are way shorter."

[!TIP]
If nobody owns communication, customers assume the worst.

Incident Update Cadence by Severity

Post this cadence line when the incident starts:

⏱️ UPDATE CADENCE: SEV0 10m · SEV1 15m · SEV2 15–30m · SEV3 30–60m

Then format every update like this:

📍 Current: [1 line, what users see]
🔄 Next: [specific action you're taking]
⏱️ ETA: [time or "unknown"]
🚫 Blockers: [what's blocking, or "None"]

(Next update at: [time])

Every time you post an update, pin it. Remove the old pin.
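The cadence is easy to enforce with a reminder script. A small sketch, using the top of each severity's range as the hard deadline (an assumption; tighten it if you prefer):

```python
from datetime import datetime, timedelta

# Update cadence in minutes. SEV2/SEV3 use the top of their
# 15–30m / 30–60m ranges as the hard deadline.
CADENCE_MINUTES = {"SEV0": 10, "SEV1": 15, "SEV2": 30, "SEV3": 60}

def next_update_due(severity: str, last_update: datetime) -> datetime:
    """When the next pinned update is owed, per the cadence line above."""
    return last_update + timedelta(minutes=CADENCE_MINUTES[severity])
```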

Escalation: Use the Timer

Use the timer. Don't hesitate.

For "no response from IC": SEV0/1 → backup 5 min, EM 10 min. SEV2 → backup 10 min, EM 30 min. SEV3 → async.

For blocked decisions / multi-team: Page EM immediately.

If someone hesitates to escalate:

"This isn't about bothering people. It's about fixing the problem. If they're asleep and unresponsive, we need someone who isn't."

Incident Response Timeline Example

  • 02:13 — PagerDuty: high error rate in checkout-api
  • 02:14 — SEV1 declared in #incidents
  • 02:15 — @alice takes Incident Lead, @bob is Assigned Engineer
  • 02:16 — #inc-checkout-api-2026-01-07 created, incident state pinned
  • 02:21 — Rollback decision (recent deploy noticed)
  • 02:28 — Customer update posted + support script sent
  • 02:35 — Rollback complete, errors dropping
  • 02:41 — Stabilized, monitoring
  • 03:05 — Resolved, postmortem owner assigned

52 minutes total. The key wasn't brilliant debugging. It was clear roles, regular updates, fast rollback.

Customer & Support Communication During Incidents

Support messages need five things: the issue, customer impact, current status, any workaround, and the next update time.

SUPPORT SCRIPT:

Issue: We're investigating an issue affecting [service/feature]
Impact: [who is affected + what they can't do]
Status: [investigating / identified / mitigating / monitoring]
Workaround: [if any, otherwise "None at this time"]
Next update: [time] (we'll post again even if ETA is unknown)

Status page updates

| Severity | Post public status update? | What to say |
|---|---|---|
| SEV0 | YES, immediately | "We're investigating an issue affecting [service]. More details soon." |
| SEV1 | YES | "We're investigating degraded performance on [feature]." |
| SEV2 | Probably | If enough customers are impacted, post an update |
| SEV3 | No | Minor issues don't need public posts |

Status page progression:

  1. "We're investigating" → 2. "Identified the issue" → 3. "Fixing" → 4. "Resolved"
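Status page tools usually enforce this progression for you; if you're scripting your own updates, a tiny state machine keeps you from skipping or rewinding stages. Illustrative only:

```python
# Allowed forward transitions for public status updates.
TRANSITIONS = {
    "investigating": {"identified"},
    "identified": {"fixing"},
    "fixing": {"resolved"},
    "resolved": set(),
}

def advance_status(current: str, new: str) -> str:
    """Move the status page forward one stage; refuse to skip or rewind."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"can't go from {current!r} to {new!r}")
    return new
```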

Internal stakeholders

Management will ask for updates. Give them a summary, don't let them micromanage.

Post this in #incidents-leadership or DM your EM:

👔 LEADERSHIP UPDATE

Incident: [Brief description]
Severity: [SEV0/1/2/3]
Status: [What's happening]
Who's fixing: @assigned-engineer
ETA: [If known]
Need anything: [What you need from leadership, or "Nothing, just keeping you informed"]

If leadership starts micromanaging:

"I understand this is stressful. The best thing you can do is let the team focus. I'll post an update in 15 minutes."

Closing the Incident: Resolution & Postmortem

Without proper closure, you're just firefighting. With it, you have an actual incident process.

Resolution needs: What broke / Why / What fixed / Preventing recurrence + postmortem owner + due date

✅ RESOLUTION SUMMARY:

What broke: [system/component]
Customer impact: [who/what/how long]
Why it broke: [cause, or "unknown"]
What fixed it: [rollback/fix/flag/scale]
What we'll do to prevent it: [1–3 bullets]

📝 Postmortem owner: @name
⏰ Postmortem due: [date, local time]
📎 Links: [incident channel] [dashboards] [PRs] [status page]

Assign postmortem owner

NOT necessarily the Incident Lead. They're probably tired.

📝 POSTMORTEM

@bob — you're up. Postmortem due by end of next business day (local time).
Focus on: What happened, why it happened, how to prevent it.
Incident timeline is in the pinned message.

Use our post-incident review templates to make postmortems faster.

If anyone pushes back:

"No deadline = no postmortem. Even a rough draft is better than nothing. End of next business day. If you need help, ask."

Close the incident

🔚 INCIDENT CLOSED

Thanks everyone. Clearing roles.
Channel will be archived in 24 hours (or per policy).
Postmortem discussion will happen in #postmortem-api-outage-2026-01-07

Incident Response Anti-Patterns to Avoid

These patterns show up in almost every team we talk to.

Hero mode

One person trying to fix everything alone. "I've got this."

Problem: Burnout and slower resolution. One person at 3 AM after 4 hours misses things that two fresh people would catch.

If you see hero mode:

🛑 @hero-engineer — you've been at this for 3 hours. Take a break. @backup-1 is taking over investigation for the next hour.

Silent debugging

No updates for 45 minutes while people wonder what's happening.

Problem: Latecomers ask the same questions over and over. Stakeholders DM random engineers.

If you see silent debugging:

⏰ @assigned-engineer — haven't seen an update in 30 minutes. Can you post a status? Even "still investigating" helps.

Blame hunting

"Who deployed this?" "Who wrote this code?"

Problem: Kills psychological safety. People hide incidents next time. Problems get worse.

If you see blame hunting:

🛑 STOP.

We don't care who deployed this. We care about:
1. What broke
2. Why it broke
3. How to fix it
4. How to prevent it

Save the "who" for the postmortem, and even then focus on systems not people. This maintains a [blameless culture](/learn/blameless-postmortem) where people feel safe reporting issues.

Meeting while it's burning

"Hop on a Zoom call" before you even know what's broken.

Problem: 10 people staring at each other while 1 person types. 9 people could be doing something useful.

A war room meeting during active mitigation is usually a coordination failure. Investigate first. Figure out what's broken. Only call a meeting if you need rapid, multi-person back-and-forth.

Optimism bias

"Should be fixed in 5 minutes" - repeated every 5 minutes for an hour.

Problem: Repeated missed ETAs destroy trust.

Say this instead:

⏱️ ETA: Unknown. Investigating.

Quick Reference Checklist

FIRST 5 MINUTES:

  • Declare it: "This is an incident, SEV2"
  • Name Incident Lead: "I'm taking Incident Lead" or "@alice is Incident Lead"
  • Name Assigned Engineer: "@bob is Assigned"
  • Pick severity (use cheat sheet)
  • Create channel: #inc-service-date (e.g., #inc-checkout-api-2026-01-07)
  • Post template and pin it

DECISION TREE:

3+ services failing? → Add Ops Lead
2+ teams involved? → Add Ops Lead
SEV0/SEV1? → Add Comms Lead, page immediately
SEV2? → Updates every 15-30 min
SEV3? → Updates every 30-60 min
Missed update interval (SEV0/1)? → Page backup/EM
Missed update interval (SEV2)? → Check in
Stuck? → Say it early, page expert

UPDATE TEMPLATE:

📍 Current: [1 line]
🔄 Next: [specific action]
⏱️ ETA: [time or "unknown"]
🚫 Blockers: [what's blocking or "None"]

ESCALATION:

SEV0/1: 5 min → "@backup — you're up"
SEV0/1: 10 min → "@em — need escalation"
SEV2: 10 min → "@backup — you're up"
SEV2: 30 min → "@em — need escalation"

CLOSEOUT:

✅ What broke, why, what fixed it, preventing recurrence
📝 Postmortem owner + deadline
🔚 Close incident
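If you automate role assignment, the decision tree in the checklist collapses into one function. A sketch with hypothetical names:

```python
def incident_actions(severity: str, services_failing: int,
                     teams_involved: int) -> list[str]:
    """Turn the quick-reference decision tree into a checklist of actions."""
    actions = []
    if services_failing >= 3 or teams_involved >= 2:
        actions.append("add Ops Lead")
    if severity in ("SEV0", "SEV1"):
        actions += ["add Comms Lead", "page on-call immediately"]
    elif severity == "SEV2":
        actions.append("post updates every 15-30 min")
    else:
        actions.append("post updates every 30-60 min")
    return actions
```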

The Bottom Line

After talking to dozens of teams about their incidents, the same pattern keeps showing up: the teams that are good at this keep it simple.

Running a good incident isn't about frameworks. It's about five things:

  1. Declare fast: 30 seconds, not 10 minutes. You can always downgrade.
  2. Name roles: Incident Lead coordinates, Assigned Engineer fixes. Split the work.
  3. Update regularly: On the severity cadence, pinned. No silent debugging.
  4. Escalate when stuck: Use the response timer. Don't hero alone.
  5. Close properly: Resolution summary, postmortem owner, done.

The best teams don't over-engineer. They don't have 50-page runbooks. They have a simple, repeatable playbook.

Keep it simple.

FAQ

How long should an incident last?
Timelines vary. If a SEV2 is running >4 hours, reassess severity, staffing, and rollback/failover options.
Do we need a call for every incident?
No. Most incidents are better handled async in Slack. Calls make sense when you need rapid back-and-forth (usually SEV0/SEV1 with multiple teams).
When should we page people?
SEV0/SEV1 always. SEV2 only if ≥20% requests failing for 10+ min or checkout/revenue impacted, or if you're stuck. Otherwise handle async.
What if we can't find the root cause?
Write "unknown" in the postmortem and make investigation an action item. Honesty beats guessing.
Do we need a postmortem for every incident?
No. SEV3s might just need a short note. SEV0/SEV1 should always get a proper postmortem. SEV2s are a judgment call—did we learn anything?
What's the difference between Incident Lead and Assigned Engineer?
Incident Lead coordinates communication, makes decisions, and keeps the incident moving. Assigned Engineer fixes the problem. Split the work so the person debugging can focus without interruption.
How do I decide severity level?
Declare first, debate later. Start with SEV2 if you're unsure. You can always escalate or downgrade. Don't waste 10 minutes debating SEV1 vs SEV2 while production is broken.
What if someone refuses to be Incident Lead?
If no one steps up in 60 seconds, YOU do it. "I'm taking Incident Lead." Someone will likely speak up if they disagree. The cost of 30 seconds of wrong leadership is zero compared to 30 minutes of no leadership.
What should I tell customers during an incident?
Four things: what's broken, who's affected, what we're doing, and when the next update comes. Even if ETA is unknown, say "next update in 15 minutes" and follow through.
Should we use a war room or handle incidents in Slack?
Most incidents (SEV2-SEV3) are better handled async in Slack. Reserve war rooms/calls for SEV0-SEV1 incidents with multiple teams where rapid back-and-forth is essential.

Looking for Incident Response Automation?

We're building Runframe to automate this playbook in Slack: automatic on-call paging, structured incident channels, forced update cadence, and timeline capture—all without leaving Slack.

Join the waitlist


Related Articles

  • Build vs Buy Incident Management: 2026 Cost & Decision Framework (Feb 18, 2026). A defensible 2026 build vs buy framework for incident management: real TCO ranges, reliability gotchas, hybrid options, and a decision checklist.
  • Incident Communication: 8 Copy-Paste Templates for Status, Email & Execs (Feb 1, 2026). Stop writing updates at 2 AM. Copy-paste templates for status pages, emails, exec updates, and social posts. Plus cadence and ownership rules for SREs.
  • SLA vs. SLO vs. SLI: What Actually Matters (With Templates) (Jan 26, 2026). SLI = what you measure. SLO = your target. SLA = your promise. How to set realistic targets, use error budgets to prioritize, and avoid the 99.9% trap.
  • Runbook vs Playbook: The Difference That Confuses Everyone (Jan 24, 2026). Runbooks document technical execution. Playbooks document roles, escalation, and comms. When to use each, with copy-paste templates.
  • OpsGenie Shutdown 2027: The Complete Migration Guide (Jan 23, 2026). OpsGenie ends support April 2027. Real migration timelines, export guides, and pricing for 7 alternatives (PagerDuty, incident.io, Squadcast).
  • How to Reduce MTTR in 2026: The Coordination Framework (Jan 19, 2026). MTTR isn't just about debugging faster. Why coordination is the biggest lever for reducing incident duration for startups scaling from seed to Series C.
  • Incident Severity Matrix (SEV0-SEV4): Free Template & Generator (Jan 17, 2026). Stop arguing over SEV1 vs SEV2. Use our SEV0-SEV4 matrix and decision tree to standardize your incident classification and reduce alert fatigue.
  • Incident Management vs Incident Response: The Difference That Matters for MTTR & Recurrence (Jan 15, 2026). Why fast MTTR isn't enough to stop recurring fires, and how to build a long-term incident lifecycle.
  • 2026 State of Incident Management Report: Key Statistics & Benchmarks (Jan 10, 2026). Operational toil rose to 30% in 2025 despite AI. The latest data on burnout, alert fatigue, and why engineering teams are struggling to keep up.
  • On-Call Rotation Templates & The 2-Minute Handoff Guide (Jan 2, 2026). Move your on-call from a Google Sheet to a repeatable system, with templates for primary and backup rotations.
  • Post-Incident Review Templates: 3 Real-World Examples (Dec 29, 2025). Skip the 5-page docs nobody reads. Three ready-to-use postmortem templates and examples to drive real learning and stop recurring incidents.
  • Reducing Context Switching: The 10-Minute Incident Coordination Framework for Slack (Dec 22, 2025). Use our 10-minute framework to cut context switching and speed up MTTR during Slack-based incidents.
  • Scaling Incident Management: A Guide for Teams of 40-180 Engineers (Dec 15, 2025). The 4 stages of incident management for growing teams. Scale your SRE practices without the chaos.
