incident-response · incident-management · incident-lead

Incident Response Playbook: Scripts, Roles & Templates (Slack)

Slack incident response playbook: roles, severity levels, update cadence, escalation rules, and copy/paste templates for production outages. Based on real teams.

Runframe Team · Jan 7, 2026 · 13 min read


Your phone lights up at 2:47 AM. PagerDuty, Slack, maybe a phone call from your CEO. Something's broken. People are noticing. And you're the one who has to fix it.

We've talked to dozens of engineering teams about incidents. The thing that comes up over and over: the debugging isn't the hard part. It's the chaos that kills you.

Who's in charge? What do we tell customers? Why are 15 people asking for updates in DMs? Should we call a Zoom? Is this SEV1 or SEV2?

The outage is the easy part. The chaos is what makes incidents last 3 hours instead of 30 minutes.

What Is Incident Response?

Incident response isn't debugging. Debugging happens after.

Incident response is what happens the second after the alert fires:

  • Declaration: Announcing the incident and severity
  • Coordination: Assigning roles (Incident Lead, Assigned Engineer)
  • Investigation: Finding and fixing the root cause
  • Communication: Keeping stakeholders and customers informed
  • Resolution: Confirming the fix and documenting what happened

Goal: restore service fast, then prevent recurrence.

Most Teams Get This Wrong

We talked to a 40-person B2B SaaS company that got hit with a SEV0 at 3 AM. Database went down. Checkout completely broken.

Want to know what went wrong?

No one declared it. People started debugging in DMs. 45 minutes in, the CEO joined Slack and asked "is anyone working on this?"

The person debugging was also trying to coordinate. They were updating support, fielding questions from leadership, AND trying to debug. Both suffered.

They kept saying "fixed in 5 minutes" - repeated every 10 minutes for 2 hours. Trust evaporated.

The incident dragged on not because the engineering problem was hard, but because the coordination was broken.

Same team, next SEV0? They used a clear playbook. Resolved in 52 minutes. Same engineers, different process.

What Actually Works

In our conversations with engineering teams, the fast ones are consistent about seven things:

| Slow Teams Do | Fast Teams Do | Impact |
| --- | --- | --- |
| Debate severity for 10+ minutes | Declare in 30 seconds: "This is SEV2, adjusting if needed" | Cuts coordination delay |
| One person tries to coordinate + debug | Split roles: Lead coordinates, Engineer fixes | Lower MTTR |
| Updates via DM or "hop on a call" | Updates in channel, pinned, on the severity cadence | Stops "any update?" pings |
| "Should be fixed in 5 min" (repeated) | "ETA unknown, investigating", then actual ETA when known | Trust maintained |
| Escalate after 30 min of silence | Response timer by severity: no response → page backup → EM | Faster time to fix |
| Forget support team until postmortem | Notify support immediately: "Here's your script" | Support not overwhelmed |
| End with "cool, it's fixed" | Post resolution summary + assign postmortem owner | Learning captured |

Same engineers, different process.

The First 5 Minutes

Incidents live or die in the first 5 minutes. Declare fast, split roles, stabilize. The rest is details.

Step 1: Declare in 30 seconds

Post this in your incident channel:

🚨 Incident declared. Starting at SEV2 while we investigate.

Don't debate severity while production is burning.

An EM we interviewed put it bluntly: "We lost 15 minutes once arguing SEV1 vs SEV2. Meanwhile, customers couldn't check out. Just declare it. You can always downgrade later."

If anyone argues, say this:

"Let's start at SEV2. If it's worse, we escalate. If it's better, we downgrade. Arguing costs more time than fixing."

Step 2: Assign roles in 60 seconds

If no one steps up in 60 seconds, YOU do it.

πŸ‘€ I'm Incident Lead. @bob is Assigned Engineer.

Or if someone else should lead:

πŸ‘€ @alice is Incident Lead. I'll assist as needed.

Incident Lead coordinates. Assigned Engineer fixes. Split the work.

[!TIP]
If you don't pin the incident state immediately, you'll repeat yourself to every latecomer.

Step 3: Stabilize first, root cause later

Your goal is to restore service FIRST, understand SECOND. Every minute of downtime costs money and trust. Root cause analysis comes after customers are unblocked.

Use this priority list:

  1. Rollback - If you deployed recently, roll it back. Now.
  2. Failover - Switch to backup region, database, or cluster.
  3. Kill switch - Disable the failing feature. Stop the bleeding.
  4. Fix Forward - Only if rollback is riskier than a patch.

[!IMPORTANT]
Fix-forward is usually slower than rollback. If it's not trivial, prefer rollback.

Step 4: Set severity + start the response timer

Post this:

πŸ”₯ SEV2 - Checkout API errors, ~40% of transactions failing

| Severity | When to Use | Example | Page on-call? |
| --- | --- | --- | --- |
| SEV0 | All customers down, business not operating | Checkout completely broken, 0% transactions | YES, immediately |
| SEV1 | Major feature broken, significant impact | API down, 50%+ customers affected | YES, immediately |
| SEV2 | Partial outage, some customers affected | Degraded performance, ~20% affected | Yes, if ≥20% of requests failing for 10+ min or checkout/revenue impacted |
| SEV3 | Minor issues, limited impact | Single feature broken, <5% affected | No |

Escalation rules (by severity)

| Severity | Time to backup | Time to EM (if IC + backup unresponsive) |
| --- | --- | --- |
| SEV0/1 | 5 minutes | 10 minutes |
| SEV2 | 10 minutes | 30 minutes |
| SEV3+ | Handle async | Only if impact grows |

Use the timer. Don't hesitate.
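If you'd rather have tooling enforce this policy than memory, the whole thing fits in a small data structure. Here's a minimal sketch in Python; the thresholds mirror the tables above, and the names (`SEVERITY_POLICY`, `should_page_oncall`) are ours for illustration, not any standard API, so adapt them to your own alerting setup.

```python
# Severity policy as data: update cadence (minutes), whether to page on-call
# immediately, and the escalation timer (minutes to backup / minutes to EM).
# Values mirror the tables above; tune them for your team.
SEVERITY_POLICY = {
    "SEV0": {"update_every_min": 10, "page_oncall": True,  "backup_after_min": 5,    "em_after_min": 10},
    "SEV1": {"update_every_min": 15, "page_oncall": True,  "backup_after_min": 5,    "em_after_min": 10},
    "SEV2": {"update_every_min": 30, "page_oncall": False, "backup_after_min": 10,   "em_after_min": 30},
    "SEV3": {"update_every_min": 60, "page_oncall": False, "backup_after_min": None, "em_after_min": None},
}

def should_page_oncall(severity: str, pct_requests_failing: float = 0.0,
                       failing_minutes: int = 0, revenue_impacted: bool = False) -> bool:
    """Apply the paging rules above: SEV0/1 always page; SEV2 pages only if
    >=20% of requests have been failing for 10+ minutes or revenue is hit."""
    if SEVERITY_POLICY[severity]["page_oncall"]:
        return True
    if severity == "SEV2":
        return (pct_requests_failing >= 20 and failing_minutes >= 10) or revenue_impacted
    return False
```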

Step 5: Create the incident channel

One place for updates. If you jump on a call, paste a 2–3 line summary back into the channel.

Name it clearly: #inc-checkout-api-2026-01-07 or #incidents-123

Post this as your first message:

🚨 INCIDENT DECLARED

πŸ“Š Severity: SEV2
πŸ‘€ Incident Lead: @alice
πŸ”§ Assigned Engineer: @bob
πŸ“ Status: Investigating high error rate on checkout API
πŸ• Started: 2:47 AM

πŸ’¬ Updates: SEV0 10m Β· SEV1 15m Β· SEV2 15–30m Β· SEV3 30–60m
πŸ“Œ Latest update will be pinned here

Pin that message. Latecomers shouldn't have to scroll.
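If you'd rather script this step than copy-paste at 3 AM, the Slack Web API covers every piece: conversations.create, chat.postMessage, and pins.add. A minimal sketch with the slack_sdk Python package follows; the channel name, user IDs, and incident details are placeholders.

```python
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

# 1. Create the incident channel (name format: inc-<service>-<date>).
channel = client.conversations_create(name="inc-checkout-api-2026-01-07")
channel_id = channel["channel"]["id"]

# 2. Post the declaration message with severity, roles, status, and cadence.
declaration = (
    ":rotating_light: INCIDENT DECLARED\n\n"
    ":bar_chart: Severity: SEV2\n"
    ":bust_in_silhouette: Incident Lead: <@U_ALICE>\n"   # placeholder user IDs
    ":wrench: Assigned Engineer: <@U_BOB>\n"
    ":memo: Status: Investigating high error rate on checkout API\n\n"
    ":speech_balloon: Updates: SEV0 10m / SEV1 15m / SEV2 15-30m / SEV3 30-60m"
)
msg = client.chat_postMessage(channel=channel_id, text=declaration)

# 3. Pin it so latecomers see the current state without scrolling.
client.pins_add(channel=channel_id, timestamp=msg["ts"])
```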

Not running the incident? Stay out of the way.

Not Incident Lead or Assigned Engineer? Stay out of the way.

Don't:

  • DM the assigned engineer asking for updates
  • Hop on a call uninvited
  • Offer unsolicited advice

Do:

  • Check the pinned message
  • Post relevant info in the channel (logs, context, recent changes)
  • Let them work

The most helpful thing you can do is not add noise.

Roles: Who Does What

Clear roles stop two things: silence and duplicate work.

Incident Lead

Your job:

  • Keep updates flowing (SEV0: 10m, SEV1: 15m, SEV2: 15–30m, SEV3: 30–60m)
  • Ask "what do you need?" not "what's the fix?"
  • Make the call: rollback vs fix forward, escalate vs wait, add people vs stay focused
  • Run interference so the Assigned Engineer can work

Your job is NOT:

  • Debugging
  • Writing code
  • Fixing the problem

If you catch yourself debugging, say this:

"I'm Incident Lead, I shouldn't be debugging. @charlie, can you take over investigation? I'll coordinate."

Assigned Engineer

Your job:

  • Fix the problem
  • Post updates when you have them (Incident Lead will remind you)
  • Ask for what you need

Your job is NOT:

  • Explaining what you're doing every 3 minutes
  • Managing the channel
  • Coordinating other people

If people keep DMing you:

"I'm heads down fixing. Check the pinned message in #incidents-123. If you need something, ping @incident-lead."

Ops Lead (optional, SEV0/1 only)

Add if: 3+ services failing OR 2+ teams involved OR access/permissions blocking progress

Don't add if: Single service, single team incident with clear path forward

πŸ› οΈ Operations Lead here. Access issues? Permission problems? Coordination across teams? Ping me.

Comms Lead (optional, SEV0/1 only)

Add if: SEV0/SEV1 OR need public status page OR support team getting hammered

Don't add if: SEV3 or no customers impacted

πŸ“£ Comms Lead here. Working on support script + status page. Engineers: focus on fixing. I'll handle the "any ETA?" questions.

Scribe (optional)

Job: Capture the timeline and key decisions for the postmortem. In high-stakes incidents, the Incident Lead is too busy to take notes.

Why split roles?

One person trying to coordinate AND debug? Both suffer.

A 50-person fintech company told us: "Splitting roles was the single biggest improvement to our MTTR. We used to have one person doing everything - coordinating, debugging, talking to support. Both suffered. Now we split it and incidents are way shorter."

[!TIP]
If nobody owns communication, customers assume the worst.

Update Cadence

Post this cadence line at the top of the channel: ⏱️ UPDATE CADENCE: SEV0 10m · SEV1 15m · SEV2 15–30m · SEV3 30–60m

Then every update follows this template:

πŸ“ Current: [1 line, what users see]
πŸ”„ Next: [specific action you're taking]
⏱️ ETA: [time or "unknown"]
🚫 Blockers: [what's blocking, or "None"]

(Next update at: [time])

Every time you post an update, pin it. Remove the old pin.
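If a bot posts the cadence updates for you, the pin swap is two calls: pins.remove for the old update, pins.add for the new one. A rough sketch using slack_sdk; the helper name and arguments are illustrative, not a library API.

```python
from slack_sdk import WebClient

def post_update(client: WebClient, channel_id: str, update_text: str,
                previous_update_ts: str | None = None) -> str:
    """Post a status update, pin it, and unpin the previous update.
    Returns the new message timestamp to pass back in on the next update."""
    msg = client.chat_postMessage(channel=channel_id, text=update_text)
    if previous_update_ts:
        client.pins_remove(channel=channel_id, timestamp=previous_update_ts)
    client.pins_add(channel=channel_id, timestamp=msg["ts"])
    return msg["ts"]
```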

Escalation: Use the Timer

Use the timer. Don't hesitate.

For "no response from IC": SEV0/1 β†’ backup 5 min, EM 10 min. SEV2 β†’ backup 10 min, EM 30 min. SEV3 β†’ async.

For blocked decisions / multi-team: Page EM immediately.

If someone hesitates to escalate:

"This isn't about bothering people. It's about fixing the problem. If they're asleep and unresponsive, we need someone who isn't."

Example Timeline

  • 02:13 β€” PagerDuty: high error rate in checkout-api
  • 02:14 β€” SEV1 declared in #incidents
  • 02:15 β€” @alice takes Incident Lead, @bob is Assigned Engineer
  • 02:16 β€” #inc-checkout-api-2026-01-07 created, incident state pinned
  • 02:21 β€” Rollback decision (recent deploy noticed)
  • 02:28 β€” Customer update posted + support script sent
  • 02:35 β€” Rollback complete, errors dropping
  • 02:41 β€” Stabilized, monitoring
  • 03:05 β€” Resolved, postmortem owner assigned

52 minutes total. The key wasn't brilliant debugging. It was clear roles, regular updates, fast rollback.

Customer & Support Comms

Support messages need four things: the issue, customer impact, current status, and the next update time.

SUPPORT SCRIPT:

Issue: We're investigating an issue affecting [service/feature]
Impact: [who is affected + what they can't do]
Status: [investigating / identified / mitigating / monitoring]
Workaround: [if any, otherwise "None at this time"]
Next update: [time] (we'll post again even if ETA is unknown)
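If you generate this from structured fields instead of editing a doc under pressure, a tiny formatter keeps the required pieces from getting dropped. An illustrative sketch; the field names mirror the template above.

```python
def support_script(issue: str, impact: str, status: str,
                   next_update: str, workaround: str | None = None) -> str:
    """Render the support-script fields into a paste-ready message."""
    return (
        f"Issue: We're investigating an issue affecting {issue}\n"
        f"Impact: {impact}\n"
        f"Status: {status}\n"
        f"Workaround: {workaround or 'None at this time'}\n"
        f"Next update: {next_update} (we'll post again even if ETA is unknown)"
    )

# Example
print(support_script(
    issue="checkout",
    impact="~40% of checkout attempts fail with a 500 error",
    status="identified",
    next_update="03:15 local",
))
```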

Status page updates

| Severity | Post public status update? | What to say |
| --- | --- | --- |
| SEV0 | YES, immediately | "We're investigating an issue affecting [service]. More details soon." |
| SEV1 | YES | "We're investigating degraded performance on [feature]." |
| SEV2 | Probably | If enough customers are impacted, post an update |
| SEV3 | No | Minor issues don't need public posts |

Status page progression:

  1. "We're investigating" β†’ 2. "Identified the issue" β†’ 3. "Fixing" β†’ 4. "Resolved"

Internal stakeholders

Management will ask for updates. Give them a summary, don't let them micromanage.

Post this in #incidents-leadership or DM your EM:

πŸ‘” LEADERSHIP UPDATE

Incident: [Brief description]
Severity: [SEV0/1/2/3]
Status: [What's happening]
Who's fixing: @assigned-engineer
ETA: [If known]
Need anything: [What you need from leadership, or "Nothing, just keeping you informed"]

If leadership starts micromanaging:

"I understand this is stressful. The best thing you can do is let the team focus. I'll post an update in 15 minutes."

Closing the Incident

Without proper closure, you're just firefighting. With it, you have an actual incident process.

Resolution needs: what broke / why / what fixed it / how you'll prevent recurrence, plus a postmortem owner and due date.

βœ… RESOLUTION SUMMARY:

What broke: [system/component]
Customer impact: [who/what/how long]
Why it broke: [cause, or "unknown"]
What fixed it: [rollback/fix/flag/scale]
What we'll do to prevent it: [1–3 bullets]

πŸ“ Postmortem owner: @name
⏰ Postmortem due: [date, local time]
πŸ“Ž Links: [incident channel] [dashboards] [PRs] [status page]

Assign postmortem owner

NOT necessarily the Incident Lead. They're probably tired.

πŸ“ POSTMORTEM

@bob β€” you're up. Postmortem due by end of next business day (local time).
Focus on: What happened, why it happened, how to prevent it.
Incident timeline is in the pinned message.

If anyone pushes back:

"No deadline = no postmortem. Even a rough draft is better than nothing. End of next business day. If you need help, ask."

Close the incident

πŸ”š INCIDENT CLOSED

Thanks everyone. Clearing roles.
Channel will be archived in 24 hours (or per policy).
Postmortem discussion will happen in #postmortem-api-outage-2026-01-07

Anti-Patterns We Hear About

These patterns show up in almost every team we talk to.

Hero mode

One person trying to fix everything alone. "I've got this."

Problem: Burnout and slower resolution. One person at 3 AM after 4 hours misses things that two fresh people would catch.

If you see hero mode:

πŸ›‘ @hero-engineer β€” you've been at this for 3 hours. Take a break. @backup-1 is taking over investigation for the next hour.

Silent debugging

No updates for 45 minutes while people wonder what's happening.

Problem: Latecomers ask the same questions over and over. Stakeholders DM random engineers.

If you see silent debugging:

⏰ @assigned-engineer β€” haven't seen an update in 30 minutes. Can you post a status? Even "still investigating" helps.

Blame hunting

"Who deployed this?" "Who wrote this code?"

Problem: Kills psychological safety. People hide incidents next time. Problems get worse.

If you see blame hunting:

πŸ›‘ STOP.

We don't care who deployed this. We care about:
1. What broke
2. Why it broke
3. How to fix it
4. How to prevent it

Save the "who" for the postmortem, and even then focus on systems not people.

Meeting while it's burning

"Hop on a Zoom call" before you even know what's broken.

Problem: 10 people staring at each other while 1 person types. 9 people could be doing something useful.

A meeting during active mitigation is usually a coordination failure. Investigate first. Figure out what's broken. Only call a meeting if you need rapid, multi-person back-and-forth.

Optimism bias

"Should be fixed in 5 minutes" - repeated every 5 minutes for an hour.

Problem: Repeated missed ETAs destroy trust.

Say this instead:

⏱️ ETA: Unknown. Investigating.

Quick Reference

FIRST 5 MINUTES:

  • Declare it: "This is an incident, SEV2"
  • Name Incident Lead: "I'm taking Incident Lead" or "@alice is Incident Lead"
  • Name Assigned Engineer: "@bob is Assigned"
  • Pick severity (use cheat sheet)
  • Create channel: #incidents-name-date
  • Post template and pin it

DECISION TREE:

3+ services failing? β†’ Add Ops Lead
2+ teams involved? β†’ Add Ops Lead
SEV0/SEV1? β†’ Add Comms Lead, page immediately
SEV2? β†’ Updates every 15-30 min
SEV3? β†’ Updates every 30-60 min
Missed update interval (SEV0/1)? β†’ Page backup/EM
Missed update interval (SEV2)? β†’ Check in
Stuck? β†’ Say it early, page expert

UPDATE TEMPLATE:

πŸ“ Current: [1 line]
πŸ”„ Next: [specific action]
⏱️ ETA: [time or "unknown"]
🚫 Blockers: [what's blocking or "None"]

ESCALATION:

SEV0/1: 5 min β†’ "@backup β€” you're up"
SEV0/1: 10 min β†’ "@em β€” need escalation"
SEV2: 10 min β†’ "@backup β€” you're up"
SEV2: 30 min β†’ "@em β€” need escalation"

CLOSEOUT:

βœ… What broke, why, what fixed it, preventing recurrence
πŸ“ Postmortem owner + deadline
πŸ”š Close incident

The Bottom Line

After talking to dozens of teams about their incidents, the same pattern keeps showing up: the teams that are good at this keep it simple.

Running a good incident isn't about frameworks. It's about five things:

  1. Declare fast: 30 seconds, not 10 minutes. You can always downgrade.
  2. Name roles: Incident Lead coordinates, Assigned Engineer fixes. Split the work.
  3. Update regularly: On the severity cadence, pinned. No silent debugging.
  4. Escalate when stuck: Use the response timer. Don't hero alone.
  5. Close properly: Resolution summary, postmortem owner, done.

The best teams don't over-engineer. They don't have 50-page runbooks. They have a simple, repeatable playbook.

Keep it simple.

Why We're Building Runframe

We're building Runframe because incident response shouldn't live in Google Docs and scattered Slack threads. Runframe replaces manual copy-pasting with a dedicated Slack workflow: it pages the right on-call engineers, spins up a channel, and forces structured updatesβ€”all without leaving Slack.

Want to automate this playbook? Join the waitlist.

FAQ

How long should an incident last?
Timelines vary. If a SEV2 is running >4 hours, reassess severity, staffing, and rollback/failover options.

Do we need a call for every incident?
No. Most incidents are better handled async in Slack. Calls make sense when you need rapid back-and-forth (usually SEV0/SEV1 with multiple teams).

When should we page people?
SEV0/SEV1 always. SEV2 only if β‰₯20% requests failing for 10+ min or checkout/revenue impacted, or if you're stuck. Otherwise handle async.

What if we can't find the root cause?
Write "unknown" in the postmortem and make investigation an action item. Honesty beats guessing.

Do we need a postmortem for every incident?
No. SEV3s might just need a short note. SEV0/SEV1 should always get a proper postmortem. SEV2s are a judgment callβ€”did we learn anything?

