
Incident management for early-stage engineering teams

How to set up incident management for early-stage engineering teams. Severity levels, on-call, escalation, and postmortems in the right order. Defaults that work from 15 to 100 engineers.

Niketa Sharma · Mar 24, 2026 · 10 min read

At 15 engineers, you can get away without formal incident management. Someone posts in #engineering, the right person sees it, they fix it. It works until it doesn't.

Then you hit 20. Maybe 30. Someone pages the entire team at 2 AM because a staging dashboard loaded slowly. The last real incident took 45 minutes before anyone figured out who should even be looking at it.

That's the inflection point. Not when things break (things always break) but when the coordination around the break starts costing more than the break itself. Some teams hit it at 15 people. Most feel it by 30.

This is the setup guide for that moment. What to set up, in what order, with opinionated defaults that work whether you're 15 engineers or 100.

TL;DR: Start with three severity levels (SEV1-3), set up weekly on-call with primary + backup, create a dedicated Slack channel per incident, wire automatic multi-channel escalation with a 5-minute timeout, and do one-page blameless postmortems within 48 hours. Skip everything else until one of these breaks.


Start with three severity levels, not five

You need enough levels to make decisions, not so many that you start arguments about classification.

SEV1: Customers can't use the product. Revenue is affected. Drop everything.

SEV2: Something is degraded and customers notice, but there's a workaround. Painful, but not down.

SEV3: Minor or internal. Fix it during business hours.

Three levels. You can add SEV0 (apocalypse scenario) later when you have 50+ engineers and genuinely need a level above "drop everything." You can add SEV4 (proactive work) when you have enough incident volume to categorize prevention separately.

The mistake teams make is copying Google's severity framework on day one. They end up with five levels nobody can distinguish and spend the first 10 minutes of every incident arguing about whether it's a SEV2 or a SEV3.

When in doubt, classify higher. A SEV1 that turns out to be a SEV2 wastes some attention. A SEV2 that was actually a SEV1 wastes customer trust.

Use the severity level to decide two things: who gets paged, and how fast you need to respond. Everything else is overhead at this stage.
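Those two decisions are small enough to encode directly. A minimal sketch, with illustrative policy values (the names and numbers here are assumptions, not a standard):

```python
from enum import Enum

class Severity(Enum):
    SEV1 = 1  # customers can't use the product; drop everything
    SEV2 = 2  # degraded with a workaround; painful, not down
    SEV3 = 3  # minor or internal; fix during business hours

# Severity decides exactly two things: who gets paged, how fast to respond.
# These defaults are illustrative; tune them to your SLAs.
POLICY = {
    Severity.SEV1: {"page": "primary+backup", "respond_within_min": 5},
    Severity.SEV2: {"page": "primary", "respond_within_min": 30},
    Severity.SEV3: {"page": "nobody", "respond_within_min": None},  # ticket, not a page
}

def policy_for(severity: Severity) -> dict:
    """Look up who gets paged and how fast for a given severity."""
    return POLICY[severity]
```

If a decision doesn't fit in a table this small, it probably doesn't belong in your severity framework yet.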

Related: Incident severity levels: SEV0-SEV4 matrix | Severity matrix generator

Put someone on-call before you need to

The worst time to figure out who's responsible is during an incident.

Most teams wait until after a bad incident to set up on-call. Then they scramble to build a rotation while half the team is still stressed about the last outage. Do it before you need it.

Start simple

Weekly rotation. Primary + backup. That's the minimum.

Primary is the person who gets paged first. Backup is the person who gets paged if primary doesn't respond. Without a backup, a single person in the shower or on a flight means nobody responds for 30 minutes.

Weekly works for most teams. Daily rotations are exhausting; nobody gets into a rhythm. Monthly rotations are too long; the on-call person burns out by week three and starts ignoring alerts.
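A weekly primary + backup rotation is simple enough to compute from a list. This sketch makes next week's primary this week's backup, which gives natural handoff overlap; that's one reasonable convention, not the only one:

```python
from datetime import date, timedelta

def on_call_for(engineers: list[str], day: date,
                epoch: date = date(2026, 1, 5)) -> tuple[str, str]:
    """Return (primary, backup) for a given day under a weekly rotation.

    Assumes at least two engineers. `epoch` is any Monday used as week zero;
    backup is next week's primary, so each person warms up before going primary.
    """
    week = (day - epoch).days // 7
    primary = engineers[week % len(engineers)]
    backup = engineers[(week + 1) % len(engineers)]
    return primary, backup
```

With four engineers, each person is primary one week in four, which is also the minimum rotation size the FAQ below recommends.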

Cover business hours first

If your customers are mostly in one timezone, start with business-hours on-call. You don't need 24/7 coverage on day one. Add it when your customer base or your SLAs demand it.

Acknowledge the burden

On-call is work. Engineers who carry pagers outside working hours deserve recognition. Some teams pay $200-500/week. Others give comp time. The specific mechanism matters less than the acknowledgment that being on-call is a real cost.

Treat on-call as free and the good engineers leave. It doesn't take long.

Related: On-call rotation guide | On-call schedule builder

One channel per incident

Slack is where your team already works. Use it.

When an incident fires, create a dedicated channel for it. Not a thread in #engineering. Not a DM group. A channel named something obvious, like inc-42-checkout-api-down, where everything about this incident happens. The first responder creates it, from a standard name format, so there's no ambiguity about where to go.
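A trivial helper can enforce the naming convention so nobody improvises it under pressure. This sketch assumes Slack's channel-name rules (lowercase, 80-character limit, letters, digits, hyphens, and underscores only):

```python
import re

def incident_channel_name(incident_id: int, summary: str) -> str:
    """Build a channel name like 'inc-42-checkout-api-down'.

    Slugifies the summary to fit Slack's channel-name constraints:
    lowercase, no spaces or punctuation, at most 80 characters.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return f"inc-{incident_id}-{slug}"[:80]
```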

Why this matters

Without a dedicated channel, updates scatter across DMs, threads, and the wrong channels. Someone asks "what's the latest?" and three people answer with three different versions. The CEO finds a 20-minute-old message and panics.

With one, there's one place to look. Status updates, debugging notes, decisions, all in the same channel. If the update isn't in the incident channel, it didn't happen.

How it works in practice

Alert fires, incident channel gets created, responders get pulled in. All updates go there. When it's resolved, archive the channel.

Keep the channel public. Leadership will check it during a SEV1 whether you invite them or not. Better they read a clean timeline than ping engineers for updates mid-debug.

Related: Slack incident management: what works and what breaks

Escalation is not optional

This is where most DIY setups fail. They page once and hope.

The failure mode looks like this: an alert fires at 2 AM. The on-call engineer's phone is on silent. Or they're sick. Or they looked at the notification and fell back asleep. Nobody else knows. Twenty minutes later, customers are complaining on Twitter and your CEO is texting the CTO asking what's happening.

Automatic, not manual

If the on-call person doesn't acknowledge within 5 minutes, escalate. Automatically. Don't rely on someone noticing and manually paging the backup. At 2 AM, nobody is watching.

What you want is an escalation chain where each step gets harder to ignore:

  1. 0 min: Slack DM + push notification to primary on-call
  2. 2-5 min: SMS and voice call to primary if still unacknowledged
  3. 5 min: Page the backup on-call, all channels
  4. If neither responds: Escalate to engineering manager

Notice each step uses a more interruptive channel than the last. If your escalation sends another Slack message to someone who already missed the first one, you haven't escalated. You've just been louder in the same room.
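The chain above can be sketched as a small loop. `notify` and `acknowledged` are hypothetical hooks into whatever paging system you use, and the delays are illustrative approximations of the timeline above:

```python
import time

# Each step: (delay in seconds after the previous step, who, channels).
# Cumulative timing roughly matches the chain above; tune to taste.
ESCALATION_CHAIN = [
    (0,   "primary",     ["slack_dm", "push"]),
    (180, "primary",     ["sms", "voice"]),                      # ~3 min in
    (120, "backup",      ["slack_dm", "push", "sms", "voice"]),  # ~5 min in
    (300, "eng_manager", ["sms", "voice"]),                      # ~10 min in
]

def run_escalation(notify, acknowledged, sleep=time.sleep) -> bool:
    """Walk the chain until someone acknowledges.

    `notify(target, channel)` sends a page; `acknowledged()` reports whether
    any responder has acked. Both are hypothetical hooks into your paging
    stack. Returns True on ack, False if the whole chain runs dry.
    """
    for delay_s, target, channels in ESCALATION_CHAIN:
        sleep(delay_s)
        if acknowledged():
            return True
        for channel in channels:
            notify(target, channel)
    sleep(300)  # give the last step its window before giving up
    return acknowledged()
```

The point of the structure: escalation is data, not heroics. Nobody has to be awake for the chain to advance, and each step reaches for a channel the previous one didn't.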

Postmortems that people actually read

One page

Keep the postmortem to one page. Nobody reads the five-page ones, so they're worse than useless. They consume time to write and teach nothing because nobody opens them.

Answer three questions:

  1. What happened? Timeline. What broke, when, what was the impact.
  2. Why did it happen? Root cause. Not "the server crashed" but why the server crashed and why you didn't catch it earlier.
  3. What are we changing? 1-3 specific action items with owners and deadlines.

If you need more detail for a major incident, add an appendix. But the core document that people read should fit on one page.

48-hour rule

If the postmortem isn't written within 48 hours, it won't get written. Details fade, people move on, the next sprint starts and nobody circles back.

Assign an owner immediately after the incident resolves. Not "the team," a specific person with a specific deadline.

Blameless is not optional

The first time someone gets called out in a postmortem, nobody writes honest ones again. Engineers will sanitize everything. The postmortem becomes theater, a document that exists to prove you did a postmortem, not to prevent the next incident.

Focus on systems, not people. "The deploy went out without a canary" not "Alex deployed without checking."

Not every incident needs one. SEV1: always. SEV2: judgment call, did you learn something? SEV3: a brief note in the incident timeline is enough.

Related: Post-incident review templates (3 ready-to-use)

What to skip (for now)

The biggest risk at this stage isn't missing a feature. It's overbuilding process that nobody follows.

Runbooks and playbooks can wait. You don't have enough incident patterns yet. After you've seen the same type of incident three times, write a runbook for it. Before that, you're writing fiction.

Don't bother with workflow automation either. Do the process manually for 20 incidents first. You'll learn what actually needs automating versus what you assumed would.

SLOs and error budgets? At 30 engineers, you already know your service is unreliable. You don't need a dashboard to confirm it. Unless you're selling to enterprise or running infra-heavy systems, in which case basic SLO thinking earlier doesn't hurt. But formal error budgets can wait until 100+ engineers when you need to make real tradeoffs between reliability and shipping.

For most B2B teams at this stage, reliable escalation matters more than a status page. If your customers expect proactive comms, use a hosted service. Don't build one.

And skip incident analytics for now. MTTR dashboards are meaningless if your escalation doesn't work and your postmortems aren't happening. Fix the process first.

Incident management setup checklist

Set these up in this order:

  1. Three severity levels (SEV1, SEV2, SEV3). Classify fast, default higher.
  2. On-call rotation with primary + backup, weekly. Acknowledge the burden.
  3. One dedicated Slack channel per incident, kept public.
  4. Automatic escalation across multiple channels. 5-minute timeout before it moves up.
  5. One-page postmortems within 48 hours. Blameless. Specific owner.

Skip everything else until one of these breaks.

The goal is making the next incident less chaotic than the last one. Run these for a few months and you'll know what needs to change, because you'll have real incidents telling you.

If you want this setup without building it yourself, Runframe handles severity levels, on-call scheduling, multi-channel escalation, and postmortems out of the box. Free to start.

Once your process is running, read how teams scale incident management past 50 engineers for what comes next.

Common questions

When does a team need formal incident management?
When coordination during incidents starts costing more time than the incident itself. For most teams, that's somewhere between 20 and 40 engineers. If two people debugged the same thing independently, or if leadership asked for updates that nobody could provide, you're there.
How many people should be on an on-call rotation?
Minimum four for a weekly rotation, so each person is on-call one week per month. Fewer than that and burnout becomes real. If you only have two or three people who can respond, start with business-hours-only and staff up.
Do we need a tool or can we use Slack?
Slack handles coordination well. It doesn't handle paging, escalation, on-call scheduling, or audit trails. Most teams outgrow pure-Slack incident management around 20-25 engineers, sometimes earlier if you have enterprise customers or SLA commitments. At that point, you need something that pages people reliably through multiple channels, tracks who's on-call, and escalates automatically when nobody responds. That's the gap Runframe is built for.
How often should we do postmortems?
Every SEV1 gets a postmortem. SEV2 gets one if you learned something or if it affected customers. SEV3 doesn't need a formal postmortem, a note in the incident timeline is fine. Don't postmortem everything or the team will burn out on process.
Should we build our own incident management tooling?
At 20-50 engineers, almost certainly not. The cost of building and maintaining incident tooling (Slack bots, paging logic, escalation chains, on-call scheduling) adds up faster than a subscription. We broke down the real costs in our build, open source, or buy guide.
