At 15 engineers, you can get away without formal incident management. Someone posts in #engineering, the right person sees it, they fix it. It works until it doesn't.
Then you hit 20. Maybe 30. Someone pages the entire team at 2 AM because a staging dashboard loaded slowly. The last real incident took 45 minutes before anyone figured out who should even be looking at it.
That's the inflection point. Not when things break (things always break) but when the coordination around the break starts costing more than the break itself. Some teams hit it at 15 people. Most feel it by 30.
This is the setup guide for that moment. What to set up, in what order, with opinionated defaults that work whether you're 15 engineers or 100.
TL;DR: Start with three severity levels (SEV1-3), set up weekly on-call with primary + backup, create a dedicated Slack channel per incident, wire automatic multi-channel escalation with a 5-minute timeout, and do one-page blameless postmortems within 48 hours. Skip everything else until one of these breaks.
What you'll set up
- Three severity levels: enough to triage, not enough to argue about
- On-call rotation: primary + backup, weekly, with real escalation
- Incident channels: a dedicated Slack channel per incident
- Escalation that works: multi-channel, automatic, no gaps
- Short postmortems: one page, 48 hours, blameless
- What to skip: the stuff that doesn't matter yet
Start with three severity levels, not five
You need enough levels to make decisions, not so many that you start arguments about classification.
SEV1: Customers can't use the product. Revenue is affected. Drop everything.
SEV2: Something is degraded and customers notice, but there's a workaround. Painful, but not down.
SEV3: Minor or internal. Fix it during business hours.
Three levels. You can add SEV0 (apocalypse scenario) later when you have 50+ engineers and genuinely need a level above "drop everything." You can add SEV4 (proactive work) when you have enough incident volume to categorize prevention separately.
The mistake teams make is copying Google's severity framework on day one. They end up with five levels nobody can distinguish and spend the first 10 minutes of every incident arguing about whether it's a SEV2 or a SEV3.
When in doubt, classify higher. A SEV1 that turns out to be a SEV2 wastes some attention. A SEV2 that was actually a SEV1 wastes customer trust.
Use the severity level to decide two things: who gets paged, and how fast you need to respond. Everything else is overhead at this stage.
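To make that concrete, here's a minimal sketch of severity as configuration rather than a framework: each level maps to who gets paged and how fast someone has to acknowledge. The paging targets and response times below are illustrative defaults, not prescriptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    description: str
    page: tuple[str, ...]     # who gets paged
    respond_within_min: int   # how fast someone must acknowledge

SEVERITIES = {
    "SEV1": Severity("Customers can't use the product", ("primary-oncall", "backup-oncall"), 5),
    "SEV2": Severity("Degraded, but a workaround exists", ("primary-oncall",), 30),
    "SEV3": Severity("Minor or internal", (), 8 * 60),  # business hours, no page
}

def pick(candidates: list[str]) -> str:
    """When in doubt, classify higher ("SEV1" sorts before "SEV3")."""
    return min(candidates)
```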
Related: Incident severity levels: SEV0-SEV4 matrix | Severity matrix generator
Put someone on-call before you need to
The worst time to figure out who's responsible is during an incident.
Most teams wait until after a bad incident to set up on-call. Then they scramble to build a rotation while half the team is still stressed about the last outage. Do it before you need it.
Start simple
Weekly rotation. Primary + backup. That's the minimum.
Primary is the person who gets paged first. Backup is the person who gets paged if primary doesn't respond. Without a backup, a single person in the shower or on a flight means nobody responds for 30 minutes.
Weekly works for most teams. Daily rotations are exhausting: nobody gets into a rhythm. Monthly rotations are too long: the on-call person burns out by week three and starts ignoring alerts.
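A weekly primary + backup rotation is simple enough to compute rather than maintain by hand. A minimal sketch, assuming a fixed roster and a Monday start date; the names are placeholders:

```python
from datetime import date

ROSTER = ["alice", "bob", "carol", "dave"]   # placeholder names
ROTATION_START = date(2024, 1, 1)            # a Monday; week 0 of the rotation

def on_call_for(day: date) -> tuple[str, str]:
    """Return (primary, backup) for the week containing `day`."""
    week = (day - ROTATION_START).days // 7
    primary = ROSTER[week % len(ROSTER)]
    backup = ROSTER[(week + 1) % len(ROSTER)]  # next week's primary covers backup
    return primary, backup

print(on_call_for(date.today()))
```

Making next week's primary this week's backup has a nice side effect: everyone gets a warm-up week of context before their primary week starts.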
Cover business hours first
If your customers are mostly in one timezone, start with business-hours on-call. You don't need 24/7 coverage on day one. Add it when your customer base or your SLAs demand it.
Acknowledge the burden
On-call is work. Engineers who carry pagers outside working hours deserve recognition. Some teams pay $200-500/week. Others give comp time. The specific mechanism matters less than the acknowledgment that being on-call is a real cost.
Treat on-call as free and the good engineers leave. It doesn't take long.
Related: On-call rotation guide | On-call schedule builder
One channel per incident
Slack is where your team already works. Use it.
When an incident fires, create a dedicated channel for it. Not a thread in #engineering. Not a DM group. A channel named something obvious, like inc-42-checkout-api-down, where everything about this incident happens. The first responder creates it using a standard naming format, so there's no ambiguity about where to go.
Why this matters
Without a dedicated channel, updates scatter across DMs, threads, and the wrong channels. Someone asks "what's the latest?" and three people answer with three different versions. The CEO finds a 20-minute-old message and panics.
With one, there's one place to look. Status updates, debugging notes, decisions, all in the same channel. If the update isn't in the incident channel, it didn't happen.
How it works in practice
Alert fires, incident channel gets created, responders get pulled in. All updates go there. When it's resolved, archive the channel.
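If you're scripting this yourself rather than using a tool, channel creation is a couple of API calls. A minimal sketch with slack_sdk, assuming a bot token in SLACK_BOT_TOKEN and placeholder responder IDs; the naming format mirrors the convention above:

```python
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def open_incident_channel(incident_id: int, slug: str, responder_ids: list[str]) -> str:
    """Create a public incident channel, pull responders in, post the first update."""
    name = f"inc-{incident_id}-{slug}"  # e.g. inc-42-checkout-api-down
    channel = client.conversations_create(name=name, is_private=False)["channel"]["id"]
    client.conversations_invite(channel=channel, users=",".join(responder_ids))
    client.chat_postMessage(channel=channel, text=f"Incident {incident_id} open. All updates go here.")
    return channel
```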
Keep the channel public. Leadership will check it during a SEV1 whether you invite them or not. Better they read a clean timeline than ping engineers for updates mid-debug.
Related: Slack incident management: what works and what breaks
Escalation is not optional
This is where most DIY setups fail. They page once and hope.
The failure mode looks like this: an alert fires at 2 AM. The on-call engineer's phone is on silent. Or they're sick. Or they looked at the notification and fell back asleep. Nobody else knows. Twenty minutes later, customers are complaining on Twitter and your CEO is texting the CTO asking what's happening.
Automatic, not manual
If the on-call person doesn't acknowledge within 5 minutes, escalate. Automatically. Don't rely on someone noticing and manually paging the backup. At 2 AM, nobody is watching.
What you want is an escalation chain where each step gets harder to ignore:
- 0 min: Slack DM + push notification to primary on-call
- 2-5 min: SMS and voice call to primary if still unacknowledged
- 5 min: Page the backup on-call, all channels
- If neither responds: Escalate to engineering manager
Notice each step uses a more interruptive channel than the last. If your escalation sends another Slack message to someone who already missed the first one, you haven't escalated. You've just been louder in the same room.
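Conceptually, the chain is just data plus a loop that keeps walking until someone acknowledges. A minimal sketch; the notify and acknowledgment hooks are stand-ins for whatever paging integration you actually use, and the manager step's timing is an assumption:

```python
import time

# (minutes after the alert, who to page, channels to use)
ESCALATION_CHAIN = [
    (0,  "primary",     ["slack_dm", "push"]),
    (2,  "primary",     ["sms", "voice"]),
    (5,  "backup",      ["slack_dm", "push", "sms", "voice"]),
    (10, "eng_manager", ["voice"]),  # timing here is an assumption; pick what fits
]

def notify(target: str, channels: list[str]) -> None:
    print(f"paging {target} via {', '.join(channels)}")  # stand-in for a real paging integration

def acknowledged() -> bool:
    return False  # stand-in: check your alerting tool's ack state

def run_escalation() -> None:
    """Walk the chain until someone acknowledges; never page once and hope."""
    started = time.monotonic()
    for offset_min, target, channels in ESCALATION_CHAIN:
        wait = started + offset_min * 60 - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        if acknowledged():
            return
        notify(target, channels)
```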
Postmortems that people actually read
One page
Keep the postmortem to one page. Nobody reads the five-page ones, which makes them worse than useless: they consume time to write and teach nothing.
Answer three questions:
- What happened? Timeline. What broke, when, what was the impact.
- Why did it happen? Root cause. Not "the server crashed" but why the server crashed and why you didn't catch it earlier.
- What are we changing? 1-3 specific action items with owners and deadlines.
If you need more detail for a major incident, add an appendix. But the core document that people read should fit on one page.
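If it helps, the one-page structure can live as a template your tooling pre-fills the moment an incident resolves, so the owner only has to answer the three questions. A sketch with placeholder field names:

```python
from datetime import date

POSTMORTEM_TEMPLATE = """\
Postmortem: {title}
Severity: {severity} | Date: {incident_date} | Owner: {owner}

What happened
{timeline}

Why it happened
{root_cause}

What we're changing
{actions}
"""

def new_postmortem(title: str, severity: str, owner: str) -> str:
    """Pre-fill the skeleton so the owner only has to answer the three questions."""
    return POSTMORTEM_TEMPLATE.format(
        title=title,
        severity=severity,
        incident_date=date.today().isoformat(),
        owner=owner,
        timeline="- HH:MM alert fired\n- HH:MM mitigated\n- HH:MM resolved",
        root_cause="(systems, not people)",
        actions="- [ ] action item, owner, deadline",
    )
```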
48-hour rule
If the postmortem isn't written within 48 hours, it won't get written. Details fade, people move on, the next sprint starts and nobody circles back.
Assign an owner immediately after the incident resolves. Not "the team," a specific person with a specific deadline.
Blameless is not optional
The first time someone gets called out in a postmortem, nobody writes honest ones again. Engineers will sanitize everything. The postmortem becomes theater, a document that exists to prove you did a postmortem, not to prevent the next incident.
Focus on systems, not people. "The deploy went out without a canary" not "Alex deployed without checking."
Not every incident needs one. SEV1: always. SEV2: judgment call, did you learn something? SEV3: a brief note in the incident timeline is enough.
Related: Post-incident review templates (3 ready-to-use)
What to skip (for now)
The biggest risk at this stage isn't missing a feature. It's overbuilding process that nobody follows.
Runbooks and playbooks can wait. You don't have enough incident patterns yet. After you've seen the same type of incident three times, write a runbook for it. Before that, you're writing fiction.
Don't bother with workflow automation either. Do the process manually for 20 incidents first. You'll learn what actually needs automating versus what you only assumed did.
SLOs and error budgets? At 30 engineers, you already know your service is unreliable. You don't need a dashboard to confirm it. The exception: if you're selling to enterprise or running infra-heavy systems, basic SLO thinking earlier doesn't hurt. But formal error budgets can wait until 100+ engineers, when you need to make real tradeoffs between reliability and shipping.
For most B2B teams at this stage, reliable escalation matters more than a status page. If your customers expect proactive comms, use a hosted service. Don't build one.
And skip incident analytics for now. MTTR dashboards are meaningless if your escalation doesn't work and your postmortems aren't happening. Fix the process first.
Incident management setup checklist
Set these up in this order:
- Three severity levels (SEV1, SEV2, SEV3). Classify fast, default higher.
- On-call rotation with primary + backup, weekly. Acknowledge the burden.
- One dedicated Slack channel per incident, kept public.
- Automatic escalation across multiple channels. 5-minute timeout before it moves up.
- One-page postmortems within 48 hours. Blameless. Specific owner.
Skip everything else until one of these breaks.
The goal is making the next incident less chaotic than the last one. Run these for a few months and you'll know what needs to change, because you'll have real incidents telling you.
If you want this setup without building it yourself, Runframe handles severity levels, on-call scheduling, multi-channel escalation, and postmortems out of the box. Free to start.
Once your process is running, read how teams scale incident management past 50 engineers for what comes next.