At 15 engineers, you can get away without formal incident management. Someone posts in #engineering, the right person sees it, they fix it. It works until it doesn't.
Then you hit 20. Maybe 30. Someone pages the entire team at 2 AM because a staging dashboard loaded slowly. The last real incident took 45 minutes before anyone figured out who should even be looking at it.
That's the inflection point. Not when things break (things always break) but when the coordination around the break starts costing more than the break itself. Some teams hit it at 15 people. Most feel it by 30.
This is the setup guide for that moment. What to set up, in what order, with opinionated defaults that work whether you're 15 engineers or 100.
TL;DR: Start with three severity levels (SEV1-3), set up weekly on-call with primary + backup, create a dedicated Slack channel per incident, wire automatic multi-channel escalation with a 5-minute timeout, and do one-page blameless postmortems within 48 hours. Skip everything else until one of these breaks.
What you'll set up
- Three severity levels: enough to triage, not enough to argue about
- On-call rotation: primary + backup, weekly, with real escalation
- Incident channels: a dedicated Slack channel per incident
- Escalation that works: multi-channel, automatic, no gaps
- Short postmortems: one page, 48 hours, blameless
- What to skip: the stuff that doesn't matter yet
Start with three severity levels, not five
You need enough levels to make decisions, not so many that you start arguments about classification.
SEV1: Customers can't use the product. Revenue is affected. Drop everything.
SEV2: Something is degraded and customers notice, but there's a workaround. Painful, but not down.
SEV3: Minor or internal. Fix it during business hours.
Three levels. You can add SEV0 (apocalypse scenario) later when you have 50+ engineers and genuinely need a level above "drop everything." You can add SEV4 (proactive work) when you have enough incident volume to categorize prevention separately.
The mistake teams make is copying Google's severity framework on day one. They end up with five levels nobody can distinguish and spend the first 10 minutes of every incident arguing about whether it's a SEV2 or a SEV3.
When in doubt, classify higher. A SEV1 that turns out to be a SEV2 wastes some attention. A SEV2 that was actually a SEV1 wastes customer trust.
Use the severity level to decide two things: who gets paged, and how fast you need to respond. Everything else is overhead at this stage.
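To make that concrete, here's a minimal sketch of severity as configuration rather than a framework: each level maps to who gets paged and how fast someone has to acknowledge. The paging targets and response times below are illustrative defaults, not prescriptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    description: str
    page: tuple[str, ...]     # who gets paged
    respond_within_min: int   # how fast someone must acknowledge

SEVERITIES = {
    "SEV1": Severity("Customers can't use the product", ("primary-oncall", "backup-oncall"), 5),
    "SEV2": Severity("Degraded, but a workaround exists", ("primary-oncall",), 30),
    "SEV3": Severity("Minor or internal", (), 8 * 60),  # business hours, no page
}

def pick(candidates: list[str]) -> str:
    """When in doubt, classify higher ("SEV1" sorts before "SEV3")."""
    return min(candidates)
```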
Related: Incident severity levels: SEV0-SEV4 matrix | Severity matrix generator
Put someone on-call before you need to
The worst time to figure out who's responsible is during an incident.
Most teams wait until after a bad incident to set up on-call. Then they scramble to build a rotation while half the team is still stressed about the last outage. Do it before you need it.
Start simple
Weekly rotation. Primary + backup. That's the minimum.
Primary is the person who gets paged first. Backup is the person who gets paged if primary doesn't respond. Without a backup, a single person in the shower or on a flight means nobody responds for 30 minutes.
Weekly works for most teams. Daily rotations are exhausting: nobody gets into a rhythm. Monthly rotations are too long: the on-call person burns out by week three and starts ignoring alerts.
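A weekly primary + backup rotation is simple enough to compute rather than maintain by hand. A minimal sketch, assuming a fixed roster and a Monday start date; the names are placeholders:

```python
from datetime import date

ROSTER = ["alice", "bob", "carol", "dave"]   # placeholder names
ROTATION_START = date(2024, 1, 1)            # a Monday; week 0 of the rotation

def on_call_for(day: date) -> tuple[str, str]:
    """Return (primary, backup) for the week containing `day`."""
    week = (day - ROTATION_START).days // 7
    primary = ROSTER[week % len(ROSTER)]
    backup = ROSTER[(week + 1) % len(ROSTER)]  # next week's primary covers backup
    return primary, backup

print(on_call_for(date.today()))
```

Making next week's primary this week's backup has a nice side effect: everyone gets a warm-up week of context before their primary week starts.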
Cover business hours first
If your customers are mostly in one timezone, start with business-hours on-call. You don't need 24/7 coverage on day one. Add it when your customer base or your SLAs demand it.
Acknowledge the burden
On-call is work. Engineers who carry pagers outside working hours deserve recognition. Some teams pay $200-500/week. Others give comp time. The specific mechanism matters less than the acknowledgment that being on-call is a real cost.
Treat on-call as free and the good engineers leave. It doesn't take long.
Related: On-call rotation guide | On-call schedule builder
One channel per incident
Slack is where your team already works. Use it.
When an incident fires, create a dedicated channel for it. Not a thread in #engineering. Not a DM group. A channel named something obvious, like inc-42-checkout-api-down, where everything about this incident happens. The first responder creates it using a standard naming format, so there's no ambiguity about where to go.
Why this matters
Without a dedicated channel, updates scatter across DMs, threads, and the wrong channels. Someone asks "what's the latest?" and three people answer with three different versions. The CEO finds a 20-minute-old message and panics.
With one, there's one place to look. Status updates, debugging notes, decisions, all in the same channel. If the update isn't in the incident channel, it didn't happen.
How it works in practice
Alert fires, incident channel gets created, responders get pulled in. All updates go there. When it's resolved, archive the channel.
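If you're scripting this yourself rather than using a tool, channel creation is a couple of API calls. A minimal sketch with slack_sdk, assuming a bot token in SLACK_BOT_TOKEN and placeholder responder IDs; the naming format mirrors the convention above:

```python
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def open_incident_channel(incident_id: int, slug: str, responder_ids: list[str]) -> str:
    """Create a public incident channel, pull responders in, post the first update."""
    name = f"inc-{incident_id}-{slug}"  # e.g. inc-42-checkout-api-down
    channel = client.conversations_create(name=name, is_private=False)["channel"]["id"]
    client.conversations_invite(channel=channel, users=",".join(responder_ids))
    client.chat_postMessage(channel=channel, text=f"Incident {incident_id} open. All updates go here.")
    return channel
```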
Keep the channel public. Leadership will check it during a SEV1 whether you invite them or not. Better they read a clean timeline than ping engineers for updates mid-debug.
Related: Slack incident management: what works and what breaks
Escalation is not optional
This is where most DIY setups fail. They page once and hope.
The failure mode looks like this: an alert fires at 2 AM. The on-call engineer's phone is on silent. Or they're sick. Or they looked at the notification and fell back asleep. Nobody else knows. Twenty minutes later, customers are complaining on Twitter and your CEO is texting the CTO asking what's happening.
Automatic, not manual
If the on-call person doesn't acknowledge within 5 minutes, escalate. Automatically. Don't rely on someone noticing and manually paging the backup. At 2 AM, nobody is watching.
What you want is an escalation chain where each step gets harder to ignore:
- 0 min: Slack DM + push notification to primary on-call
- 2-5 min: SMS and voice call to primary if still unacknowledged
- 5 min: Page the backup on-call, all channels
- If neither responds: Escalate to engineering manager
Notice each step uses a more interruptive channel than the last. If your escalation sends another Slack message to someone who already missed the first one, you haven't escalated. You've just been louder in the same room.
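Conceptually, the chain is just data plus a loop that keeps walking until someone acknowledges. A minimal sketch; the notify and acknowledgment hooks are stand-ins for whatever paging integration you actually use, and the manager step's timing is an assumption:

```python
import time

# (minutes after the alert, who to page, channels to use)
ESCALATION_CHAIN = [
    (0,  "primary",     ["slack_dm", "push"]),
    (2,  "primary",     ["sms", "voice"]),
    (5,  "backup",      ["slack_dm", "push", "sms", "voice"]),
    (10, "eng_manager", ["voice"]),  # timing here is an assumption; pick what fits
]

def notify(target: str, channels: list[str]) -> None:
    print(f"paging {target} via {', '.join(channels)}")  # stand-in for a real paging integration

def acknowledged() -> bool:
    return False  # stand-in: check your alerting tool's ack state

def run_escalation() -> None:
    """Walk the chain until someone acknowledges; never page once and hope."""
    started = time.monotonic()
    for offset_min, target, channels in ESCALATION_CHAIN:
        wait = started + offset_min * 60 - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        if acknowledged():
            return
        notify(target, channels)
```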
Postmortems that people actually read
One page
Keep the postmortem to one page. Nobody reads the five-page ones, which makes them worse than useless: they consume time to write and teach nothing.
Answer three questions:
- What happened? Timeline. What broke, when, what was the impact.
- Why did it happen? Root cause. Not "the server crashed" but why the server crashed and why you didn't catch it earlier.
- What are we changing? 1-3 specific action items with owners and deadlines.
If you need more detail for a major incident, add an appendix. But the core document that people read should fit on one page.
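If it helps, the one-page structure can live as a template your tooling pre-fills the moment an incident resolves, so the owner only has to answer the three questions. A sketch with placeholder field names:

```python
from datetime import date

POSTMORTEM_TEMPLATE = """\
Postmortem: {title}
Severity: {severity} | Date: {incident_date} | Owner: {owner}

What happened
{timeline}

Why it happened
{root_cause}

What we're changing
{actions}
"""

def new_postmortem(title: str, severity: str, owner: str) -> str:
    """Pre-fill the skeleton so the owner only has to answer the three questions."""
    return POSTMORTEM_TEMPLATE.format(
        title=title,
        severity=severity,
        incident_date=date.today().isoformat(),
        owner=owner,
        timeline="- HH:MM alert fired\n- HH:MM mitigated\n- HH:MM resolved",
        root_cause="(systems, not people)",
        actions="- [ ] action item, owner, deadline",
    )
```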
48-hour rule
If the postmortem isn't written within 48 hours, it won't get written. Details fade, people move on, the next sprint starts and nobody circles back.
Assign an owner immediately after the incident resolves. Not "the team," a specific person with a specific deadline.
Blameless is not optional
The first time someone gets called out in a postmortem, nobody writes honest ones again. Engineers will sanitize everything. The postmortem becomes theater, a document that exists to prove you did a postmortem, not to prevent the next incident.
Focus on systems, not people. "The deploy went out without a canary" not "Alex deployed without checking."
Not every incident needs one. SEV1: always. SEV2: judgment call, did you learn something? SEV3: a brief note in the incident timeline is enough.
Related: Post-incident review templates (3 ready-to-use)
What to skip (for now)
The biggest risk at this stage isn't missing a feature. It's overbuilding process that nobody follows.
Runbooks and playbooks can wait. You don't have enough incident patterns yet. After you've seen the same type of incident three times, write a runbook for it. Before that, you're writing fiction.
Don't bother with workflow automation either. Do the process manually for 20 incidents first. You'll learn what actually needs automating versus what you only assumed did.
SLOs and error budgets? At 30 engineers, you already know your service is unreliable. You don't need a dashboard to confirm it. The exception: if you're selling to enterprise or running infra-heavy systems, basic SLO thinking earlier doesn't hurt. But formal error budgets can wait until 100+ engineers, when you need to make real tradeoffs between reliability and shipping.
For most B2B teams at this stage, reliable escalation matters more than a status page. If your customers expect proactive comms, use a hosted service. Don't build one.
And skip incident analytics for now. MTTR dashboards are meaningless if your escalation doesn't work and your postmortems aren't happening. Fix the process first.
Incident management setup checklist
Set these up in this order:
- Three severity levels (SEV1, SEV2, SEV3). Classify fast, default higher.
- On-call rotation with primary + backup, weekly. Acknowledge the burden.
- One dedicated Slack channel per incident, kept public.
- Automatic escalation across multiple channels. 5-minute timeout before it moves up.
- One-page postmortems within 48 hours. Blameless. Specific owner.
Skip everything else until one of these breaks.
The goal is making the next incident less chaotic than the last one. Run these for a few months and you'll know what needs to change, because you'll have real incidents telling you.
If you want this setup without building it yourself, Runframe handles severity levels, on-call scheduling, multi-channel escalation, and postmortems out of the box. Free to start.
Once your process is running, read how teams scale incident management past 50 engineers for what comes next.