
Incident Severity Levels: The Framework That Actually Works

SEV0 vs SEV1 vs SEV2: What engineering teams actually use for incident severity classification. Includes real frameworks, templates, and decision trees.

Runframe Team · Jan 17, 2026 · 11 min read
Insights from 25+ engineering teams on why alert fatigue is causing real outages

A team told us someone paged the entire org at 3 AM because a dashboard was loading 200ms slower than usual. Meanwhile, actual customer-impacting outages got ignored because "everything is a SEV1."

When you're scaling from 20 to 200 people, it's tough to get severity levels right the first time. Without clear definitions, every incident feels like a crisis and on-call burns out. Here's what we've seen work across dozens of teams at your stage.

Without clear severity levels, you can't prioritize response. See our guide on incident management.

TL;DR

  • We recommend SEV0-SEV4 (clearer than SEV1-SEV5, but start with what works for you)
  • SEV0 = catastrophic, SEV1 = core service down, SEV2 = degraded with workaround, SEV3 = minor, SEV4 = proactive
  • Classify in 30 seconds using: "Is revenue/users impacted? Is there a workaround?"
  • Consider adding SEV4 for proactive work (one team reported it prevented roughly 80% of potential incidents in 6 months)
  • Severity ≠ Priority (severity = impact, priority = fix order)

SEV0-SEV4: The Framework

We recommend starting at zero, not one. SEV0 = zero room for error—it's more intuitive than SEV1 being your worst case.

That said, if your team is under 50 people, you might start with just 3 levels (SEV1-SEV3) and add SEV0 and SEV4 as you scale. Here's the full framework:

Severity | Impact | Response | Who
SEV0 | Catastrophic. Data loss, security breach, total outage, or critical revenue-impacting failure | Ack target: 15 min | War room (IC + core responders; exec notification depends on your org)
SEV1 | Critical. Core service down for everyone | Ack target: 30 min | On-call + backup
SEV2 | Major. Significant degradation, workaround exists | Ack target: 1 hour | On-call
SEV3 | Minor. Limited impact, business hours fix | Business hours | Don't page
SEV4 | Pre-emptive. Could break, proactive fix | Backlog | Owner + due window

The difference between SEV1 and SEV2? One question: Is there a workaround?

Checkout completely broken = SEV1 (no workaround). Search down but category browsing works = SEV2 (workaround exists).

Simple.
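
If you encode this matrix in your tooling, nobody has to remember ack targets at 3 AM. Here's a minimal sketch in Python: the level definitions match the table above, but the SeverityPolicy structure and its field names are illustrative, not tied to any particular incident tool.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional

class Severity(Enum):
    SEV0 = 0  # catastrophic: data loss, breach, total outage
    SEV1 = 1  # critical: core service down, no workaround
    SEV2 = 2  # major: significant degradation, workaround exists
    SEV3 = 3  # minor: limited impact, business-hours fix
    SEV4 = 4  # pre-emptive: could break, proactive fix

@dataclass(frozen=True)
class SeverityPolicy:
    ack_target: Optional[timedelta]  # None = no paging SLA
    page: str                        # who gets pulled in

POLICIES = {
    Severity.SEV0: SeverityPolicy(timedelta(minutes=15), "war room: IC + core responders"),
    Severity.SEV1: SeverityPolicy(timedelta(minutes=30), "on-call + backup"),
    Severity.SEV2: SeverityPolicy(timedelta(hours=1), "on-call"),
    Severity.SEV3: SeverityPolicy(None, "business hours, don't page"),
    Severity.SEV4: SeverityPolicy(None, "backlog: owner + due window"),
}
```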

What teams at your stage say:

"Start with 3 levels. Don't over-engineer day one. You can always add SEV0 and SEV4 later."
— CTO, 40-person startup

"We added SEV4 when we hit 80 people. Prevented 38 out of 47 potential incidents in 6 months."
— Engineering Manager, Series B SaaS

Why SEV4 Matters (And When to Add It)

Many teams start without SEV4—it can feel like overhead when you're just trying to survive incidents.

"If nothing's broken, why track it?"

Fair question. Here's when it becomes valuable:

If you're under 50 people: You probably don't need SEV4 yet. Focus on responding to actual incidents first.

When you hit 75-100 people: This is when SEV4 becomes valuable. You have enough operational maturity to track "could break" work systematically.

What happens without SEV4 at scale:

→ Disk space hits 100% at 2 AM (could have been SEV4 at 80%)
→ SSL cert expires, users see security warnings (could have been SEV4 at 30 days)
→ Database query gets 10x slower overnight (could have been SEV4 when it hit 2x)

Without SEV4, you're always reacting. Never preventing.

What Each Level Means

SEV0: The Building Is On Fire

Complete outage. Data loss. Security breach. Critical revenue-impacting failure.

Database corrupted? Multi-region outage? Authentication completely broken? Payment processing down?

That's SEV0. Wake everyone. War room. You have 15 minutes.

Real examples:

  • Database corruption with data loss (can't recover from backup)
  • AWS us-east-1 down AND your backup region failed
  • Security breach exposing customer data
  • Authentication completely broken (nobody can log in)
  • Payment processing down (revenue loss >$10K/hour)

SEV1: Core Service Down

Major impact but not catastrophic. Core service unavailable for most/all customers, with no workaround.

API totally down. Checkout completely broken. Search gone (if search is a core workflow for your product). Auth intermittent for a meaningful subset of users.

Page on-call immediately. All hands on deck during business hours. 30-minute target.

Real examples:

  • Total API outage (all endpoints returning 500)
  • Checkout flow completely broken (can't process payments)
  • Search functionality down (core feature for your product)
  • Authentication intermittent (meaningful subset of users can't log in)
  • Performance degradation (APIs materially degraded, not just slower)

SEV2: Significant but Workaround Exists

Broken but usable. Meaningful subset of customers affected, or core functionality degraded but usable.

Checkout failing for some users? File uploads broken? API materially degraded but responding?

Primary on-call handles it. Don't wake backup. 1-hour target.

Real examples:

  • Checkout failing for some users (payment gateway issue for some cards)
  • File uploads completely broken (users can't upload, but can use existing files)
  • API materially degraded but usable (users can still complete key workflows, possibly slower)
  • Dashboard not loading (users can still use core product)
  • Single region degradation (multi-region setup, one region struggling)

SEV3: Minor

Partial failure. Limited impact. Not urgent.

Profile pictures broken. Intermittent errors that auto-recover. Reporting delayed.

Fix during business hours. Don't page on-call. Can wait until morning.

Real examples:

  • Minor feature broken (user profile pictures not displaying)
  • Intermittent errors that auto-recover (happens a few times/hour, clears itself)
  • Reporting delay (analytics data not real-time, updates hourly)
  • Non-critical integration failing (Slack notifications delayed, email works)
  • UI polish issues (button misaligned, font wrong)

SEV4: Pre-emptive

Nothing broken yet. But something could.

Disk at 80%. SSL expiring soon. Query slowing down. Dependency vulnerability. Monitoring gap.

Create a ticket with an owner + due window (e.g., "this sprint" / "within 30 days"). No page needed.

Real examples:

  • Disk space at 80% (not critical yet, but will be in 2 weeks)
  • SSL certificate expiring in 30 days
  • Database query degrading (taking 2x longer, not failed yet)
  • Dependency vulnerability (CVE in a library, not exploited)
  • Monitoring gap discovered (no alerting for a critical service)
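
One way to make SEV4 stick is to have routine checks file the ticket for you instead of relying on someone to notice. A rough sketch along those lines, assuming a create_ticket helper wired to your tracker (the thresholds mirror the examples above; the helper and its arguments are placeholders):

```python
import shutil
import socket
import ssl
from datetime import datetime, timezone

def create_ticket(title: str, severity: str, due_days: int) -> None:
    # Placeholder: wire this up to your tracker (Jira, Linear, GitHub Issues, ...).
    print(f"[{severity}] {title} (due in {due_days} days)")

def check_disk(path: str = "/", threshold: float = 0.80) -> None:
    usage = shutil.disk_usage(path)
    used = usage.used / usage.total
    if used >= threshold:
        create_ticket(f"Disk on {path} at {used:.0%}", "SEV4", due_days=14)

def check_ssl_expiry(host: str, warn_days: int = 30) -> None:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, 443), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]  # e.g. 'Jun  1 12:00:00 2026 GMT'
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
    days_left = (expires - datetime.now(timezone.utc)).days
    if days_left <= warn_days:
        create_ticket(f"TLS cert for {host} expires in {days_left} days", "SEV4", due_days=7)
```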

Classify Fast. Don't Debate.

Target: 30 seconds to classify.

When you're in the middle of an incident, speed matters more than perfection. If you're debating SEV1 vs SEV2 for 5 minutes while customers wait, just pick one and move on.

Pro tip: Default higher when uncertain. It's easier to downgrade a SEV1 to SEV2 later than explain why you under-classified and delayed response.

Is this catastrophic (data loss, security breach, total outage)? → SEV0

Is a core workflow blocked for most users?

  • No workaround → SEV1
  • Workaround exists → SEV2

Otherwise: limited impact → SEV3; not broken yet → SEV4

Tie-breaker: pick higher, note why, downgrade later.
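
The same 30-second tree, written out as a small function you could drop into a runbook or a Slack bot. The inputs are illustrative yes/no questions; adapt them to your own "materially impacted" definition.

```python
def classify(catastrophic: bool, core_workflow_blocked: bool,
             workaround_exists: bool, broken_now: bool) -> str:
    """30-second classification: impact first, workaround second."""
    if catastrophic:                  # data loss, breach, total outage
        return "SEV0"
    if core_workflow_blocked:         # blocked for most users / material revenue
        return "SEV2" if workaround_exists else "SEV1"
    if broken_now:                    # something is actually broken, limited impact
        return "SEV3"
    return "SEV4"                     # not broken yet, proactive fix

# Checkout completely broken, no workaround -> SEV1
assert classify(False, True, False, True) == "SEV1"
# Search down but category browsing works -> SEV2
assert classify(False, True, True, True) == "SEV2"
# Tie-breaker when unsure: pick the higher severity, note why, downgrade later.
```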

Common Questions (What We've Learned from Teams at Your Stage)

"It's 2 AM and I'm not sure if this is SEV1 or SEV2"

Default SEV1. Assess the situation. Page backup only if blocked or primary hasn't responded within your escalation window.

You can downgrade in the morning. You can't un-break customer trust.
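
If you want the escalation rule to be unambiguous at 2 AM, encode it. A minimal sketch, assuming a 5-minute escalation window as an example (use whatever window your on-call policy actually defines):

```python
from datetime import datetime, timedelta, timezone

ESCALATION_WINDOW = timedelta(minutes=5)  # example value; use your own policy

def who_to_page(primary_paged_at: datetime, primary_acked: bool, blocked: bool) -> str:
    """Escalate to backup only if the primary is blocked or overdue on ack."""
    overdue = datetime.now(timezone.utc) - primary_paged_at > ESCALATION_WINDOW
    if blocked or (not primary_acked and overdue):
        return "backup on-call"
    return "primary on-call"
```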

"Only 5% of users are affected, but they're our biggest customers"

Use your "materially impacted" definition. If those 5% represent 40% of revenue, it's material.

SEV1.
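
That "materially impacted" definition is easy to encode once you agree on thresholds. A sketch with illustrative numbers (the 20% user and 25% revenue thresholds are placeholders; pick your own):

```python
def is_material(user_share: float, revenue_share: float,
                user_threshold: float = 0.20, revenue_threshold: float = 0.25) -> bool:
    """Material if either the affected user share or revenue share crosses your threshold."""
    return user_share >= user_threshold or revenue_share >= revenue_threshold

# 5% of users, but they carry 40% of revenue -> material -> treat it as SEV1.
assert is_material(0.05, 0.40)
```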

"The bug is cosmetic but our CEO is freaking out"

Still SEV3. Severity = customer impact, not internal panic.

But maybe add "Executive visibility" as a separate flag. Some teams use:

  • Severity: SEV3 (minor)
  • Priority: P1 (fix today)
  • Visibility: High (CEO watching)

This way you fix it fast without training on-call to page for non-issues.
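
Modeling these as independent fields on the incident record keeps severity honest while still letting the cosmetic-but-CEO-visible bug get fixed today. A minimal sketch; the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    title: str
    severity: str     # customer impact (SEV0-SEV4); set once, rarely changes
    priority: str     # fix order (P1-P3); changes with context
    visibility: str   # "normal" or "high" (exec watching)
    page_oncall: bool

cosmetic_bug = Incident(
    title="Pricing page typo",
    severity="SEV3",    # impact is minor
    priority="P1",      # but fix it today
    visibility="high",  # CEO is watching
    page_oncall=False,  # still no 3 AM page
)
```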

"We fixed it in 5 minutes, do we still call it SEV1?"

Yes. Severity is based on potential impact, not duration.

If the database was completely down (even for 5 minutes), that's SEV1.

Duration doesn't change severity. It goes in MTTR metrics.

What Makes Severity Levels Actually Work

The key is specificity.

Vague (doesn't help at 3 AM): "SEV1 is when something important is broken."

Specific (makes decisions instant): "SEV1 is when a core service is down for all customers, with no workaround."

Frameworks That Actually Work (Choose Based on Your Size)

Startup Starter (20-50 people)

Start simple with 3 levels. Add more as you scale.

Severity | Impact | Response
SEV1 | Core service down | Page everyone
SEV2 | Degraded but usable | Page on-call
SEV3 | Minor, can wait | Business hours

Scaling Company (50-150 people)

Add SEV0 when catastrophic incidents become possible.

Severity | Impact | Page Who? | Ack SLA
SEV0 | Catastrophic | War room | 15 min
SEV1 | Core service down | On-call + backup | 30 min
SEV2 | Significant degradation | On-call | 1 hour
SEV3 | Minor issues | Business hours | 1 day

Add SEV4 (proactive work, backlog, no ack SLA) as you approach 75-100 people.

Enterprise-Bound (150+ people)

Full framework with war rooms and executive escalation.

Severity | Impact | Page Who? | Ack SLA
SEV0 | Catastrophic | War room | 15 min
SEV1 | Core service down | On-call + backup | 30 min
SEV2 | Significant degradation | On-call | 1 hour
SEV3 | Minor issues | Business hours | 1 day
SEV4 | Proactive work | Backlog | None

How to Evolve Your Severity Levels as You Scale

Starting with SEV1 vs SEV0

If you're under 50 people: Starting with SEV1-SEV3 is totally fine. Many teams do this.

As you grow past 100 people: Consider adding SEV0 for truly catastrophic incidents (data loss, security breaches). "Zero" = zero room for error, which makes the hierarchy more intuitive.

Why it matters: As your maximum possible blast radius grows, you need a tier above "critical outage" for existential threats.

When to Add SEV4 (Proactive Work)

As covered above: under 50 people you probably don't need SEV4 yet, and around 75-100 people it starts earning its keep, because you finally have the operational maturity to track "could break" work systematically.

What changes: Instead of jumping from "everything's fine" to "everything's on fire," you can track warning signs (disk at 80%, SSL expiring soon, query degrading) and fix them before they page someone at 3 AM.

One team added SEV4 at 80 people and prevented 38 of 47 potential incidents (roughly 80%) over 6 months.

Ignoring Business Impact

The problem: Technical severity ≠ business severity. A "minor" pricing page typo can be catastrophic if it causes chargebacks.

The fix: Define severity in terms of customer impact and revenue, not technical complexity.

Severity vs Priority

Teams confuse these constantly.

Severity = Business impact (doesn't change)
Priority = Fix order (changes based on context)

Example:

Footer has a typo: "Contact sales@compnay.com"

  • Severity: SEV3 (minor impact, users can still email sales@company.com directly)
  • Priority: P3 (fix this week)

BUT: Legal says the wrong email violates our contract SLA.

  • Severity: Still SEV3 (customer experience unchanged)
  • Priority: Now P1 (fix today, legal risk)

Severity didn't change. Priority did.

Another example:

Database completely down.

  • Severity: SEV0 (catastrophic)
  • Priority: P1 (obviously)

But your lead DBA is on vacation and backup doesn't know the system.

  • Severity: Still SEV0 (impact unchanged)
  • Priority: Still P1, but now you escalate to vendor support

Severity = "how bad is it?"
Priority = "when/how do we fix it?"

Don't conflate them.

"Severity is 'how bad is it?' Priority is 'when do we fix it?' Don't conflate them."
— Engineering Manager, Series B Healthcare SaaS

Make It Work: Rollout Plan

Week 1: Start Simple

If you're 20-50 people: Copy the 3-level version (SEV1-SEV3) and customize examples to your product.

If you're 50-150 people: Use the 4-level version (SEV0-SEV3 or SEV1-SEV4).

If you're 150+ people: Go with the full 5-level framework (SEV0-SEV4).

The key is customizing examples to YOUR business. B2B looks different than B2C. Enterprise SaaS looks different than consumer apps.

Week 1: Get Buy-In

Share in Slack. Review in standup.

Most importantly: Get agreement from the people who'll be woken up at 3 AM.

If on-call hates it, they won't use it.

"The best severity framework is the one your team actually uses. If on-call hates it, they'll ignore it."
— SRE Manager, 180-person infrastructure company

Weeks 2-5: Use It

Classify every incident. Track how it goes.

Week 6: Iterate

After 30 days, ask:

  • Classification debates? → Clarify definitions
  • SEV3s waking people? → Make "don't page" explicit
  • SEV4s actually getting fixed? → It's working

Expect to adjust 2-3 times in the first 6 months. That's normal.

Quick Reference: During an Incident

Q: "Is this SEV1 or SEV2?"
A: Can customers work around it? Yes = SEV2. No = SEV1.

Q: "Only 10% of users affected. Still SEV1?"
A: Is that 10% material to your business? (Check your definition)

Q: "We fixed it fast. Was it really SEV1?"
A: Severity = potential impact, not duration. Yes, still SEV1.

Q: "CEO is panicking but customer impact is minor"
A: Severity = customer impact. This is SEV3. (But maybe Priority P1)

Q: "Not sure. What do I do?"
A: Default higher. Downgrade later if needed.

FAQ

Q: SEV0-SEV4 or SEV1-SEV5?
A: SEV0-SEV4. "Zero" means no room for error. Many mature teams land here, but start with what works for your team.
Q: Can't tell if SEV1 or SEV2?
A: Default higher (SEV1). Easier to downgrade than explain under-classification.
Q: How many levels?
A: Start with 3-4. Most end up at 5 (SEV0-SEV4).
Q: Does severity change during an incident?
A: No. Based on initial impact. If things change dramatically, document it in the postmortem.
Q: Who decides?
A: Incident commander or first responder. Disagreement? Default higher, resolve in postmortem.

Generate Your Framework in 2 Minutes

If you want a copy/paste template, there's a severity matrix generator here:

/tools/incident-severity-matrix-generator

Or copy the table from this article and adapt it. Either way, have something defined before your next incident.

Automate Your Incident Response

Runframe replaces manual copy-pasting with a dedicated Slack workflow. Page the right people, spin up incident channels, and force structured updates—all without leaving Slack.

Join the Waitlist