A team told us someone paged the entire org at 3 AM because a dashboard was loading 200ms slower than usual. Meanwhile, actual customer-impacting outages got ignored because "everything is a SEV1."
When you're scaling from 20 to 200 people, it's tough to get severity levels right the first time. Without clear definitions, every incident feels like a crisis and on-call burns out. Here's what we've seen work across dozens of teams at your stage.
Without clear severity levels, you can't prioritize response. See our guide on incident management.
TL;DR
- We recommend SEV0-SEV4 (clearer than SEV1-SEV5, but start with what works for you)
- SEV0 = catastrophic, SEV1 = core service down, SEV2 = degraded with workaround, SEV3 = minor, SEV4 = proactive
- Classify in 30 seconds using: "Is revenue/users impacted? Is there a workaround?"
- Consider adding SEV4 for proactive work (one team reported it prevented ~80% of potential incidents over 6 months)
- Severity ≠ Priority (severity = impact, priority = fix order)
SEV0-SEV4: The Framework
We recommend starting at zero, not one. "SEV0" signals zero room for error, which is more intuitive than making SEV1 your worst case.
That said, if your team is under 50 people, you might start with just 3 levels (SEV1-SEV3) and add SEV0 and SEV4 as you scale. Here's the full framework:
| Severity | Impact | Response | Who |
|---|---|---|---|
| SEV0 | Catastrophic. Data loss, security breach, total outage, or critical revenue-impacting failure | Ack target: 15 min | War room (IC + core responders; exec notification depends on your org) |
| SEV1 | Critical. Core service down for everyone | Ack target: 30 min | On-call + backup |
| SEV2 | Major. Significant degradation, workaround exists | Ack target: 1 hour | On-call |
| SEV3 | Minor. Limited impact, business hours fix | Business hours | Don't page |
| SEV4 | Pre-emptive. Could break, proactive fix | Backlog | Owner + due window |
The difference between SEV1 and SEV2? One question: Is there a workaround?
Checkout completely broken = SEV1 (no workaround). Search down but category browsing works = SEV2 (workaround exists).
Simple.
What teams at your stage say:
"Start with 3 levels. Don't over-engineer day one. You can always add SEV0 and SEV4 later."
— CTO, 40-person startup
"We added SEV4 when we hit 80 people. Prevented 38 out of 47 potential incidents in 6 months."
— Engineering Manager, Series B SaaS
Why SEV4 Matters (And When to Add It)
Many teams start without SEV4—it can feel like overhead when you're just trying to survive incidents.
"If nothing's broken, why track it?"
Fair question. Here's when it becomes valuable:
If you're under 50 people: You probably don't need SEV4 yet. Focus on responding to actual incidents first.
When you hit 75-100 people: This is when SEV4 becomes valuable. You have enough operational maturity to track "could break" work systematically.
What happens without SEV4 at scale:
→ Disk space hits 100% at 2 AM (could have been SEV4 at 80%)
→ SSL cert expires, users see security warnings (could have been SEV4 at 30 days)
→ Database query gets 10x slower overnight (could have been SEV4 when it hit 2x)
Without SEV4, you're always reacting. Never preventing.
What Each Level Means
SEV0: The Building Is On Fire
Complete outage. Data loss. Security breach. Critical revenue-impacting failure.
Database corrupted? Multi-region outage? Authentication completely broken? Payment processing down?
That's SEV0. Spin up the war room and wake whoever you need. You have 15 minutes to ack.
Real examples:
- Database corruption with data loss (can't recover from backup)
- AWS us-east-1 down AND your backup region failed
- Security breach exposing customer data
- Authentication completely broken (nobody can log in)
- Payment processing down (revenue loss >$10K/hour)
SEV1: Core Service Down
Major impact but not catastrophic. Core service unavailable for most/all customers, with no workaround.
API totally down. Checkout completely broken. Search gone (if search is a core workflow for your product). Auth intermittent for a meaningful subset of users.
Page on-call immediately. All hands on deck during business hours. 30-minute target.
Real examples:
- Total API outage (all endpoints returning 500)
- Checkout flow completely broken (can't process payments)
- Search functionality down (core feature for your product)
- Authentication intermittent (meaningful subset of users can't log in)
- Severe performance degradation (APIs effectively unusable for core workflows, not just slow)
SEV2: Significant but Workaround Exists
Broken but usable. Meaningful subset of customers affected, or core functionality degraded but usable.
Checkout failing for some users? File uploads broken? API materially degraded but responding?
Primary on-call handles it. Don't wake backup. 1-hour target.
Real examples:
- Checkout failing for some users (payment gateway issue for some cards)
- File uploads completely broken (users can't upload, but can use existing files)
- API materially degraded but usable (users can still complete key workflows, possibly slower)
- Dashboard not loading (users can still use core product)
- Single region degradation (multi-region setup, one region struggling)
SEV3: Minor
Partial failure. Limited impact. Not urgent.
Profile pictures broken. Intermittent errors that auto-recover. Reporting delayed.
Fix during business hours. Don't page on-call. Can wait until morning.
Real examples:
- Minor feature broken (user profile pictures not displaying)
- Intermittent errors that auto-recover (happens a few times/hour, clears itself)
- Reporting delay (analytics data not real-time, updates hourly)
- Non-critical integration failing (Slack notifications delayed, email works)
- UI polish issues (button misaligned, font wrong)
SEV4: Pre-emptive
Nothing broken yet. But something could.
Disk at 80%. SSL expiring soon. Query slowing down. Dependency vulnerability. Monitoring gap.
Create a ticket with an owner + due window (e.g., "this sprint" / "within 30 days"). No page needed.
Real examples:
- Disk space at 80% (not critical yet, but will be in 2 weeks)
- SSL certificate expiring in 30 days
- Database query degrading (taking 2x longer, not failed yet)
- Dependency vulnerability (CVE in a library, not exploited)
- Monitoring gap discovered (no alerting for a critical service)
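If you already export basic metrics, a small scheduled job can turn thresholds like these into backlog tickets automatically. Here's a minimal Python sketch, assuming you can fetch disk usage, cert expiry, and query latencies yourself; the threshold values and the function's shape are illustrative, not prescriptive:

```python
from datetime import datetime, timedelta

# Illustrative thresholds from the examples above; tune them to your stack.
DISK_WARN_PCT = 80            # file a SEV4 well before disk hits 100%
CERT_WARN_DAYS = 30           # file a SEV4 before the SSL cert expires
QUERY_SLOWDOWN_FACTOR = 2.0   # file a SEV4 when a query doubles in latency

def sev4_checks(disk_used_pct: float, cert_expiry: datetime,
                query_ms: float, baseline_ms: float) -> list[str]:
    """Return SEV4 ticket descriptions. Nothing here pages anyone."""
    tickets = []
    if disk_used_pct >= DISK_WARN_PCT:
        tickets.append(f"Disk at {disk_used_pct:.0f}%: clean up or expand within 2 weeks")
    if cert_expiry - datetime.now() <= timedelta(days=CERT_WARN_DAYS):
        tickets.append(f"SSL cert expires {cert_expiry:%Y-%m-%d}: renew this sprint")
    if query_ms >= QUERY_SLOWDOWN_FACTOR * baseline_ms:
        tickets.append(f"Query at {query_ms:.0f}ms vs {baseline_ms:.0f}ms baseline: investigate")
    return tickets  # route each one to the backlog with an owner + due window
```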
Classify Fast. Don't Debate.
Target: 30 seconds to classify.
When you're in the middle of an incident, speed matters more than perfection. If you're debating SEV1 vs SEV2 for 5 minutes while customers wait, just pick one and move on.
Pro tip: Default higher when uncertain. It's easier to downgrade a SEV1 to SEV2 later than explain why you under-classified and delayed response.
Is this catastrophic (data loss, security breach, total outage)? → SEV0
Is a core workflow blocked for most users?
- No workaround → SEV1
- Workaround exists → SEV2
Otherwise: limited impact → SEV3; not broken yet → SEV4
Tie-breaker: pick higher, note why, downgrade later.
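If you want this decision tree somewhere executable, say a runbook script or a ChatOps command, here's a minimal sketch. The boolean inputs are assumptions; in practice a human still answers those questions:

```python
def classify(catastrophic: bool, core_blocked: bool,
             workaround: bool, anything_broken: bool) -> str:
    """The 30-second triage above, as code. When unsure, answer pessimistically."""
    if catastrophic:                        # data loss, security breach, total outage
        return "SEV0"
    if core_blocked:                        # core workflow blocked for most users
        return "SEV2" if workaround else "SEV1"
    return "SEV3" if anything_broken else "SEV4"  # limited impact vs. not broken yet
```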
Common Questions (What We've Learned from Teams at Your Stage)
"It's 2 AM and I'm not sure if this is SEV1 or SEV2"
Default SEV1. Assess the situation. Page backup only if blocked or primary hasn't responded within your escalation window.
You can downgrade in the morning. You can't un-break customer trust.
"Only 5% of users are affected, but they're our biggest customers"
Use your "materially impacted" definition. If those 5% represent 40% of revenue, it's material.
SEV1.
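If you want that materiality test written down rather than re-argued at 2 AM, it's one function. The 25% revenue threshold below is an illustrative assumption; pick your own number in advance:

```python
def is_material(affected_revenue: float, total_revenue: float,
                threshold: float = 0.25) -> bool:
    """Weight impact by revenue, not head count. The 25% threshold is illustrative."""
    return affected_revenue / total_revenue >= threshold

# 5% of users, but they carry 40% of revenue: material, so treat it as SEV1.
print(is_material(affected_revenue=400_000, total_revenue=1_000_000))  # True
```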
"The bug is cosmetic but our CEO is freaking out"
Still SEV3. Severity = customer impact, not internal panic.
But maybe add "Executive visibility" as a separate flag. Some teams use:
- Severity: SEV3 (minor)
- Priority: P1 (fix today)
- Visibility: High (CEO watching)
This way you fix it fast without training on-call to page for non-issues.
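A sketch of what tracking those three fields separately might look like; the record shape and field names are our own illustration, not a standard:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    title: str
    severity: str    # customer impact; stays put unless the facts change
    priority: str    # fix order; moves with context
    visibility: str  # who's watching; never a reason to page

cosmetic_bug = Incident(
    title="Pricing widget misaligned on homepage",
    severity="SEV3",    # minor customer impact
    priority="P1",      # fix today
    visibility="high",  # CEO watching
)
```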
"We fixed it in 5 minutes, do we still call it SEV1?"
Yes. Severity is based on potential impact, not duration.
If the database was completely down (even for 5 minutes), that's SEV1.
Duration doesn't change severity. It goes in MTTR metrics.
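For reference, MTTR is just the mean of (resolved minus started) across incidents; a minimal sketch:

```python
from datetime import datetime

def mttr_minutes(incidents: list[tuple[datetime, datetime]]) -> float:
    """Mean time to resolution in minutes; duration lives here, not in severity."""
    return sum((resolved - started).total_seconds() / 60
               for started, resolved in incidents) / len(incidents)
```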
What Makes Severity Levels Actually Work
The key is specificity.
Vague (doesn't help at 3 AM): "SEV1 is when something important is broken."
Specific (makes decisions instant): "SEV1 is when a core service is down for all customers, with no workaround."
Frameworks That Actually Work (Choose Based on Your Size)
Startup Starter (20-50 people)
Start simple with 3 levels. Add more as you scale.
| Severity | Impact | Response |
|---|---|---|
| SEV1 | Core service down | Page everyone |
| SEV2 | Degraded but usable | Page on-call |
| SEV3 | Minor, can wait | Business hours |
Scaling Company (50-150 people)
Add SEV0 once your blast radius is big enough that "critical" needs a tier above it. (Add SEV4 too once you're ready to track proactive work; see "When to Add SEV4" below.)
| Severity | Impact | Page Who? | Ack SLA |
|---|---|---|---|
| SEV0 | Catastrophic | War room | 15 min |
| SEV1 | Core service down | On-call + backup | 30 min |
| SEV2 | Significant degradation | On-call | 1 hour |
| SEV3 | Minor issues | Business hours | 1 day |
Enterprise-Bound (150+ people)
Full framework with war rooms and executive escalation.
| Severity | Impact | Page Who? | Ack SLA |
|---|---|---|---|
| SEV0 | Catastrophic | War room | 15 min |
| SEV1 | Core service down | On-call + backup | 30 min |
| SEV2 | Significant degradation | On-call | 1 hour |
| SEV3 | Minor issues | Business hours | 1 day |
| SEV4 | Proactive work | Backlog | None |
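If your paging bot or ChatOps tooling needs to read the matrix, it's easy to encode as plain data. A sketch mirroring the table above; the key names are our own:

```python
# The full 5-level matrix as data, so tooling can look up paging and SLA rules.
SEVERITY_MATRIX = {
    "SEV0": {"impact": "Catastrophic",            "page": "war room",         "ack_sla_min": 15},
    "SEV1": {"impact": "Core service down",       "page": "on-call + backup", "ack_sla_min": 30},
    "SEV2": {"impact": "Significant degradation", "page": "on-call",          "ack_sla_min": 60},
    "SEV3": {"impact": "Minor issues",            "page": None,               "ack_sla_min": None},  # business hours
    "SEV4": {"impact": "Proactive work",          "page": None,               "ack_sla_min": None},  # backlog
}
```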
How to Evolve Your Severity Levels as You Scale
Starting with SEV1 vs SEV0
If you're under 50 people: Starting with SEV1-SEV3 is totally fine. Many teams do this.
As you grow past 100 people: Consider adding SEV0 for truly catastrophic incidents (data loss, security breaches). "Zero" = zero room for error, which makes the hierarchy more intuitive.
Why it matters: As your maximum possible blast radius grows, you need a tier above "critical outage" for existential threats.
When to Add SEV4 (Proactive Work)
As covered earlier: under 50 people, you probably don't need SEV4 yet. Around 75-100 people, you'll have the operational maturity to track "could break" work systematically, and SEV4 starts paying for itself.
What changes: Instead of jumping from "everything's fine" to "everything's on fire," you can track warning signs (disk at 80%, SSL expiring soon, query degrading) and fix them before they page someone at 3 AM.
One team added SEV4 at 80 people and prevented 38 of 47 potential incidents (about 80%) over the next 6 months.
Ignoring Business Impact
The problem: Technical severity ≠ business severity. A "minor" pricing page typo can be catastrophic if it causes chargebacks.
The fix: Define severity in terms of customer impact and revenue, not technical complexity.
Severity vs Priority
Teams confuse these constantly.
Severity = Business impact (changes only if the facts change)
Priority = Fix order (changes based on context)
Example:
Footer has a typo: "Contact sales@compnay.com"
- Severity: SEV3 (minor impact, users can still email sales@company.com directly)
- Priority: P3 (fix this week)
BUT: Legal says the wrong email violates our contract SLA.
- Severity: Still SEV3 (customer experience unchanged)
- Priority: Now P1 (fix today, legal risk)
Severity didn't change. Priority did.
Another example:
Database completely down.
- Severity: SEV0 (catastrophic)
- Priority: P1 (obviously)
But your lead DBA is on vacation and backup doesn't know the system.
- Severity: Still SEV0 (impact unchanged)
- Priority: Still P1, but now you escalate to vendor support
Severity = "how bad is it?"
Priority = "when/how do we fix it?"
Don't conflate them.
"Severity is 'how bad is it?' Priority is 'when do we fix it?' Don't conflate them."
— Engineering Manager, Series B Healthcare SaaS
Make It Work: Rollout Plan
Week 1: Start Simple
If you're 20-50 people: Copy the 3-level version (SEV1-SEV3) and customize examples to your product.
If you're 50-150 people: Use the 4-level version (SEV0-SEV3 or SEV1-SEV4).
If you're 150+ people: Go with the full 5-level framework (SEV0-SEV4).
The key is customizing examples to YOUR business. B2B looks different than B2C. Enterprise SaaS looks different than consumer apps.
Week 1: Get Buy-In
Share in Slack. Review in standup.
Most importantly: Get agreement from the people who'll be woken up at 3 AM.
If on-call hates it, they won't use it.
"The best severity framework is the one your team actually uses. If on-call hates it, they'll ignore it."
— SRE Manager, 180-person infrastructure company
Weeks 2-5: Use It
Classify every incident. Track how it goes.
Week 6: Iterate
After 30 days, ask:
- Classification debates? → Clarify definitions
- SEV3s waking people? → Make "don't page" explicit
- SEV4s actually getting fixed? → It's working
Expect to adjust 2-3 times in the first 6 months. That's normal.
Quick Reference: During an Incident
Q: "Is this SEV1 or SEV2?"
A: Can customers work around it? Yes = SEV2. No = SEV1.
Q: "Only 10% of users affected. Still SEV1?"
A: Is that 10% material to your business? (Check your definition)
Q: "We fixed it fast. Was it really SEV1?"
A: Severity = potential impact, not duration. Yes, still SEV1.
Q: "CEO is panicking but customer impact is minor"
A: Severity = customer impact. This is SEV3. (But maybe Priority P1)
Q: "Not sure. What do I do?"
A: Default higher. Downgrade later if needed.
FAQ
Q: SEV0-SEV4 or SEV1-SEV5?
A: We recommend SEV0-SEV4: "zero" signals zero room for error. Either works if your definitions are specific.
Q: Can't tell if SEV1 or SEV2?
A: One question: is there a workaround? Yes = SEV2, no = SEV1. Still stuck? Default higher and downgrade later.
Q: How many levels?
A: Three under 50 people (SEV1-SEV3), four at 50-150, five (SEV0-SEV4) at 150+.
Q: Does severity change during an incident?
A: Only when the facts change. Reclassify as you learn the real impact, never because of internal pressure.
Q: Who decides?
A: Whoever responds first makes the initial call; the incident commander can reclassify. Speed beats consensus.
Generate Your Framework in 2 Minutes
If you want a copy/paste template, there's a severity matrix generator here:
/tools/incident-severity-matrix-generator
Or copy the table from this article and adapt it. Either way, have something defined before your next incident.