Your senior engineer says "give me a weekend with Claude and Cursor, I'll have a working incident bot by Monday." Copilot writes the paging logic. The escalation state machine practically builds itself.
They're right about the first version. They're wrong about the next three years.
We did back-of-napkin math on three-year total cost of ownership for a 20-person engineering team:
Build from scratch: $233K-$395K
Open source (self-host): $99K-$360K (mostly maintainer time)
Buy commercial: $11K-$83K (varies by vendor pricing model)
Sizing model (3-year):
Build_TCO = MVP + (FTE × LoadedCost × 3) + (Infra × 3) + Rebuilds
OSS_TCO = (FTE × LoadedCost × 3) + (Infra × 3) + Migrations
Buy_TCO = (Subscription × 3) + Onboarding
The bulk of this TCO is engineering time: opportunity cost, not vendor invoices. Building runs 3 to 8 times the cost of buying. Open source sits in the middle. Free to download, not free to run.
This article covers where the money actually goes, what AI tools change (and what they don't), and when building genuinely makes sense.
"You're not spinning up a bot. You're signing up to maintain a system forever."
Disclosure: Runframe builds incident management software. We've included open source options and noted when building is the right call. Found an error? Email hello@runframe.io.
60-Second Version
Under 20 people, no enterprise customers? Structured Slack workflows or incident-bot will get you started. Switch when you hit the limits.
Between 20 and 200 and scaling? Default to buying or self-hosting open source. Only build from scratch if you have real regulatory constraints or incident management is literally your product.
Over 200? You've likely outgrown basic tooling already. This article is mostly aimed at smaller teams, but the cost ratios still hold.
When I say "incident management" here, I mean the full loop: detection, paging, coordination, comms, and post-incident review. Not just "something that wakes people up."
If you just want the checklist, jump to When to Buy.
The AI Build: What Changed and What Didn't
Two years ago, a competent engineer needed 2-4 weeks to build a basic incident management system. Today, with AI coding tools, that's down to days. A weekend if you're scrappy.
AI is genuinely good at scaffolding. Slack bot setup that used to take days now takes hours. Status page templates, database schemas, escalation logic, API layers. The boilerplate disappears fast. No argument there.
But here's what AI doesn't change:
Slack retires APIs. It just does. The legacy file upload method was sunset in Nov 2025, forcing migrations to a newer upload flow. Legacy custom bots were discontinued in Mar 2025, breaking older bot-based workflows. AI can help you migrate faster, but it can't stop the deprecations from happening.
Phone and SMS paging is an ops problem, not a code problem. Carriers filter aggressively, especially internationally. Routing and deliverability are their own discipline. No prompt is going to fix that.
The engineer who leaves is still the single biggest risk. AI may have written the code, but nobody else knows the architecture decisions, the production edge cases, or why that one Slack workaround exists.
SOC2 auditors don't care that Claude wrote your audit log. They care that it's complete, immutable, and retained for the right duration. Compliance is process work, not code work.
And your incident tool needs to work at 2 AM when your infrastructure is failing. AI can't architect around your own blast radius.
The net effect: AI reduced the initial build from ~$19K-$31K (2-4 weeks) to maybe $8K-$15K (1-2 weeks) in engineer time. That saves ~$10K-$15K of Year-1 cost on a $233K-$395K three-year total. The initial build was never the expensive part.
More Code, More Incidents
Before we get into the numbers: the problem you're solving isn't standing still.
AI-assisted development pushes change velocity up for most teams. Faster velocity usually means more incidents, unless review and testing discipline keeps pace. The blast radius gets bigger when AI-generated changes don't get the same scrutiny as hand-written code. More code shipped faster means more things that can break.
The incident management tool you need in year three will almost certainly be bigger than what you need today.
The Build Illusion: Why It Seems Cheaper Than It Is
With AI coding tools, a good engineer can stand up a basic incident system in days:
- Slack bot that creates channels
- Basic status page
- Escalation logic
- Incident history in a database
Looks straightforward. Here's what teams consistently forget.
The Hidden Cost: Dedicated Engineer
Someone needs to own this. Not as a side project. As actual job responsibility.
Example (B2B SaaS running microservices on Kubernetes, ~120 engineers): A team assigned a senior engineer to their custom incident tool "for a quarter." Three years later, it's still a quarter of his time. The original Slack bot had grown to include custom escalation logic, a homegrown status page, and integrations with five internal tools nobody else knew how to maintain.
A fully loaded senior engineer in a high-cost US market (salary + benefits + overhead) often runs ~$250K-$400K/year. Adjust down ~30-50% for typical UK/EU comp.
Even at 25% allocation, that's $62K-$100K annually in opportunity cost. For one feature.
Sensitivity check: If your maintainer is 0.1 FTE instead of 0.25 FTE, subtract ~$25K-$40K/year from the build cost below. But be honest: 0.1 FTE is rarely enough once the tool is in production.
The Maintenance Tax
SREs have a name for this: the forever-project. What started as a weekend hack becomes a quarter-long effort, then a year-long commitment, then something nobody wants to touch but everyone relies on.
The first three months are fine. The engineer builds it, it works, everyone's happy. Then edge cases start appearing around month four. Slack changes its permission model, or rate limits hit during a real incident, or a new hire asks "why does it work this way?" and nobody has a good answer. The original engineer spends increasing time on support.
Somewhere between month seven and month twelve, the engineer who built it leaves or changes roles. Nobody else understands the code. The team is afraid to touch it. By year two, the tool has real technical debt, nobody wants to work on it, but everyone depends on it.
The Policy Surface Nobody Expects
Once you have an incident system, questions show up that you didn't plan for. Who can declare incidents? Who can close them? How long do you keep the records? Where's the data stored? Can you export it for an audit?
Every internal tool eventually becomes a policy surface. Building the first version is cheap. Keeping up with evolving RBAC, retention, and compliance requirements is where the real time goes.
One pattern we've seen across regulated teams: a 60-person fintech spent ~$80K of engineering time building an incident system. It worked for 18 months. Then Slack platform changes and internal security policy changes hit at the same time. The engineer who built it had left. They spent another ~$40K rewriting it. Six months later, compliance asked for audit trails and data residency. Another ~$30K.
The Reliability Paradox
During a P0, when the database is on fire, customers are angry, and your CEO is watching the Slack channel, your incident tool has to work. Without question.
But most teams host their custom incident tooling on the same infrastructure as their product. Product goes down, incident tool goes down with it. If your internal tool uses the company SSO, you're locked out of your response system the moment your identity provider is part of the outage.
Your incident tool needs different infrastructure than your app. It needs higher availability. It needs separate backups. It needs its own monitoring.
AI tools reduce initial build time. They don't fix the reliability paradox, the policy surface, or the engineer who leaves.
If You Build Anyway
A few things that catch teams off guard: Slack's permission model is more nuanced than it looks, and scoping channel access without granting overly broad permissions is tricky. Bulk operations during real incidents hit rate limits. Phone and SMS paging has deliverability issues that vendors spend years solving. And rebuilds break because nobody remembers why policy X was implemented that way. You either rebuild the wrong thing or spend weeks rediscovering context that left with the original engineer.
If you're going to build regardless, at minimum get these right:
- Separate hosting from production (different failure domain)
- Paging + escalation state machine (including acknowledgements)
- Timeline capture + export (for post-incident review and compliance)
- Audit log of key actions (declare, assign, close)
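To make the second and fourth items concrete, here's a toy sketch of a paging state machine with ack timeouts and an action log. This is an illustration under my own assumptions, not a production design; the class, states, and field names are all hypothetical, and a real system would persist state and handle retries, dedup, and multiple notification channels.

```python
from enum import Enum, auto

class PageState(Enum):
    TRIGGERED = auto()    # page sent, waiting for acknowledgement
    ESCALATED = auto()    # ack window lapsed, next responder notified
    ACKED = auto()        # someone took ownership

class Page:
    def __init__(self, responders, ack_timeout_s=300):
        self.responders = responders      # ordered escalation chain
        self.ack_timeout_s = ack_timeout_s
        self.state = PageState.TRIGGERED
        self.level = 0                    # index into the chain
        self.last_notified_at = 0.0
        self.audit = []                   # minimal audit log of key actions

    def notify(self, now):
        self.last_notified_at = now
        self.audit.append((now, "notify", self.responders[self.level]))

    def ack(self, now, who):
        if self.state is not PageState.ACKED:
            self.state = PageState.ACKED
            self.audit.append((now, "ack", who))

    def tick(self, now):
        # Escalate when the ack window lapses with nobody acknowledging
        if self.state is PageState.ACKED:
            return
        if now - self.last_notified_at >= self.ack_timeout_s:
            if self.level + 1 < len(self.responders):
                self.level += 1
                self.state = PageState.ESCALATED
                self.notify(now)
```

Even this toy version shows why ownership matters: the ack-timeout edge cases (clock skew, double-acks, exhausted chains) are exactly the code someone has to understand at 2 AM.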
The Real Cost Comparison (20-Person Company, 3-Year TCO)
Back-of-napkin estimates for a 20-person engineering team. Your specific numbers will differ, but the ratios are what matter.
Build from scratch: $233K-$395K. Self-host open source: $99K-$360K. Buy commercial: $11K-$83K.
Building typically runs 3 to 8x the cost of buying, depending on vendor tier and team size. Open source falls in between. No license fees, but the maintainer time adds up.
Where the numbers come from: Levels.fyi's 2025 report shows ~$312K median total compensation for "Senior Engineer" in the US (base + stock + bonus). We applied a standard 1.25-1.4x multiplier for employer-side costs (benefits, payroll taxes, overhead) to get the $250K-$400K fully-loaded range. Adjust down 30-50% for UK/EU. Infrastructure costs are based on AWS pricing for a 3-AZ highly available setup with separate monitoring. Rebuild risk is informed by the Slack deprecations mentioned above, plus typical security and compliance changes over a 3-year window. The ratio held across every scenario we sketched: build costs 3 to 8 times more than buying.
Assumptions: 1-2 weeks initial build time (AI-assisted), 0.25 FTE ongoing maintenance, separate infrastructure for reliability, and periodic rework every 18-24 months for API changes, compliance, and new features.
Plug in your own numbers:
Inputs:
EngCost = Fully-loaded eng cost/year (default: $300K)
BuildWeeks = Initial build time in weeks (default: 1-2)
FTE = Maintainer allocation (default: 0.25)
Vendor = Vendor $/user/month (default: $15-100; depends on pricing model)
Users = On-call responders (default: 10-15; set to 20 if everyone is a responder)
Infra = Hosting/monitoring per year (default: $5K; set to $0 if N/A)
Rebuild = Migration/rewrite allowance over 3 years (default: $30K; set to $0 if none)
Onboarding = One-time setup/training (default: $5K; set to $0 if self-serve)
Formulas:
Build cost = (EngCost / 52) × BuildWeeks
Run cost/year = EngCost × FTE
Buy cost/year = Vendor × Users × 12
Build 3-yr TCO = Build cost + (Run cost/year × 3) + (Infra × 3) + Rebuild
Buy 3-yr TCO = (Buy cost/year × 3) + Onboarding
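The formulas above are easy to turn into a throwaway calculator. A minimal sketch in Python; function names are mine, and the defaults below use the article's enterprise per-seat scenario ($70/user/month, all 20 users licensed) rather than responder-based pricing:

```python
def build_tco_3yr(eng_cost=300_000, build_weeks=1.5, fte=0.25,
                  infra=5_000, rebuild=30_000):
    """3-year build TCO: initial build + maintainer + infra + rebuild allowance."""
    initial_build = eng_cost / 52 * build_weeks   # engineer time for the MVP
    run_per_year = eng_cost * fte                 # ongoing maintainer allocation
    return initial_build + run_per_year * 3 + infra * 3 + rebuild

def buy_tco_3yr(vendor_per_user_mo=70, users=20, onboarding=5_000):
    """3-year buy TCO: subscription plus one-time onboarding."""
    return vendor_per_user_mo * users * 12 * 3 + onboarding

build = build_tco_3yr()   # ≈ $278,654
buy = buy_tco_3yr()       # $55,400
print(f"build ≈ ${build:,.0f}, buy = ${buy:,.0f}, ratio ≈ {build / buy:.1f}x")
```

Swap in your own compensation, FTE allocation, and vendor quote; the point is that the maintainer term (`eng_cost * fte * 3`) dominates everything else on the build side.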
Build (3-year TCO):
| Cost | Year 1 | Year 2 | Year 3 | Total |
|---|---|---|---|---|
| Initial build (AI-assisted, ~1-2 weeks) | $8K-$15K | $0 | $0 | $8K-$15K |
| Dedicated maintainer (25% time) | $62K-$100K | $62K-$100K | $62K-$100K | $186K-$300K |
| Infrastructure & hosting* | $3K-$10K | $3K-$10K | $3K-$10K | $9K-$30K |
| Rebuilds & migrations** | $0 | $30K-$50K | $0 | $30K-$50K |
| Total | $73K-$125K | $95K-$160K | $65K-$110K | $233K-$395K |
*Depends on HA requirements, pager/telephony, audit logging, retention, data residency.
**Triggered by Slack platform changes, org restructuring, compliance requirements, new escalation policies, or new integrations.
Buy (3-year TCO example for a 20-person company):
| Cost | Year 1 | Year 2 | Year 3 | Total |
|---|---|---|---|---|
| Responder-based pricing (10-15 users × $15-30/mo)*** | $2K-$5K | $2K-$5K | $2K-$5K | $5K-$16K |
| Enterprise per-seat pricing (20 users × $40-100/mo)*** | $10K-$24K | $10K-$24K | $10K-$24K | $29K-$72K |
| Onboarding & setup | $3K-$8K | $0 | $0 | $3K-$8K |
| Total (responder-based) | $5K-$13K | $2K-$5K | $2K-$5K | $11K-$27K |
| Total (enterprise per-seat) | $13K-$32K | $10K-$24K | $10K-$24K | $35K-$83K |
***Vendor pricing varies widely. Responder-based tools (pricing per on-call user) are typical for startups and mid-size teams. Enterprise per-seat licensing (pricing per employee) is common with PagerDuty, Opsgenie, and similar tools at higher tiers.
Open source / self-host (3-year TCO example for a 20-person company). Totals below show the same table under two maintainer assumptions (0.1 FTE optimistic vs 0.25 FTE typical):
| Cost | Year 1 | Year 2 | Year 3 | Total |
|---|---|---|---|---|
| Dedicated maintainer (0.1-0.25 FTE) | $25K-$100K | $25K-$100K | $25K-$100K | $75K-$300K |
| Infrastructure & hosting* | $3K-$10K | $3K-$10K | $3K-$10K | $9K-$30K |
| Upgrades & migrations** | $0 | $15K-$30K | $0 | $15K-$30K |
| Total (0.1 FTE) | $28K-$50K | $43K-$80K | $28K-$50K | $99K-$180K |
| Total (0.25 FTE typical) | $65K-$110K | $80K-$140K | $65K-$110K | $210K-$360K |
0.1 FTE is optimistic (works if you're deploying a mature tool with minimal customization). 0.25 FTE is typical once you're running it in production with Slack integrations and on-call routing.
*Depends on HA requirements, audit logging/retention, and whether you run paging/telephony yourself.
**Common triggers: Slack API changes, auth/security model changes, major version upgrades, or compliance asks (RBAC/audit/retention).
Sensitivity check: Your numbers will differ based on location, team structure, and requirements. If your maintainer is 0.1 FTE instead of 0.25 FTE, subtract ~$25K-$40K/year from build costs. If you avoid rebuilds entirely, subtract another $30K-$50K. The gap narrows but rarely closes — under typical assumptions, build costs 3-7x more than buy over three years.
The gap is wider than most teams think. And the build model assumes nothing catastrophic happens: no security incidents, no key engineer departures, no rewrites beyond the budgeted migrations.
How the Numbers Change
Most arguments about build vs buy come down to two variables: how much time the maintainer actually spends, and how the vendor prices seats.
If you're optimistic and assume 0.1 FTE with no rebuilds, build drops to ~$92K-$165K over 3 years. That narrows the gap with buying considerably. But 0.1 FTE rarely holds once the tool is in production and people start requesting features.
Under typical assumptions (0.25 FTE, one rebuild or migration event, normal Slack and compliance churn), build and self-host run 3-8x the buy-side cost.
The one scenario where buying looks less attractive: if your vendor prices per employee rather than per responder, and you're forced into a higher enterprise tier. In that case, self-hosting can be rational, but only if you can name an owner and accept the upgrade burden.
The Open Source Path
Open source is a legitimate option if you want to avoid both building from zero and paying license fees. But the landscape shrank considerably in 2025.
Netflix archived Dispatch in September 2025. It was the most production-ready self-hosted option for years. It's read-only forever now. Netflix had hundreds of engineers maintaining it and still walked away.
Grafana closed-sourced OnCall. The OSS version entered maintenance mode in March 2025 and is scheduled to be fully archived on 2026-03-24. Cloud connection, SMS, phone, and push notifications all stop working after that date. Grafana consolidated everything into a closed-source Cloud IRM product.
Two of the biggest names in open source incident management either archived or closed-sourced their tools in the same twelve-month window. That's the context for what follows.
What's Actually Left
Incidental has Slack integration and status pages, with a hosted option at incidental.dev. It's the most capable truly open source option remaining, though it's still early-stage (v0.1.0).
incident-bot (docs) is Slack-based, self-hostable, Python/PostgreSQL. Integrates with PagerDuty, Jira, Confluence, Statuspage, GitLab, and Zoom. Smaller project, limited on compliance and RBAC out of the box.
Both are MIT licensed. Both are small projects compared to what Dispatch and Grafana OnCall were.
Also worth knowing: IncidentFox is an AI-powered SRE platform. The core is Apache 2.0, but the production security layer (sandbox isolation, credential injection) is BSL 1.1, meaning production use of those components requires a commercial license. Read the LICENSING.md before deploying.
The tradeoff with open source is straightforward. You eliminate licensing cost but not maintenance cost. Someone still owns upgrades, security patches, Slack API changes, and the 2 AM call when it breaks. Budget 0.1-0.25 FTE and treat it like a vendor relationship, not a one-time install.
The Hybrid Approach
In practice, few teams go fully build or fully buy. What works best for most is buying or self-hosting the core workflow (alerting, escalation, timeline) and building custom integrations on top. That gets you 80% of the value at 20% of the maintenance burden. This is where AI coding tools genuinely earn their keep: writing glue code between your incident tool and internal systems, not building the core tool itself.
If you go the self-host route with Incidental or incident-bot, treat it like a vendor relationship. Dedicate an owner, budget for regular upgrades, plan for Slack API changes. "It's free" doesn't mean "it's free of work."
And if you're small enough that none of this feels urgent yet, start with a structured Slack workflow and switch when you hit the triggers in the checklist below. Don't prematurely optimize, and don't wait until you're drowning.
Four Questions to Answer Honestly
Before you commit either way, answer these honestly:
Can you name the person who will own this for the next two years? Not "the team" or "we'll rotate it." A specific person with time allocated. If the answer is "we'll figure it out," you should buy.
What happens when that person leaves? If the code is well-documented, tested, and multiple people understand it, you're probably fine. If it's one person's project that nobody else has touched, you're building a liability.
Is your incident tool on separate infrastructure from your product? Because if it shares the same database, the same deploy pipeline, the same SSO — it goes down when your product goes down. Most teams that build in-house make this mistake, and it only becomes obvious during a real P0.
What else could your engineers be working on? A senior engineer spending 25% of their time on an internal incident tool is a senior engineer not spending 25% of their time on your product. At $62K-$100K/year in opportunity cost, that's a real number.
Decision Checklist: When to Buy
Triggers that suggest you're ready for a dedicated incident management platform:
- On-call rotation involves ≥8 people
- You're handling ≥4 incidents per month
- ≥3 teams are regularly involved in incident response
- You have customer-facing SLAs or enterprise customers asking about incident processes
- Compliance requirements exist (audit logs, retention, RBAC)
- You need stakeholder updates within 10-15 minutes, reliably
- Your current ad-hoc system failed during a real incident
If 3+ apply, you're in buy territory.
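Scoring the checklist is mechanical. A throwaway sketch, with flag names of my own invention mirroring the triggers above and example values for illustration:

```python
# Hypothetical trigger flags; set each from your own situation
triggers = {
    "oncall_rotation_ge_8": True,
    "incidents_per_month_ge_4": True,
    "teams_involved_ge_3": False,
    "slas_or_enterprise_customers": True,
    "compliance_requirements": False,
    "stakeholder_updates_within_15min": False,
    "adhoc_system_failed_during_incident": False,
}

# 3 or more triggers puts you in buy territory
should_buy = sum(triggers.values()) >= 3
print("buy" if should_buy else "keep it lightweight for now")
```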
When Building Actually Makes Sense
I want to be fair here. There are teams where building is genuinely the right call.
If you have regulatory constraints that no vendor can meet (specific data residency requirements, mandated audit log formats, custom approval workflows tied to proprietary systems), building makes sense. If you're 500+ engineers with complex multi-team incident processes, off-the-shelf tools may not fit, though at that scale you have a team dedicated to internal tooling anyway.
Sometimes building is educational, and that's fine too. Just be honest that it's a learning project, not a production system, and budget for the eventual rewrite.
When it actually works
An 80-person fintech we've talked to had to build because they needed EU data residency for specific customers (specific region, specific provider), custom approval workflows for production access tied to their fraud detection system, audit log formats mandated by regulators that weren't standard JSON, and integrations with internal systems no vendor supported.
Three years later, it's still maintained by 0.3 FTE of an SRE. Total cost was ~$250K-$300K over 3 years, versus maybe $200K-$270K if they'd bought and built all the custom integrations on top. They'd build again, because their requirements stayed genuinely unique.
The key word is "genuinely." Their requirements were regulatory constraints, not preferences. Most teams think they're unique. Few actually are.
Why Most Teams Should Buy
For teams between 20 and 200, buying is almost always the better move. Not because building can't be done (it clearly can) but because the economics don't justify it.
Your custom tool doesn't evolve unless you invest in it. Paid tools ship new features based on what hundreds of teams need. When Slack changes its API, vendors ship updates within weeks because it's their business. You don't own the maintenance, the security patches, or the upgrade cycles.
There's also the exit option. If you build something custom and hate it, you're stuck with it. If you buy and it doesn't work out, you switch. That flexibility is worth more than most teams realize.
And the reliability argument is simple: dedicated incident management vendors have higher uptime requirements than your startup does. Their whole business is being available when your stuff is broken.
What to Buy First
Don't try to solve everything at once. Start with paging and escalation that reliably works on phone and SMS, timeline capture so you have a record of what happened, and comms templates for stakeholder updates. That's day one.
Within six months, add a status page, basic analytics (MTTR, incident frequency), and a post-incident review workflow. Everything else — advanced reporting, custom integrations, SLA tracking — can wait until you know what you actually need.
Where AI Actually Helps
The highest-value use of AI in incident management isn't building the tool itself. It's features within the tool: auto-generated postmortem drafts, smart alert grouping, runbook suggestions. Apply AI where it saves time during and after incidents, not on maintaining the infrastructure underneath.
Migration: What Actually Breaks
If you're migrating from a custom build to a commercial tool, expect three kinds of friction.
Incident ID schemes don't map cleanly. Your custom tool used INC-2024-001, the new tool uses #1234, and now every cross-reference in Jira, docs, and Slack is broken. Team habits reset too — muscle memory around commands, templates, and workflows takes 2-4 weeks to retrain, and the first few weeks feel slower, not faster. And historical metrics become discontinuous when you switch tools mid-year, which makes year-over-year MTTR comparisons messy.
None of these are dealbreakers. But budget 2-4 weeks for the transition and expect a productivity dip.
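For the ID-scheme friction specifically, a one-off rewrite script over your docs and ticket exports takes most of the sting out. A sketch; the mapping table and both ID formats here are hypothetical, and unknown legacy IDs are deliberately left untouched rather than guessed:

```python
import re

# Hypothetical mapping exported from the old tool: legacy ID -> new ID
ID_MAP = {"INC-2024-001": "#1234", "INC-2024-002": "#1235"}

def remap_ids(text):
    # Replace legacy incident IDs where a mapping exists;
    # preserve any ID we can't map so nothing is silently lost
    return re.sub(r"INC-\d{4}-\d{3}",
                  lambda m: ID_MAP.get(m.group(0), m.group(0)), text)

print(remap_ids("Root cause tracked in INC-2024-001, related to INC-2023-099"))
```

Running something like this across Jira descriptions, runbooks, and pinned Slack messages is an afternoon of work, and it is exactly the kind of glue code AI assistants are good at generating.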
The Bottom Line
Building has never been easier. That's exactly the trap.
AI tools compress the initial build from weeks to days. But the initial build was never where the money went. Maintenance, reliability, compliance, and the person who owns it. That's the real cost, and AI doesn't touch any of it.
The question worth asking isn't "can we build this?" It's "do we want to own this for the next three years?"
If incident management is core to your business and you have dedicated ownership and separate infrastructure, build. If you want genuinely open source, Incidental and incident-bot are MIT licensed and real options, though you're trading licensing cost for maintenance cost. If you're a 20-200 person team that wants something that works without dedicating engineering time to maintain it, buy. The market is moving toward Slack-first coordination and responder-based pricing; PagerDuty still wins in mature enterprises but is often overkill for teams under 200.
Most teams end up somewhere in between: buy or self-host the core, build the custom parts with AI. That's usually the right answer.