Your senior engineer says "give me a weekend with Claude and Cursor, I'll have a working incident bot by Monday." Copilot writes the paging logic. The escalation state machine practically builds itself.
They're right about the first version. They're wrong about the next three years.
We did back-of-napkin math on three-year total cost of ownership for a 20-person engineering team:
Build from scratch: $233K-$395K
Open source (self-host): $99K-$360K (mostly maintainer time)
Buy commercial: $11K-$83K (varies by vendor pricing model)
Sizing model (3-year):
Build_TCO = MVP + (FTE × LoadedCost × 3) + (Infra × 3) + Rebuilds
OSS_TCO = (FTE × LoadedCost × 3) + (Infra × 3) + Migrations
Buy_TCO = (Subscription × 3) + Onboarding
The bulk of this TCO is engineering time: opportunity cost, not vendor invoices. Building runs 3 to 8 times the cost of buying. Open source sits in the middle. Free to download, not free to run.
This article covers where the money actually goes, what AI tools change (and what they don't), and when building genuinely makes sense.
"You're not spinning up a bot. You're signing up to maintain a system forever."
Disclosure: Runframe builds incident management software. We've included open source options and noted when building is the right call. Found an error? Email hello@runframe.io.
60-Second Version
Under 20 people, no enterprise customers? Structured Slack workflows or incident-bot will get you started. Switch when you hit the limits.
Between 20 and 200 and scaling? Default to buying or self-hosting open source. Only build from scratch if you have real regulatory constraints or incident management is literally your product.
Over 200? You've likely outgrown basic tooling already. This article is mostly aimed at smaller teams, but the cost ratios still hold.
When I say "incident management" here, I mean the full loop: detection, paging, coordination, comms, and post-incident review. Not just "something that wakes people up."
If you just want the checklist, jump to When to Buy.
The AI Build: What Changed and What Didn't
Two years ago, a competent engineer needed 2-4 weeks to build a basic incident management system. Today, with AI coding tools, that's down to days. A weekend if you're scrappy.
AI is genuinely good at scaffolding. Slack bot setup that used to take days now takes hours. Status page templates, database schemas, escalation logic, API layers. The boilerplate disappears fast. No argument there.
But here's what AI doesn't change:
Slack retires APIs. It just does. The legacy file upload method was sunset in Nov 2025, forcing migrations to a newer upload flow. Legacy custom bots were discontinued in Mar 2025, breaking older bot-based workflows. AI can help you migrate faster, but it can't stop the deprecations from happening.
Phone and SMS paging is an ops problem, not a code problem. Carriers filter aggressively, especially internationally. Routing and deliverability are their own discipline. No prompt is going to fix that.
The engineer who leaves is still the single biggest risk. AI may have written the code, but nobody else knows the architecture decisions, the production edge cases, or why that one Slack workaround exists.
SOC2 auditors don't care that Claude wrote your audit log. They care that it's complete, immutable, and retained for the right duration. Compliance is process work, not code work.
And your incident tool needs to work at 2 AM when your infrastructure is failing. AI can't architect around your own blast radius.
The net effect: AI reduced the initial build from ~$19K-$31K (2-4 weeks) to maybe $8K-$15K (1-2 weeks) in engineer time. That saves ~$10K-$15K of Year-1 cost on a $233K-$395K three-year total. The initial build was never the expensive part.
More Code, More Incidents
Before we get into the numbers: the problem you're solving isn't standing still.
AI-assisted development pushes change velocity up for most teams. Faster velocity usually means more incidents, unless review and testing discipline keeps pace. The blast radius gets bigger when AI-generated changes don't get the same scrutiny as hand-written code. More code shipped faster means more things that can break.
The incident management tool you need in year three will almost certainly be bigger than what you need today.
The Build Illusion: Why It Seems Cheaper Than It Is
With AI coding tools, a good engineer can stand up a basic incident system in days:
- Slack bot that creates channels
- Basic status page
- Escalation logic
- Incident history in a database
Looks straightforward. Here's what teams consistently forget.
The Hidden Cost: Dedicated Engineer
Someone needs to own this. Not as a side project. As actual job responsibility.
Example (B2B SaaS running microservices on Kubernetes, ~120 engineers): A team assigned a senior engineer to their custom incident tool "for a quarter." Three years later, it's still a quarter of his time. The original Slack bot had grown to include custom escalation logic, a homegrown status page, and integrations with five internal tools nobody else knew how to maintain.
A fully loaded senior engineer in a high-cost US market (salary + benefits + overhead) often runs ~$250K-$400K/year. Adjust down ~30-50% for typical UK/EU comp.
Even at 25% allocation, that's $62K-$100K annually in opportunity cost. For one feature.
Sensitivity check: If your maintainer is 0.1 FTE instead of 0.25 FTE, subtract ~$25K-$40K/year from the build cost below. But be honest: 0.1 FTE is rarely enough once the tool is in production.
The Maintenance Tax
SREs have a name for this: the forever-project. What started as a weekend hack becomes a quarter-long effort, then a year-long commitment, then something nobody wants to touch but everyone relies on.
The first three months are fine. The engineer builds it, it works, everyone's happy. Then edge cases start appearing around month four. Slack changes its permission model, or rate limits hit during a real incident, or a new hire asks "why does it work this way?" and nobody has a good answer. The original engineer spends increasing time on support.
Somewhere between month seven and month twelve, the engineer who built it leaves or changes roles. Nobody else understands the code. The team is afraid to touch it. By year two, the tool has real technical debt, nobody wants to work on it, but everyone depends on it.
The Policy Surface Nobody Expects
Once you have an incident system, questions show up that you didn't plan for. Who can declare incidents? Who can close them? How long do you keep the records? Where's the data stored? Can you export it for an audit?
Every internal tool eventually becomes a policy surface. Building the first version is cheap. Keeping up with evolving RBAC, retention, and compliance requirements is where the real time goes.
One pattern we've seen across regulated teams: a 60-person fintech spent ~$80K of engineering time building an incident system. It worked for 18 months. Then Slack platform changes and internal security policy changes hit at the same time. The engineer who built it had left. They spent another ~$40K rewriting it. Six months later, compliance asked for audit trails and data residency. Another ~$30K.
The Reliability Paradox
During a P0, when the database is on fire, customers are angry, and your CEO is watching the Slack channel, your incident tool has to work. Without question.
But most teams host their custom incident tooling on the same infrastructure as their product. Product goes down, incident tool goes down with it. If your internal tool uses the company SSO, you're locked out of your response system the moment your identity provider is part of the outage.
Your incident tool needs different infrastructure than your app. It needs higher availability. It needs separate backups. It needs its own monitoring.
AI tools reduce initial build time. They don't fix the reliability paradox, the policy surface, or the engineer who leaves.
If You Build Anyway
A few things that catch teams off guard: Slack's permission model is more nuanced than it looks, and scoping channel access without granting overly broad permissions is tricky. Bulk operations during real incidents hit rate limits. Phone and SMS paging has deliverability issues that vendors spend years solving. And rebuilds break because nobody remembers why policy X was implemented that way. You either rebuild the wrong thing or spend weeks rediscovering context that left with the original engineer.
If you're going to build regardless, at minimum get these right:
- Separate hosting from production (different failure domain)
- Paging + escalation state machine (including acknowledgements)
- Timeline capture + export (for post-incident review and compliance)
- Audit log of key actions (declare, assign, close)
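To make the second and fourth items concrete, here's a toy sketch of a paging state machine with ack timeouts and an action log. This is an illustration under my own assumptions, not a production design; the class, states, and field names are all hypothetical, and a real system would persist state and handle retries, dedup, and multiple notification channels.

```python
from enum import Enum, auto

class PageState(Enum):
    TRIGGERED = auto()    # page sent, waiting for acknowledgement
    ESCALATED = auto()    # ack window lapsed, next responder notified
    ACKED = auto()        # someone took ownership

class Page:
    def __init__(self, responders, ack_timeout_s=300):
        self.responders = responders      # ordered escalation chain
        self.ack_timeout_s = ack_timeout_s
        self.state = PageState.TRIGGERED
        self.level = 0                    # index into the chain
        self.last_notified_at = 0.0
        self.audit = []                   # minimal audit log of key actions

    def notify(self, now):
        self.last_notified_at = now
        self.audit.append((now, "notify", self.responders[self.level]))

    def ack(self, now, who):
        if self.state is not PageState.ACKED:
            self.state = PageState.ACKED
            self.audit.append((now, "ack", who))

    def tick(self, now):
        # Escalate when the ack window lapses with nobody acknowledging
        if self.state is PageState.ACKED:
            return
        if now - self.last_notified_at >= self.ack_timeout_s:
            if self.level + 1 < len(self.responders):
                self.level += 1
                self.state = PageState.ESCALATED
                self.notify(now)
```

Even this toy version shows why ownership matters: the ack-timeout edge cases (clock skew, double-acks, exhausted chains) are exactly the code someone has to understand at 2 AM.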
The Real Cost Comparison (20-Person Company, 3-Year TCO)
Back-of-napkin estimates for a 20-person engineering team. Your specific numbers will differ, but the ratios are what matter.
Build from scratch: $233K-$395K. Self-host open source: $99K-$360K. Buy commercial: $11K-$83K.
Building typically runs 3 to 8x the cost of buying, depending on vendor tier and team size. Open source falls in between. No license fees, but the maintainer time adds up.
Where the numbers come from: Levels.fyi's 2025 report shows ~$312K median total compensation for "Senior Engineer" in the US (base + stock + bonus). We applied a standard 1.25-1.4x multiplier for employer-side costs (benefits, payroll taxes, overhead) to get the $250K-$400K fully-loaded range. Adjust down 30-50% for UK/EU. Infrastructure costs are based on AWS pricing for a 3-AZ highly available setup with separate monitoring. Rebuild risk is informed by the Slack deprecations mentioned above, plus typical security and compliance changes over a 3-year window. The ratio held across every scenario we sketched: build costs 3 to 8 times more than buying.
Assumptions: 1-2 weeks initial build time (AI-assisted), 0.25 FTE ongoing maintenance, separate infrastructure for reliability, and periodic rework every 18-24 months for API changes, compliance, and new features.
Plug in your own numbers:
Inputs:
EngCost = Fully-loaded eng cost/year (default: $300K)
BuildWeeks = Initial build time in weeks (default: 1-2)
FTE = Maintainer allocation (default: 0.25)
Vendor = Vendor $/user/month (default: $15-100; depends on pricing model)
Users = On-call responders (default: 10-15; set to 20 if everyone is a responder)
Infra = Hosting/monitoring per year (default: $5K; set to $0 if N/A)
Rebuild = Migration/rewrite allowance over 3 years (default: $30K; set to $0 if none)
Onboarding = One-time setup/training (default: $5K; set to $0 if self-serve)
Formulas:
Build cost = (EngCost / 52) × BuildWeeks
Run cost/year = EngCost × FTE
Buy cost/year = Vendor × Users × 12
Build 3-yr TCO = Build cost + (Run cost/year × 3) + (Infra × 3) + Rebuild
Buy 3-yr TCO = (Buy cost/year × 3) + Onboarding
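The formulas above are easy to turn into a throwaway calculator. A minimal sketch in Python; function names are mine, and the defaults below use the article's enterprise per-seat scenario ($70/user/month, all 20 users licensed) rather than responder-based pricing:

```python
def build_tco_3yr(eng_cost=300_000, build_weeks=1.5, fte=0.25,
                  infra=5_000, rebuild=30_000):
    """3-year build TCO: initial build + maintainer + infra + rebuild allowance."""
    initial_build = eng_cost / 52 * build_weeks   # engineer time for the MVP
    run_per_year = eng_cost * fte                 # ongoing maintainer allocation
    return initial_build + run_per_year * 3 + infra * 3 + rebuild

def buy_tco_3yr(vendor_per_user_mo=70, users=20, onboarding=5_000):
    """3-year buy TCO: subscription plus one-time onboarding."""
    return vendor_per_user_mo * users * 12 * 3 + onboarding

build = build_tco_3yr()   # ≈ $278,654
buy = buy_tco_3yr()       # $55,400
print(f"build ≈ ${build:,.0f}, buy = ${buy:,.0f}, ratio ≈ {build / buy:.1f}x")
```

Swap in your own compensation, FTE allocation, and vendor quote; the point is that the maintainer term (`eng_cost * fte * 3`) dominates everything else on the build side.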
Build (3-year TCO):
| Cost | Year 1 | Year 2 | Year 3 | Total |
|---|---|---|---|---|
| Initial build (AI-assisted, ~1-2 weeks) | $8K-$15K | $0 | $0 | $8K-$15K |
| Dedicated maintainer (25% time) | $62K-$100K | $62K-$100K | $62K-$100K | $186K-$300K |
| Infrastructure & hosting* | $3K-$10K | $3K-$10K | $3K-$10K | $9K-$30K |
| Rebuilds & migrations** | $0 | $30K-$50K | $0 | $30K-$50K |
| Total | $73K-$125K | $95K-$160K | $65K-$110K | $233K-$395K |
*Depends on HA requirements, pager/telephony, audit logging, retention, data residency.
**Triggered by Slack platform changes, org restructuring, compliance requirements, new escalation policies, or new integrations.
Buy (3-year TCO example for a 20-person company):
| Cost | Year 1 | Year 2 | Year 3 | Total |
|---|---|---|---|---|
| Responder-based pricing (10-15 users × $15-30/mo)*** | $2K-$5K | $2K-$5K | $2K-$5K | $5K-$16K |
| Enterprise per-seat pricing (20 users × $40-100/mo)*** | $10K-$24K | $10K-$24K | $10K-$24K | $29K-$72K |
| Onboarding & setup | $3K-$8K | $0 | $0 | $3K-$8K |
| Total (responder-based) | $5K-$13K | $2K-$5K | $2K-$5K | $11K-$27K |
| Total (enterprise per-seat) | $13K-$32K | $10K-$24K | $10K-$24K | $35K-$83K |
***Vendor pricing varies widely. Responder-based tools (pricing per on-call user) are typical for startups and mid-size teams. Enterprise per-seat licensing (pricing per employee) is common with PagerDuty, Opsgenie, and similar tools at higher tiers.
Open source / self-host (3-year TCO example for a 20-person company). Totals below show the same table under two maintainer assumptions (0.1 FTE optimistic vs 0.25 FTE typical):
| Cost | Year 1 | Year 2 | Year 3 | Total |
|---|---|---|---|---|
| Dedicated maintainer (0.1-0.25 FTE) | $25K-$100K | $25K-$100K | $25K-$100K | $75K-$300K |
| Infrastructure & hosting* | $3K-$10K | $3K-$10K | $3K-$10K | $9K-$30K |
| Upgrades & migrations** | $0 | $15K-$30K | $0 | $15K-$30K |
| Total (0.1 FTE) | $28K-$50K | $43K-$80K | $28K-$50K | $99K-$180K |
| Total (0.25 FTE typical) | $65K-$110K | $80K-$140K | $65K-$110K | $210K-$360K |
0.1 FTE is optimistic (works if you're deploying a mature tool with minimal customization). 0.25 FTE is typical once you're running it in production with Slack integrations and on-call routing.
*Depends on HA requirements, audit logging/retention, and whether you run paging/telephony yourself.
**Common triggers: Slack API changes, auth/security model changes, major version upgrades, or compliance asks (RBAC/audit/retention).
Sensitivity check: Your numbers will differ based on location, team structure, and requirements. If your maintainer is 0.1 FTE instead of 0.25 FTE, subtract ~$25K-$40K/year from build costs. If you avoid rebuilds entirely, subtract another $30K-$50K. The gap narrows but rarely closes — under typical assumptions, build costs 3-7x more than buy over three years.
The gap is wider than most teams think. And the build model assumes nothing catastrophic happens: no security incidents, no key engineer departures, no rewrites beyond the budgeted migrations.
How the Numbers Change
Most arguments about build vs buy come down to two variables: how much time the maintainer actually spends, and how the vendor prices seats.
If you're optimistic and assume 0.1 FTE with no rebuilds, build drops to ~$92K-$165K over 3 years. That narrows the gap with buying considerably. But 0.1 FTE rarely holds once the tool is in production and people start requesting features.
Under typical assumptions (0.25 FTE, one rebuild or migration event, normal Slack and compliance churn), build and self-host run 3-8x the buy-side cost.
The one scenario where buying looks less attractive: if your vendor prices per employee rather than per responder, and you're forced into a higher enterprise tier. In that case, self-hosting can be rational, but only if you can name an owner and accept the upgrade burden.
The Open Source Path
Open source is a legitimate option if you want to avoid both building from zero and paying license fees. But the landscape shrank considerably in 2025.
Netflix archived Dispatch in September 2025. It was the most production-ready self-hosted option for years. It's read-only forever now. Netflix had hundreds of engineers maintaining it and still walked away.
Grafana closed-sourced OnCall. The OSS version entered maintenance mode in March 2025 and is scheduled to be fully archived on 2026-03-24. Cloud connection, SMS, phone, and push notifications all stop working after that date. Grafana consolidated everything into a closed-source Cloud IRM product.
Two of the biggest names in open source incident management either archived or closed-sourced their tools in the same twelve-month window. That's the context for what follows.
What's Actually Left
Incidental has Slack integration and status pages, with a hosted option at incidental.dev. It's the most capable truly open source option remaining, though it's still early-stage (v0.1.0).
incident-bot (docs) is Slack-based, self-hostable, Python/PostgreSQL. Integrates with PagerDuty, Jira, Confluence, Statuspage, GitLab, and Zoom. Smaller project, limited on compliance and RBAC out of the box.
Both are MIT licensed. Both are small projects compared to what Dispatch and Grafana OnCall were.
Also worth knowing: IncidentFox is an AI-powered SRE platform. The core is Apache 2.0, but the production security layer (sandbox isolation, credential injection) is BSL 1.1, meaning production use of those components requires a commercial license. Read the LICENSING.md before deploying.
The tradeoff with open source is straightforward. You eliminate licensing cost but not maintenance cost. Someone still owns upgrades, security patches, Slack API changes, and the 2 AM call when it breaks. Budget 0.1-0.25 FTE and treat it like a vendor relationship, not a one-time install.
The Hybrid Approach
In practice, few teams go fully build or fully buy. What works best for most is buying or self-hosting the core workflow (alerting, escalation, timeline) and building custom integrations on top. That gets you 80% of the value at 20% of the maintenance burden. This is where AI coding tools genuinely earn their keep: writing glue code between your incident tool and internal systems, not building the core tool itself.
If you go the self-host route with Incidental or incident-bot, treat it like a vendor relationship. Dedicate an owner, budget for regular upgrades, plan for Slack API changes. "It's free" doesn't mean "it's free of work."
And if you're small enough that none of this feels urgent yet, start with a structured Slack workflow and switch when you hit the triggers in the checklist below. Don't prematurely optimize, and don't wait until you're drowning.
Four Questions to Answer Honestly
Before you commit either way, answer these honestly:
Can you name the person who will own this for the next two years? Not "the team" or "we'll rotate it." A specific person with time allocated. If the answer is "we'll figure it out," you should buy.
What happens when that person leaves? If the code is well-documented, tested, and multiple people understand it, you're probably fine. If it's one person's project that nobody else has touched, you're building a liability.
Is your incident tool on separate infrastructure from your product? Because if it shares the same database, the same deploy pipeline, the same SSO — it goes down when your product goes down. Most teams that build in-house make this mistake, and it only becomes obvious during a real P0.
What else could your engineers be working on? A senior engineer spending 25% of their time on an internal incident tool is a senior engineer not spending 25% of their time on your product. At $62K-$100K/year in opportunity cost, that's a real number.
Decision Checklist: When to Buy
Triggers that suggest you're ready for a dedicated incident management platform:
- On-call rotation involves ≥8 people
- You're handling ≥4 incidents per month
- ≥3 teams are regularly involved in incident response
- You have customer-facing SLAs or enterprise customers asking about incident processes
- Compliance requirements exist (audit logs, retention, RBAC)
- You need stakeholder updates within 10-15 minutes, reliably
- Your current ad-hoc system failed during a real incident
If 3+ apply, you're in buy territory.
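Scoring the checklist is mechanical. A throwaway sketch, with flag names of my own invention mirroring the triggers above and example values for illustration:

```python
# Hypothetical trigger flags; set each from your own situation
triggers = {
    "oncall_rotation_ge_8": True,
    "incidents_per_month_ge_4": True,
    "teams_involved_ge_3": False,
    "slas_or_enterprise_customers": True,
    "compliance_requirements": False,
    "stakeholder_updates_within_15min": False,
    "adhoc_system_failed_during_incident": False,
}

# 3 or more triggers puts you in buy territory
should_buy = sum(triggers.values()) >= 3
print("buy" if should_buy else "keep it lightweight for now")
```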
When Building Actually Makes Sense
I want to be fair here. There are teams where building is genuinely the right call.
If you have regulatory constraints that no vendor can meet (specific data residency requirements, mandated audit log formats, custom approval workflows tied to proprietary systems), building makes sense. If you're 500+ engineers with complex multi-team incident processes, off-the-shelf tools may not fit, though at that scale you have a team dedicated to internal tooling anyway.
Sometimes building is educational, and that's fine too. Just be honest that it's a learning project, not a production system, and budget for the eventual rewrite.
When it actually works
An 80-person fintech we've talked to had to build because they needed EU data residency for specific customers (specific region, specific provider), custom approval workflows for production access tied to their fraud detection system, audit log formats mandated by regulators that weren't standard JSON, and integrations with internal systems no vendor supported.
Three years later, it's still maintained by 0.3 FTE of an SRE. Total cost was ~$250K-$300K over 3 years, versus maybe $200K-$270K if they'd bought and built all the custom integrations on top. They'd build again, because their requirements stayed genuinely unique.
The key word is "genuinely." Their requirements were regulatory constraints, not preferences. Most teams think they're unique. Few actually are.
Why Most Teams Should Buy
For teams between 20 and 200, buying is almost always the better move. Not because building can't be done (it clearly can) but because the economics don't justify it.
Your custom tool doesn't evolve unless you invest in it. Paid tools ship new features based on what hundreds of teams need. When Slack changes its API, vendors ship updates within weeks because it's their business. You don't own the maintenance, the security patches, or the upgrade cycles.
There's also the exit option. If you build something custom and hate it, you're stuck with it. If you buy and it doesn't work out, you switch. That flexibility is worth more than most teams realize.
And the reliability argument is simple: dedicated incident management vendors have higher uptime requirements than your startup does. Their whole business is being available when your stuff is broken.
What to Buy First
Don't try to solve everything at once. Start with paging and escalation that reliably works on phone and SMS, timeline capture so you have a record of what happened, and comms templates for stakeholder updates. That's day one.
Within six months, add a status page, basic analytics (MTTR, incident frequency), and a post-incident review workflow. Everything else — advanced reporting, custom integrations, SLA tracking — can wait until you know what you actually need.
Where AI Actually Helps
The highest-value use of AI in incident management isn't building the tool itself. It's features within the tool: auto-generated postmortem drafts, smart alert grouping, runbook suggestions. Apply AI where it saves time during and after incidents, not on maintaining the infrastructure underneath.
Migration: What Actually Breaks
If you're migrating from a custom build to a commercial tool, expect three kinds of friction.
Incident ID schemes don't map cleanly. Your custom tool used INC-2024-001, the new tool uses #1234, and now every cross-reference in Jira, docs, and Slack is broken. Team habits reset too — muscle memory around commands, templates, and workflows takes 2-4 weeks to retrain, and the first few weeks feel slower, not faster. And historical metrics become discontinuous when you switch tools mid-year, which makes year-over-year MTTR comparisons messy.
None of these are dealbreakers. But budget 2-4 weeks for the transition and expect a productivity dip.
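For the ID-scheme friction specifically, a one-off rewrite script over your docs and ticket exports takes most of the sting out. A sketch; the mapping table and both ID formats here are hypothetical, and unknown legacy IDs are deliberately left untouched rather than guessed:

```python
import re

# Hypothetical mapping exported from the old tool: legacy ID -> new ID
ID_MAP = {"INC-2024-001": "#1234", "INC-2024-002": "#1235"}

def remap_ids(text):
    # Replace legacy incident IDs where a mapping exists;
    # preserve any ID we can't map so nothing is silently lost
    return re.sub(r"INC-\d{4}-\d{3}",
                  lambda m: ID_MAP.get(m.group(0), m.group(0)), text)

print(remap_ids("Root cause tracked in INC-2024-001, related to INC-2023-099"))
```

Running something like this across Jira descriptions, runbooks, and pinned Slack messages is an afternoon of work, and it is exactly the kind of glue code AI assistants are good at generating.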
The Bottom Line
Building has never been easier. That's exactly the trap.
AI tools compress the initial build from weeks to days. But the initial build was never where the money went. Maintenance, reliability, compliance, and the person who owns it. That's the real cost, and AI doesn't touch any of it.
The question worth asking isn't "can we build this?" It's "do we want to own this for the next three years?"
If incident management is core to your business and you have dedicated ownership and separate infrastructure, build. If you want genuinely open source, Incidental and incident-bot are MIT licensed and real options, though you're trading licensing cost for maintenance cost. If you're a 20-200 person team that wants something that works without dedicating engineering time to maintain it, buy. The market is moving toward Slack-first coordination and responder-based pricing; PagerDuty still wins in mature enterprises but is often overkill for teams under 200.
Most teams end up somewhere in between: buy or self-host the core, build the custom parts with AI. That's usually the right answer.