ai-agents · mcp · incident-management

Your AI agent already knows your system better than ours ever will

Every incident management vendor is building their own AI. We think that's backwards. Your agent already has the context. It just needs an API to act on incidents.

Niketa Sharma · Mar 28, 2026 · 8 min read

Every incident management vendor just shipped an AI agent. PagerDuty has one. incident.io has one. Even Linear just announced that agents are their entire future.

The pitch is always the same: "Our AI understands your incidents."

Here's the problem. Their AI doesn't know your codebase. It doesn't know that your payments service was rewritten last month, or that deploy #4,271 changed the retry logic, or that the last three outages were all caused by the same Redis connection pool. Their AI reads your incident titles and severity levels. That's it.

Your agent, the one running in Cursor or Claude Code or your custom pipeline, already knows all of that. It's read your code. It's seen your commits. It's helped you debug at 2 AM.

It just can't create an incident, page someone, or update a timeline. That's not an AI problem. That's an API problem.

The captive agent trap

Here's what's happening across the industry right now.

PagerDuty builds an AI agent that lives inside PagerDuty. It can summarize incidents and suggest runbooks, but only PagerDuty incidents, only PagerDuty runbooks. It doesn't know your deploy pipeline or your architecture.

incident.io builds a copilot that helps during incidents. It's useful inside their product. But it doesn't connect to your IDE, your CI/CD, your monitoring dashboards, or the agent that already knows your system.

Linear just announced agents as a core part of their product. Skills, automations, code intelligence, all built into Linear. Their framing is "the shared product system that turns context into execution."

Each of these is a captive agent. It lives inside the vendor's product, operates on the vendor's data, and sees your world through the vendor's lens.

The pitch sounds good in a demo. In practice, you end up with five different AI agents across five different tools, none of which talk to each other, each with a partial view of what's actually happening.

Why your own agent has more context

Think about what your agent already knows when an alert fires.

It's read the service that's failing. It knows the recent changes, can grep for the function that's throwing errors, and can tell you what changed in the last three deploys. It knows the payments service calls the billing service which calls Stripe. If it's been in your repo for a few weeks, it's picked up your deploy cadence, your branch strategy, how you test things. It's seen your postmortems.

An agent with access to Datadog or Grafana can correlate the alert with metrics, logs, and traces before anyone opens a browser tab.

No vendor-built AI will ever have this context. They'd need access to your entire codebase, your deploy history, your monitoring stack, and your team's communication patterns. That's not something you hand to every SaaS tool you use.

The API problem, not the AI problem

When your agent sees an alert, it can diagnose what's wrong. What it can't do without the right API is act on it.

It can't create an incident in your system of record. It can't check who's on call and page them. It can't escalate when no one responds. It can't log what it found to the timeline so the human responder walks in with full context.

This is an integration problem. The agent needs an API that lets it participate in the incident lifecycle the same way a human would.

That's what we built. (If you're weighing whether to build this yourself, we wrote up the three-year TCO math on build vs buy.)

npx @runframe/mcp-server --setup

A tightly scoped MCP server for the incident lifecycle, plus a full REST API. Your agent creates incidents, acknowledges them, pages responders, logs findings, escalates, and drafts postmortems. Doesn't matter if the agent is Claude, GPT, a custom model, or something that doesn't exist yet.
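To make the shape of this concrete, here's a minimal sketch of what a scoped incident-lifecycle tool set and a single MCP tool call could look like. The tool names and argument fields are illustrative assumptions, not Runframe's actual API; the JSON-RPC `tools/call` envelope is the part defined by the MCP spec.

```typescript
// Hypothetical tool set for an incident-lifecycle MCP server.
// Names are illustrative, not Runframe's real tool names.
const incidentTools = [
  "create_incident",
  "acknowledge_incident",
  "page_responder",
  "log_finding",
  "escalate_incident",
  "draft_postmortem",
];

// MCP tool invocations are JSON-RPC 2.0 requests with method "tools/call".
function buildToolCall(id: number, name: string, args: Record<string, unknown>) {
  if (!incidentTools.includes(name)) {
    throw new Error(`unknown tool: ${name}`);
  }
  return {
    jsonrpc: "2.0" as const,
    id,
    method: "tools/call" as const,
    params: { name, arguments: args },
  };
}

const call = buildToolCall(1, "create_incident", {
  title: "Elevated latency on payments service",
  severity: "sev2",
  suspected_cause: "retry logic change in recent deploy",
});
```

Six tools is the whole surface area: the agent discovers them once, cheaply, and every subsequent step is one small request.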

What this looks like in practice

An engineer is working in Cursor. A Datadog alert fires for elevated latency on the payments service.

Their agent, which already has the repo open, checks recent deploys, finds a retry logic change merged two hours ago, and creates an incident in Runframe with the relevant context. It checks who's on call, pages them with a summary that includes the suspected commit, and logs everything to the incident timeline.

The on-call engineer opens Slack, sees the page, and finds a timeline that already contains the alert details, the suspected root cause, the relevant commit, and a link to the diff. They're diagnosing in 30 seconds instead of 10 minutes.

No vendor AI did this. The engineer's own agent did, because it had the context and the API to act.
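The diagnosis half of that flow is simple enough to sketch. This is a toy version of "correlate the alert with recent deploys and draft the page summary" under assumed data shapes; the `Deploy` interface and the three-hour suspect window are illustrative, not anything Runframe prescribes.

```typescript
// Hypothetical sketch of the alert-to-page flow described above.
interface Deploy {
  sha: string;
  summary: string;
  minutesAgo: number;
}

// The most recent deploy inside the alert window is the first suspect.
function findSuspectDeploy(deploys: Deploy[], windowMinutes = 180): Deploy | undefined {
  return deploys
    .filter((d) => d.minutesAgo <= windowMinutes)
    .sort((a, b) => a.minutesAgo - b.minutesAgo)[0];
}

// Draft the summary the on-call engineer sees in the page.
function draftPageSummary(alert: string, suspect?: Deploy): string {
  const cause = suspect
    ? `Suspected cause: ${suspect.summary} (${suspect.sha})`
    : "No recent deploy in window; cause unknown";
  return `${alert}. ${cause}.`;
}

const summary = draftPageSummary(
  "Elevated latency on payments service",
  findSuspectDeploy([
    { sha: "a1b2c3d", summary: "retry logic change", minutesAgo: 120 },
    { sha: "9f8e7d6", summary: "README update", minutesAgo: 2880 },
  ]),
);
```

The real value is that the agent fills in `deploys` from the repo it already has open, which is exactly the context a vendor-hosted AI never sees.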

Captive vs. open: the architectural bet

This is a real architectural decision, not a marketing angle.

Captive agents are built by the vendor, trained on the vendor's data, and locked to the vendor's product. Easy to demo. Hard to extend. When you switch tools, the AI doesn't come with you.

Open agents are yours. They run in your IDE, your CI/CD, your custom pipelines. They use whatever model you want. When you switch vendors, the agent stays.

| | Captive agent | Open agent |
|---|---|---|
| Context | Only what the vendor sees | Your entire codebase + infra |
| Model | Vendor's choice | Your choice |
| Portability | Locked to vendor | Works across tools |
| Customization | Vendor's features | Your workflows |
| Cost | Bundled (opaque) | You control spend |

Cursor, Claude Code, VS Code, Windsurf all support MCP. The agent that helps you write code is the same agent that should help you respond to incidents. The industry is heading that direction whether any individual vendor likes it or not.

"But isn't MCP dead?"

You've seen the posts. Perplexity's CTO moved away from MCP. Eric Holmes wrote "MCP is dead. Long live the CLI." A database MCP server with 106 tools burned 54,600 tokens just on tool discovery before doing anything useful. Security researchers found OAuth flaws, prompt injection vectors, and tool poisoning across open MCP servers.

These are real criticisms. And they mostly apply to MCP servers that shouldn't be MCP servers.

A database with 106 query tools? That's a bad MCP server. Of course the token overhead is brutal. You're asking the agent to discover and evaluate 106 tools it probably doesn't need. A CLI wrapper for git commands? Probably better as a CLI.

Runframe's MCP server is tightly scoped to one domain: the incident lifecycle. Create, acknowledge, escalate, page, resolve. An agent doesn't need to evaluate 106 options. It needs to manage an incident. The tool discovery overhead is minimal because the tool set is focused.
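The back-of-envelope math makes the point, using the numbers cited above. Treating the per-tool token cost as a rough average (an assumption, not a measurement), a six-tool server at the same density would cost an order of magnitude less in discovery overhead:

```typescript
// Discovery overhead cited for the 106-tool database server above.
const dbServer = { tools: 106, discoveryTokens: 54_600 };

// Rough average cost per tool definition in the context window.
const tokensPerTool = dbServer.discoveryTokens / dbServer.tools;

// A scoped incident-lifecycle server with ~6 tools at the same density.
const scopedTools = 6;
const scopedOverhead = Math.round(tokensPerTool * scopedTools);
```

Roughly 515 tokens per tool, so a six-tool server pays about 3,100 tokens for discovery instead of 54,600.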

The critics are right that MCP isn't the answer for everything. But they're wrong that it's dead. 97 million monthly SDK downloads. 17,000+ servers. OpenAI, Google, Microsoft, and AWS all adopted it. The Linux Foundation is stewarding it as an open standard. Bloomberg cut deployment timelines from days to minutes.

What actually died is the hype phase. The "just add MCP to everything" era. What replaced it is pragmatic adoption: use MCP where agent-driven tool discovery matters, use direct APIs where the workflow is stable and known.

Incident management is one of the places where MCP fits well. An agent doesn't know ahead of time whether it'll need to create an incident, or just check who's on call, or escalate. The workflow depends on what's happening. That's what tool discovery is for.

And for teams that prefer direct API calls? We ship a full REST API too. Same capabilities, different interface. Use whatever your agent prefers.
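For the REST path, the same create-incident action is just an HTTP request your pipeline builds directly. This sketch only constructs the request; the endpoint path, field names, and auth scheme are illustrative assumptions rather than Runframe's documented API.

```typescript
// Hypothetical: the same create-incident action as a direct REST call.
// Path, fields, and auth header are assumed shapes, not the real API.
function buildCreateIncidentRequest(baseUrl: string, token: string) {
  return {
    url: `${baseUrl}/v1/incidents`,
    method: "POST" as const,
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      title: "Elevated latency on payments service",
      severity: "sev2",
    }),
  };
}

const req = buildCreateIncidentRequest("https://api.example.com", "rf_token");
```

Same capability as the MCP tool call, no tool discovery step: the right choice when the workflow is fixed ahead of time.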

What we're building first

We're not starting with a Runframe AI agent. We're starting with the API and MCP server that lets your agent operate through us.

Your team is already choosing its AI stack. Claude, GPT, open-source models, custom agents wired into deploy pipelines. The bigger gap right now isn't another vendor AI — it's that your agent can't create an incident, page someone, or write a postmortem.

That's what we're fixing first. An incident management platform that your existing agent can operate through. A system of record with a clean API and MCP support, so the agent you already trust can participate in the incident lifecycle.

That's Runframe.

The bottom line

Linear just called themselves "the shared product system that turns context into execution." That's a good line. Here's ours: Runframe is the incident system of record that your agents operate through.

Not our agent. Yours. We provide the API, the MCP server, the data model, and the notification system. Your agent provides the context.

Every incident management vendor is racing to build their own AI. PagerDuty, incident.io, Rootly, they're all shipping captive agents that live inside their products. We think this gets the architecture wrong. The best AI for your incidents is the one that already knows your code, your deploys, and your team's patterns. That's your agent, not ours.

What your agent needs is access. A clean API and MCP server that lets it participate in the incident lifecycle. We built that.

npx @runframe/mcp-server --setup

One MCP server, scoped to the incident lifecycle. Works with Cursor, Claude Code, VS Code, and Claude Desktop. Here's how to set it up.
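Most MCP clients (Claude Desktop, Cursor, and others) register servers through a `mcpServers` block in their config file. The exact file location varies by client, and the entry below is a sketch of the typical shape rather than Runframe's documented setup:

```json
{
  "mcpServers": {
    "runframe": {
      "command": "npx",
      "args": ["@runframe/mcp-server"]
    }
  }
}
```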

Common questions

Doesn't this mean Runframe has no AI features?
We have AI-powered postmortem drafts (your choice of Claude or GPT). But our primary AI strategy is being the platform your agents interact with, not building a competing agent.
What if I don't use AI agents yet?
Runframe works the same as any incident management tool. Slack integration, on-call scheduling, escalation policies, postmortems. The MCP server and API are there when you're ready.
Which AI models work with the MCP server?
Any model that supports MCP or can make API calls. Claude, GPT, Gemini, open-source models, custom agents. The MCP server is model-agnostic.
Isn't MCP dead?
The hype phase is over. The "add MCP to everything" era. What's left is pragmatic adoption: 97M monthly SDK downloads, Linux Foundation governance, adoption by every major AI provider. MCP makes sense where workflows are dynamic and tool discovery matters. Incident management is exactly that. For stable pipelines, we also ship a full REST API.
How is this different from just having an API?
The MCP server handles tool discovery and structured inputs/outputs — it's a higher-level interface than raw REST calls, built for how agents actually work. If your agent prefers direct API calls, the v1 REST API covers the same capabilities.
What about data privacy? Does my agent send incident data to a model?
Your agent, your model, your data flow. We don't sit in the middle. The MCP server talks to Runframe's API. What your agent does with the data depends on your model provider and your configuration.


