
Your Agent Can Manage Incidents Now

We shipped an MCP server for managing incidents from Claude Code and Cursor. On-call, escalation, paging, and postmortems. Here's how we designed it for agents that live in your IDE.

Runframe Team · Mar 16, 2026 · 8 min read

An engineer on your team gets a Datadog alert while writing code in Cursor. Without switching tabs, their agent checks who's on call, acknowledges the incident, investigates recent deploys, pages the right responder, and logs everything to the timeline.

That's not a demo. That's what Runframe's MCP server does in Cursor and Claude Code today.

npx @runframe/mcp-server --setup

Works with Cursor, Claude Code, VS Code, and Claude Desktop.

Every incident management tool today assumes a human is clicking through every step. We built the MCP server for the workflows where that's no longer true, where an agent does the coordination and the engineer makes the calls.

What's in the box

Here's what we ship.

Incidents (9 tools):

  • list_incidents — filter by status, severity, team
  • get_incident — full details with timeline and participants
  • create_incident — spin one up from an alert
  • update_incident — change severity, assignment, description
  • change_incident_status — move through the workflow (investigating → fixing → resolved)
  • acknowledge_incident — ack it, auto-assign, track SLA
  • add_incident_event — log findings to the timeline
  • escalate_incident — escalate through the policy
  • page_someone — page a responder via Slack or email

On-call (1 tool):

  • get_current_oncall — who's on call right now, filterable by team

Services (2 tools):

  • list_services — search across services
  • get_service — details plus on-call instructions

Postmortems (2 tools):

  • create_postmortem — draft with root cause and action items
  • get_postmortem — pull up what happened

Teams (2 tools):

  • list_teams — see all teams
  • get_escalation_policy — who gets paged at each level
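Under the hood, each of these is a standard MCP tool call: the client sends a JSON-RPC 2.0 `tools/call` request with the tool name and structured arguments. Here's a minimal sketch of what that looks like for `list_incidents` (the filter values are hypothetical, not Runframe's documented parameter names):

```python
import json

# Illustrative MCP tools/call request, per the JSON-RPC 2.0 framing
# the Model Context Protocol uses. Argument names are hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "list_incidents",
        "arguments": {
            "status": "investigating",
            "severity": "SEV2",
            "team": "payments",
        },
    },
}

print(json.dumps(request, indent=2))
```

The agent never writes this JSON by hand; the MCP client in Cursor or Claude Code builds it from the tool's input schema.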

How an agent runs an incident

A Datadog alert fires for elevated API latency on the payments service.

First thing the agent does is call get_incident. SEV2, payments service, opened 3 minutes ago. The monitoring integration already logged the trigger on the timeline.

Then get_current_oncall, filtered to the payments team. Gets back the primary on-call engineer.

acknowledge_incident. The incident moves to "investigating." SLA clock starts. The rest of the team can see someone's on it.

The agent pulls logs from Datadog (separate MCP server), checks recent commits in the codebase, and finds a deploy 20 minutes ago that changed the payment retry logic. It calls add_incident_event with what it found: "Likely caused by deploy #1847, payment retry logic change at 14:32 UTC. Error rate spiked 4 minutes after deploy."

page_someone. The on-call engineer gets a Slack DM and email with the full context and the agent's findings. They don't start from zero.

change_incident_status to "fixing." The timeline has the whole story. When the fix ships, the engineer resolves it, or the CI/CD pipeline does via the API.

Later, create_postmortem with the root cause, timeline, and suggested action items. The engineer reviews and edits instead of writing from scratch.

A handful of calls. The agent did the running around. The engineer decided what to actually do about it.
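The walkthrough above boils down to an ordered sequence of tool calls. Here's a sketch of that sequence (the incident ID and argument names are hypothetical, shown only to make the shape of the flow concrete):

```python
# Illustrative call sequence for the incident walkthrough above.
# "inc_123" and the argument keys are hypothetical placeholders.
steps = [
    ("get_incident", {"incident_id": "inc_123"}),
    ("get_current_oncall", {"team": "payments"}),
    ("acknowledge_incident", {"incident_id": "inc_123"}),
    ("add_incident_event", {
        "incident_id": "inc_123",
        "message": "Likely caused by deploy #1847, payment retry logic change.",
    }),
    ("page_someone", {"incident_id": "inc_123", "responder": "primary-oncall"}),
    ("change_incident_status", {"incident_id": "inc_123", "status": "fixing"}),
    ("create_postmortem", {"incident_id": "inc_123"}),
]

for tool, args in steps:
    print(f"{tool}({args})")
```

Seven calls, alert to postmortem draft, with the engineer approving the consequential ones.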

Why we kept the tool set small

Most incident management MCP servers fall into two camps: auto-generated (every API endpoint becomes a tool, you end up with 70-100 in context) or hand-crafted but sprawling (30-70 tools covering every possible use case). Agents struggle with both.

Each tool definition costs 200-400 tokens (name, description, input schema). A server with 70+ tools burns tens of thousands of tokens before the agent even starts on your problem.
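The arithmetic is worth doing explicitly. Using the 200-400 tokens-per-tool figure, here's the context budget for a 16-tool server like ours versus a 70-tool one:

```python
# Back-of-the-envelope context budget from the per-tool figure above.
def tool_budget(n_tools, per_tool=(200, 400)):
    """Return (low, high) token cost of loading n_tools definitions."""
    return tuple(n_tools * t for t in per_tool)

for n in (16, 70):
    low, high = tool_budget(n)
    print(f"{n} tools: {low:,}-{high:,} tokens before the agent starts working")
```

That's roughly 3K-6K tokens for our server against 14K-28K for a sprawling one, paid on every conversation that loads the server.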

But the token cost is only part of it. The fewer tools an agent has to choose from, the more reliably it picks the right one. When there's one way to list incidents and one way to get an incident, the agent doesn't have to guess between list_incidents, get_incidents, search_incidents, and query_incidents.

We started with the workflow (what does an agent need to run an incident from alert to postmortem?) and worked backward to the tool set. No bulk operations. No user management. No webhook CRUD. No billing endpoints. If it doesn't help an agent run an incident, it stays out.

MCP works when you design for agents

There's a growing chorus that MCP is overhyped. That agents can't reliably use tools. That the whole thing is a gimmick.

We think it comes down to design. MCP (Model Context Protocol) does exactly what it says: lets an agent call tools with structured inputs and get structured outputs back. When an MCP server has well-named, well-described tools scoped to a single workflow, agents use them reliably. We've tested it.

The trick is treating tool design the same way you'd treat API design. Clear names. Descriptions written for LLMs, not humans reading docs. Each tool answers one question an agent would actually ask.

Getting started

Interactive setup (walks you through it):

npx @runframe/mcp-server --setup

Claude Code:

claude mcp add runframe -e RUNFRAME_API_KEY=rf_your_key -- npx -y @runframe/mcp-server

Cursor / VS Code, add to your MCP config:

{
  "mcpServers": {
    "runframe": {
      "command": "npx",
      "args": ["-y", "@runframe/mcp-server"],
      "env": { "RUNFRAME_API_KEY": "rf_your_key" }
    }
  }
}

Get your API key from Settings → API Keys. Keys support scoped permissions, so grant each key only what it needs.

Start a free 28-day trial at runframe.io, no credit card required. MCP is included. The server is MIT licensed, with source on GitHub.

What's next

We're going to be laser-focused on adding only what agents actually need. If a tool doesn't make an agent better at handling incidents, it doesn't ship.

On the short list:

  • Slack channel tools (create incident channels, post updates)
  • Analytics (MTTR trends, incident frequency by service)
  • Incident templates

That's it for now. We'd rather have 20 tools that work than 70 that look good in a README.

Common questions

What about write safety?
Tools that send real notifications (like page_someone and escalate_incident) are clearly marked as destructive in their descriptions, so the agent knows to confirm before firing them. API keys are scoped, so you can give a key read-only access if you want.
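Besides the description text, the MCP spec also defines machine-readable tool annotations (`destructiveHint`, `readOnlyHint`) that clients can use to decide when to ask for confirmation. A sketch of what a paging tool's definition might look like with both signals (the description copy and schema here are illustrative, not Runframe's actual definitions):

```python
# Illustrative MCP tool definition for a side-effecting tool.
# The annotations block uses the MCP spec's ToolAnnotations hints;
# description text and schema fields are hypothetical examples.
page_someone = {
    "name": "page_someone",
    "description": (
        "Page a responder via Slack or email. Destructive: sends a real "
        "notification. Confirm with the user before calling."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "incident_id": {"type": "string"},
            "responder": {"type": "string"},
        },
        "required": ["incident_id", "responder"],
    },
    "annotations": {"destructiveHint": True, "readOnlyHint": False},
}
```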
Can I self-host it?
The MCP server runs locally via stdio (default) or as an HTTP server you deploy yourself. There's a Dockerfile included. The server calls Runframe's API, so your data stays in Runframe. The MCP server doesn't store anything.
Is there an HTTP transport for CI/CD pipelines?
Yes. Run with --transport http --port 3100. It takes a bearer token for auth, supports multiple clients, and is stateless so you can load-balance it.
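In a pipeline, a step can hit the HTTP transport directly with a JSON-RPC `tools/call` request. A sketch of what that request might look like (the `/mcp` path, token, and argument names are hypothetical; check the server's docs for the real endpoint):

```python
import json

# Illustrative CI/CD request to the HTTP transport. The endpoint path,
# bearer token, and argument names are hypothetical placeholders.
url = "http://localhost:3100/mcp"
headers = {
    "Authorization": "Bearer rf_your_key",
    "Content-Type": "application/json",
}
payload = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "change_incident_status",
        "arguments": {"incident_id": "inc_123", "status": "resolved"},
    },
})

# A real pipeline would POST `payload` with `headers` via curl or
# urllib.request; here we just show the shape of the call.
print(f"POST {url}")
print(payload)
```

Because the transport is stateless, each pipeline step can make an independent request against any instance behind the load balancer.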

Your agent is already in the IDE. Now it has an incident management layer that keeps up.

Get started →


