
Your Agent Can Manage Incidents Now

We shipped an MCP server for managing incidents from Claude Code and Cursor. On-call, escalation, paging, and postmortems. Here's how we designed it for agents that live in your IDE.

Runframe Team · Mar 16, 2026 · 8 min read

An engineer on your team gets a Datadog alert while writing code in Cursor. Without switching tabs, their agent checks who's on call, acknowledges the incident, investigates recent deploys, pages the right responder, and logs everything to the timeline.

That's not a demo. That's what Runframe's MCP server does in Cursor and Claude Code today.

npx @runframe/mcp-server --setup

Works with Cursor, Claude Code, VS Code, and Claude Desktop.

Every incident management tool today assumes a human is clicking through every step. We built the MCP server for the workflows where that's no longer true, where an agent does the coordination and the engineer makes the calls.

What's in the box

Here's what we ship.

Incidents (9 tools):

  • list_incidents — filter by status, severity, team
  • get_incident — full details with timeline and participants
  • create_incident — spin one up from an alert
  • update_incident — change severity, assignment, description
  • change_incident_status — move through the workflow (investigating → fixing → resolved)
  • acknowledge_incident — ack it, auto-assign, track SLA
  • add_incident_event — log findings to the timeline
  • escalate_incident — escalate through the policy
  • page_someone — page a responder via Slack or email

On-call (1 tool):

  • get_current_oncall — who's on call right now, filterable by team

Services (2 tools):

  • list_services — search across services
  • get_service — details plus on-call instructions

Postmortems (2 tools):

  • create_postmortem — draft with root cause and action items
  • get_postmortem — pull up what happened

Teams (2 tools):

  • list_teams — see all teams
  • get_escalation_policy — who gets paged at each level
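Under the hood, each of these is a standard MCP tool call: the client sends a JSON-RPC 2.0 `tools/call` request with the tool name and structured arguments. Here's a minimal sketch of what that looks like for `list_incidents` (the filter values are hypothetical, not Runframe's documented parameter names):

```python
import json

# Illustrative MCP tools/call request, per the JSON-RPC 2.0 framing
# the Model Context Protocol uses. Argument names are hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "list_incidents",
        "arguments": {
            "status": "investigating",
            "severity": "SEV2",
            "team": "payments",
        },
    },
}

print(json.dumps(request, indent=2))
```

The agent never writes this JSON by hand; the MCP client in Cursor or Claude Code builds it from the tool's input schema.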

How an agent runs an incident

A Datadog alert fires for elevated API latency on the payments service.

First thing the agent does is call get_incident. SEV2, payments service, opened 3 minutes ago. The monitoring integration already logged the trigger on the timeline.

Then get_current_oncall, filtered to the payments team. Gets back the primary on-call engineer.

acknowledge_incident. The incident moves to "investigating." SLA clock starts. The rest of the team can see someone's on it.

The agent pulls logs from Datadog (separate MCP server), checks recent commits in the codebase, and finds a deploy 20 minutes ago that changed the payment retry logic. It calls add_incident_event with what it found: "Likely caused by deploy #1847, payment retry logic change at 14:32 UTC. Error rate spiked 4 minutes after deploy."

page_someone. The on-call engineer gets a Slack DM and email with the full context and the agent's findings. They don't start from zero.

change_incident_status to "fixing." The timeline has the whole story. When the fix ships, the engineer resolves it, or the CI/CD pipeline does via the API.

Later, create_postmortem with the root cause, timeline, and suggested action items. The engineer reviews and edits instead of writing from scratch.

A handful of calls. The agent did the running around. The engineer decided what to actually do about it.
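The walkthrough above boils down to an ordered sequence of tool calls. Here's a sketch of that sequence (the incident ID and argument names are hypothetical, shown only to make the shape of the flow concrete):

```python
# Illustrative call sequence for the incident walkthrough above.
# "inc_123" and the argument keys are hypothetical placeholders.
steps = [
    ("get_incident", {"incident_id": "inc_123"}),
    ("get_current_oncall", {"team": "payments"}),
    ("acknowledge_incident", {"incident_id": "inc_123"}),
    ("add_incident_event", {
        "incident_id": "inc_123",
        "message": "Likely caused by deploy #1847, payment retry logic change.",
    }),
    ("page_someone", {"incident_id": "inc_123", "responder": "primary-oncall"}),
    ("change_incident_status", {"incident_id": "inc_123", "status": "fixing"}),
    ("create_postmortem", {"incident_id": "inc_123"}),
]

for tool, args in steps:
    print(f"{tool}({args})")
```

Seven calls, alert to postmortem draft, with the engineer approving the consequential ones.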

Why we kept the tool set small

Most incident management MCP servers fall into two camps: auto-generated (every API endpoint becomes a tool, you end up with 70-100 in context) or hand-crafted but sprawling (30-70 tools covering every possible use case). Agents struggle with both.

Each tool definition costs 200-400 tokens (name, description, input schema). A server with 70+ tools burns tens of thousands of tokens before the agent even starts on your problem.
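The arithmetic is worth doing explicitly. Using the 200-400 tokens-per-tool figure, here's the context budget for a 16-tool server like ours versus a 70-tool one:

```python
# Back-of-the-envelope context budget from the per-tool figure above.
def tool_budget(n_tools, per_tool=(200, 400)):
    """Return (low, high) token cost of loading n_tools definitions."""
    return tuple(n_tools * t for t in per_tool)

for n in (16, 70):
    low, high = tool_budget(n)
    print(f"{n} tools: {low:,}-{high:,} tokens before the agent starts working")
```

That's roughly 3K-6K tokens for our server against 14K-28K for a sprawling one, paid on every conversation that loads the server.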

But the token cost is only part of it. The fewer tools an agent has to choose from, the more reliably it picks the right one. When there's one way to list incidents and one way to get an incident, the agent doesn't have to guess between list_incidents, get_incidents, search_incidents, and query_incidents.

We started with the workflow (what does an agent need to run an incident from alert to postmortem?) and worked backward to the tool set. No bulk operations. No user management. No webhook CRUD. No billing endpoints. If it doesn't help an agent run an incident, it stays out.

MCP works when you design for agents

There's a growing chorus that MCP is overhyped. That agents can't reliably use tools. That the whole thing is a gimmick.

We think it comes down to design. MCP (Model Context Protocol) does exactly what it says: lets an agent call tools with structured inputs and get structured outputs back. When an MCP server has well-named, well-described tools scoped to a single workflow, agents use them reliably. We've tested it.

The trick is treating tool design the same way you'd treat API design. Clear names. Descriptions written for LLMs, not humans reading docs. Each tool answers one question an agent would actually ask.

Getting started

Interactive setup (walks you through it):

npx @runframe/mcp-server --setup

Claude Code:

claude mcp add runframe -e RUNFRAME_API_KEY=rf_your_key -- npx -y @runframe/mcp-server

Cursor / VS Code, add to your MCP config:

{
  "mcpServers": {
    "runframe": {
      "command": "npx",
      "args": ["-y", "@runframe/mcp-server"],
      "env": { "RUNFRAME_API_KEY": "rf_your_key" }
    }
  }
}

Get your API key from Settings → API Keys. Keys support scoped permissions, so grant each key only what it needs.

Start a free 28-day trial at runframe.io, no credit card required. MCP is included. The server is MIT licensed, with source on GitHub.

What's next

We're going to be laser-focused on adding only what agents actually need. If a tool doesn't make an agent better at handling incidents, it doesn't ship.

On the short list:

  • Slack channel tools (create incident channels, post updates)
  • Analytics (MTTR trends, incident frequency by service)
  • Incident templates

That's it for now. We'd rather have 20 tools that work than 70 that look good in a README.

Common questions

What about write safety?
Tools that send real notifications (like page_someone and escalate_incident) are clearly marked as destructive in their descriptions, so the agent knows to confirm before firing them. API keys are scoped, so you can give a key read-only access if you want.
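Besides the description text, the MCP spec also defines machine-readable tool annotations (`destructiveHint`, `readOnlyHint`) that clients can use to decide when to ask for confirmation. A sketch of what a paging tool's definition might look like with both signals (the description copy and schema here are illustrative, not Runframe's actual definitions):

```python
# Illustrative MCP tool definition for a side-effecting tool.
# The annotations block uses the MCP spec's ToolAnnotations hints;
# description text and schema fields are hypothetical examples.
page_someone = {
    "name": "page_someone",
    "description": (
        "Page a responder via Slack or email. Destructive: sends a real "
        "notification. Confirm with the user before calling."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "incident_id": {"type": "string"},
            "responder": {"type": "string"},
        },
        "required": ["incident_id", "responder"],
    },
    "annotations": {"destructiveHint": True, "readOnlyHint": False},
}
```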
Can I self-host it?
The MCP server runs locally via stdio (default) or as an HTTP server you deploy yourself. There's a Dockerfile included. The server calls Runframe's API, so your data stays in Runframe. The MCP server doesn't store anything.
Is there an HTTP transport for CI/CD pipelines?
Yes. Run with --transport http --port 3100. It takes a bearer token for auth, supports multiple clients, and is stateless so you can load-balance it.
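In a pipeline, a step can hit the HTTP transport directly with a JSON-RPC `tools/call` request. A sketch of what that request might look like (the `/mcp` path, token, and argument names are hypothetical; check the server's docs for the real endpoint):

```python
import json

# Illustrative CI/CD request to the HTTP transport. The endpoint path,
# bearer token, and argument names are hypothetical placeholders.
url = "http://localhost:3100/mcp"
headers = {
    "Authorization": "Bearer rf_your_key",
    "Content-Type": "application/json",
}
payload = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "change_incident_status",
        "arguments": {"incident_id": "inc_123", "status": "resolved"},
    },
})

# A real pipeline would POST `payload` with `headers` via curl or
# urllib.request; here we just show the shape of the call.
print(f"POST {url}")
print(payload)
```

Because the transport is stateless, each pipeline step can make an independent request against any instance behind the load balancer.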

Your agent is already in the IDE. Now it has an incident management layer that keeps up.

Get started →


