The pattern shows up everywhere. Startups with 15 people, scale-ups with 180, enterprise teams with dedicated SRE orgs.
The outage itself isn't the hard part. What's hard is the minute after the alert fires. Who's leading this? Which Slack channel? What do we tell support? The fix is usually quick. The coordination tax is what kills you.
A technical lead at a 60-person company put it to me this way:
"Our worst incidents aren't the ones that take longest to fix. They're the ones where communication breaks down. Three people debug the same thing, support doesn't know we're working on it, management asks for updates every ten minutes because they haven't heard anything."
Teams experimenting with IDE-native and workflow-native automation are starting to route that coordination work to agents. I'm seeing engineers get a Datadog alert in Cursor, and the agent handles the coordination: it checks who's on call, acknowledges, pulls recent deploys, pages the right person, and logs everything. The engineer never leaves the editor.
The coordination work still happens. It happens because someone decided the rules in advance. (If you're still figuring out what that looks like as your team grows, our scaling guide covers the stages most teams go through.)
What agents do today
- Check who's on call
- Acknowledge incidents, start the SLA clock
- Pull context: deploys, logs, related incidents
- Page the right person, with full context
- Log everything to the timeline
- Escalate if no one responds
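Concretely, that coordination pass can be thought of as one scripted loop. Here's a minimal sketch, assuming a hypothetical incident client; none of these function names come from a real Datadog or Runframe API.

```python
# Illustrative sketch: the client object and every method on it are hypothetical,
# not a real Datadog or Runframe API.
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    summary: str
    severity: str


def coordinate(alert: Alert, client) -> None:
    """One coordination pass: everything around the fix, none of the fix itself."""
    responder = client.lookup_on_call(alert.service)   # check who's on call
    incident = client.acknowledge(alert, responder)    # ack, start the SLA clock

    # Gather context before paging, so the responder wakes up to a full picture.
    context = {
        "recent_deploys": client.recent_deploys(alert.service, minutes=60),
        "related_incidents": client.related_incidents(alert.service),
        "log_sample": client.log_sample(alert.service, minutes=15),
    }

    client.page(responder, incident, context)                       # page with full context
    client.log_timeline(incident, actor="agent", data=context)      # log everything
    client.escalate_if_unacknowledged(incident, after_minutes=10)   # escalate if no response
```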
One engineer described their first agent-handled SEV2:
"I woke up to a full picture. Agent had pulled the deploy from 20 minutes ago, logged the spike, paged me. I fixed it in ten minutes instead of spending thirty figuring out what was happening."
The agent didn't fix anything. It removed the friction around figuring out what needed fixing.
I've seen the failures too. An agent pages the wrong person because two engineers have similar names. An agent classifies a database slowdown as SEV0 when it's a SEV2 because it doesn't understand customer impact.
Your new job: deciding delegation boundaries
An engineering manager told me:
"We have an on-call schedule in a Google Sheet. Problem is, nobody looks at it. When something breaks at 2 AM, everyone waits for someone else to speak up. You lose 20 minutes."
The agent doesn't wait. It doesn't wonder who's on call. It doesn't forget to log things.
But you need to answer this before the next incident: what can the agent do without asking?
Clear to delegate:
- Acknowledging incidents
- On-call lookup
- Timeline logging
- Context gathering
- Paging with full context
Keep for humans:
- Rollback decisions
- Customer comms
- Postmortem conclusions
- Declaring "we're resolved"
The gray zone:
- Escalation timing: pick a number and write it down
- Severity classification: agents triage, humans confirm for customer-facing issues
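One way to keep those boundaries from living in someone's head is to write them as a small policy the agent checks before every action. Here's a sketch of what that could look like; the action names and structure are made up for illustration, not a Runframe configuration format.

```python
# Illustrative delegation policy: action names and structure are invented
# for this sketch, not a real configuration format.
DELEGATION_POLICY = {
    # Clear to delegate: the agent acts on its own.
    "acknowledge_incident": {"requires_human": False},
    "lookup_on_call":       {"requires_human": False},
    "log_timeline":         {"requires_human": False},
    "gather_context":       {"requires_human": False},
    "page_responder":       {"requires_human": False},
    # Keep for humans: the agent can propose, never execute.
    "rollback_deploy":      {"requires_human": True},
    "send_customer_comms":  {"requires_human": True},
    "resolve_incident":     {"requires_human": True},
    # Gray zone: delegated, but with a number written down.
    "escalate":             {"requires_human": False, "after_minutes": 15},
    "classify_severity":    {"requires_human": False, "confirm_if_customer_facing": True},
}


def agent_may_act(action: str) -> bool:
    """True if the agent can take this action without asking a human."""
    rule = DELEGATION_POLICY.get(action)
    # Anything not listed is an undefined boundary: default to asking.
    return bool(rule) and not rule["requires_human"]
```

The useful property is the default: anything you haven't listed falls back to asking a human.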
This is the job now. Deciding what gets delegated and what doesn't, under uncertainty, before you're in the middle of an incident.
What if your agent is better at coordination than your team?
It doesn't forget timelines. It doesn't wonder who's on call. It doesn't get flustered at 3 AM.
Your humans sometimes skip timelines because they're busy. They page the wrong people when rushing. They forget to check recent deploys.
One team told me the agent didn't improve their incidents. It exposed how broken the process already was. Nobody can remember ten steps at 2 AM. The process was asking humans to do machine work.
I see the same ritual everywhere: teams spend 30 minutes after every incident arguing about whose fault it was. Not what went wrong. Whose fault. That time adds zero value. The agent doesn't care about fault. It logs what happened and moves on.
Your process might be set up so humans fail at exactly the things machines are good at. That's worth fixing before the next incident.
Fewer tools, not more
The instinct is to give agents more tools. It's the wrong instinct: more tools doesn't mean better agents.
More tools means more ambiguity. Seventy tools means the agent has to choose between list_incidents, get_incidents, search_incidents, and query_incidents. Four tools that do the same thing with different names. Now the agent is guessing.
Narrower action surfaces make agents more dependable. This is why we kept Runframe's MCP server deliberately narrow: sixteen tools around incident workflows, not a generic admin surface. If it doesn't help run an incident, it's out. (We wrote about how the MCP server works in detail.)
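To make the contrast concrete, here's an illustrative registry; these are not the actual sixteen tools, just the shape of a narrow surface: one unambiguous name per job an agent has during an incident.

```python
# Illustrative only: a made-up tool registry, not the actual Runframe MCP server.
# Wide surface: the agent has to guess between list_incidents, get_incidents,
# search_incidents, and query_incidents. Narrow surface: one name per job.
TOOLS = {
    "find_incidents":  "Search incidents by service, severity, or time range.",
    "get_incident":    "Fetch one incident with its full timeline.",
    "acknowledge":     "Acknowledge an incident and start the SLA clock.",
    "page_on_call":    "Page the current on-call for a service, with context attached.",
    "append_timeline": "Add an entry to the incident timeline.",
}
```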
A VP of Engineering told me:
"I opened the setup guide and it was 40+ pages. How many severity levels? What's our escalation policy? We're 30 people. I don't know. So I closed the tab."
Enterprise tools are comprehensive. Comprehensive means complex. And most teams sit in the gap: too big for spreadsheets, too small for ServiceNow. (Build or buy covers this decision in more depth.)
When the agent beats the script
One team had a 380-line Python script running their incident flow. No comments. Variables like ch_id and usr_grp_2. The person who wrote it had left six months ago. Last month it created 11 channels for the same incident. Nobody touched it because everyone was scared of the code.
They replaced it with an agent. The incidents got manageable. More importantly, they finally saw how broken the underlying process was. They'd been hiding behind the script.
So what do you do?
Start with low-severity incidents. Let agents handle SEV3s. Don't touch SEV0s until you've watched it work a dozen times.
Write down the delegation boundaries. Not runbooks that say "do X then Y," but guardrails: never do X without human approval, escalate after Y minutes, Z requires sign-off. Test those boundaries before you need them; a sketch of what that testing can look like follows.
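Testing the boundaries can be as unglamorous as a handful of assertions run in CI, written before an incident ever exercises them. A sketch, assuming the agent_may_act policy check from the earlier example (the module name is hypothetical):

```python
# Sketch: exercises the delegation policy from the earlier example before an
# incident does. The module name is hypothetical.
from delegation_policy import agent_may_act


def test_rollback_requires_human_approval():
    assert not agent_may_act("rollback_deploy")


def test_context_gathering_is_fully_delegated():
    assert agent_may_act("gather_context")


def test_undefined_actions_default_to_asking():
    # If the boundary isn't defined, the agent should ask, not improvise.
    assert not agent_may_act("delete_production_database")
```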
Use agent failures to find process holes. Every time an agent does something unexpected, you learn something about what you hadn't defined.
Your job is designing the boundaries that decide what the system can do and what it shouldn't.
If you haven't defined the boundary, the agent will.
Runframe is the incident management platform we built around these ideas. runframe.io