Incident Automation
The use of technology to perform tasks with reduced human assistance.
"Robots Do It Better"
Automation is the application of code to remove manual effort. In Incident Management, it saves minutes when every second counts.
What to Automate
- Detection: Alerts that fire automatically when monitoring detects a problem.
- Diagnostics: Scripts that run automatically when an alert fires to gather logs/graphs.
- Remediation: Auto-restarting bad pods, auto-scaling clusters.
- Administration: Creating Slack channels, Jira tickets, and Zoom links.
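The diagnostics item above can be sketched as a small dispatch table: when an alert fires, every collector registered for it runs and the results are bundled into one report. This is a minimal illustration under assumptions, not any particular tool's API. The names `register`, `on_alert`, and `DIAGNOSTICS` are hypothetical, and a real collector would query your own logging or metrics systems.

```python
from typing import Callable, Dict, List

# Hypothetical registry: alert name -> diagnostic collectors to run for it.
Collector = Callable[[], str]
DIAGNOSTICS: Dict[str, List[Collector]] = {}


def register(alert_name: str, collector: Collector) -> None:
    """Attach a diagnostic collector to an alert name."""
    DIAGNOSTICS.setdefault(alert_name, []).append(collector)


def on_alert(alert_name: str) -> Dict[str, str]:
    """Run every collector registered for the alert and return their output."""
    report: Dict[str, str] = {}
    for collector in DIAGNOSTICS.get(alert_name, []):
        try:
            report[collector.__name__] = collector()
        except Exception as exc:
            # One failing collector should not block the others.
            report[collector.__name__] = f"collector failed: {exc}"
    return report
```

Catching errors per collector is a deliberate choice here: diagnostics run while something is already broken, so a partial report is likely more useful than none.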
The Automation Paradox
Automation saves time, but it takes time to build. You must weigh the "Return on Investment" (ROI). If a task takes 1 minute and happens once a year, don't spend 2 weeks automating it.
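A rough break-even calculation makes the trade-off concrete. The sketch below assumes "2 weeks" means 80 working hours (two 40-hour weeks) and ignores maintenance cost; the function name `break_even_years` is illustrative.

```python
import math


def break_even_years(build_hours: float,
                     minutes_saved_per_run: float,
                     runs_per_year: float) -> float:
    """Years until the time saved equals the time spent building."""
    minutes_saved_per_year = minutes_saved_per_run * runs_per_year
    if minutes_saved_per_year <= 0:
        return math.inf  # nothing is saved, so it never pays back
    return build_hours * 60 / minutes_saved_per_year


# The example from the text: a 1-minute task, once a year, 2 weeks to automate.
print(break_even_years(build_hours=80, minutes_saved_per_run=1, runs_per_year=1))
# → 4800.0
```

Under these assumptions the automation would take 4,800 years to pay for itself, which is why rare, quick tasks are poor candidates.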
Example: The Slack Bot
"Setting up a war room (invite people, create doc, create channel) took 15 minutes of clicking."
Impact
Wasted time during SEV1s.
Resolution
Team built a `/incident start` bot. It does all 15 minutes of work in 5 seconds.
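The core of such a bot might look like the sketch below. It is not the team's actual implementation: `start_incident`, `WarRoom`, and the three injected callables (`create_channel`, `create_doc`, `invite`) are hypothetical stand-ins for whatever chat and document integrations you use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WarRoom:
    channel: str
    doc_url: str
    invited: List[str]


def start_incident(title: str,
                   responders: List[str],
                   create_channel: Callable[[str], str],
                   create_doc: Callable[[str], str],
                   invite: Callable[[str, str], None]) -> WarRoom:
    """Perform the war-room setup steps in order and return what was created."""
    channel = create_channel(title)   # e.g. returns a channel name
    doc_url = create_doc(title)       # e.g. returns a link to the incident doc
    for person in responders:
        invite(channel, person)       # the channel must exist before inviting
    return WarRoom(channel=channel, doc_url=doc_url, invited=list(responders))
```

Passing the integrations in as functions keeps the sequence testable with fakes, so the bot can be exercised without touching real systems.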
Why Incident Automation Matters
Automation reduces toil, minimizes human error, and speeds up incident response.
The goal of SRE is to automate this year's job away.
Common Pitfalls
Automating Broken Processes
If your manual process is bad, your automated process will just be bad faster. Simplify the process first.
Frequently Asked Questions
Can automation be dangerous?
Yes. Automated remediation (e.g., auto-restart) can create "fail loops" if it is not bounded: the automation keeps repeating an action that never fixes the underlying problem. Always have a kill switch.
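One way to bound automated remediation is to combine a kill switch with a restart budget. The sketch below is a minimal illustration under assumptions: `GuardedRestarter` is a hypothetical name, the limits (3 restarts per 10 minutes) are arbitrary, and `restart` is whatever function actually restarts your service.

```python
import time
from collections import deque
from typing import Callable, Deque


class GuardedRestarter:
    """Auto-restart with a kill switch and a cap on restarts per time window."""

    def __init__(self,
                 restart: Callable[[str], None],
                 max_restarts: int = 3,
                 window_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._restart = restart
        self._max_restarts = max_restarts
        self._window = window_seconds
        self._clock = clock
        self._history: Deque[float] = deque()
        self.enabled = True  # kill switch: set to False to stop all automation

    def try_restart(self, target: str) -> bool:
        """Restart the target if allowed; return whether a restart happened."""
        if not self.enabled:
            return False
        now = self._clock()
        # Forget restarts that fall outside the time window.
        while self._history and now - self._history[0] > self._window:
            self._history.popleft()
        if len(self._history) >= self._max_restarts:
            # Budget exhausted: likely a fail loop, so trip the switch
            # and leave the decision to a human.
            self.enabled = False
            return False
        self._history.append(now)
        self._restart(target)
        return True
```

Once the budget is exhausted the automation disables itself rather than retrying forever, and a person has to set `enabled` back to `True` after investigating.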