Learn DevOps & SRE Concepts.
Essential incident management and reliability engineering terms explained. From MTTR to SLOs, master the concepts that power elite engineering teams.
Complete Glossary
41 terms across 5 categories
Core Metrics (4 terms)
The average time it takes to fully resolve an incident from detection to service restoration.
The average time from when an alert fires to when a human acknowledges it.
The average time from when an issue occurs to when an alert fires.
The average time between system failures or incidents.
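The four averages above can be computed from incident timestamps. The following is a minimal sketch, not a definitive implementation: the record fields (`started`, `alerted`, `acknowledged`, `resolved`) and the sample data are hypothetical, and the time between failures is simplified to the gap between consecutive incident start times.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the field names are assumptions for illustration.
incidents = [
    {
        "started": datetime(2024, 1, 1, 10, 0),       # issue begins
        "alerted": datetime(2024, 1, 1, 10, 5),       # alert fires (detection)
        "acknowledged": datetime(2024, 1, 1, 10, 8),  # human acknowledges
        "resolved": datetime(2024, 1, 1, 11, 5),      # service restored
    },
    {
        "started": datetime(2024, 1, 11, 10, 0),
        "alerted": datetime(2024, 1, 11, 10, 3),
        "acknowledged": datetime(2024, 1, 11, 10, 10),
        "resolved": datetime(2024, 1, 11, 10, 33),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average duration in minutes between two timestamps across all records."""
    durations = [
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    ]
    return mean(durations)


def mean_hours_between_failures(records):
    """Average gap in hours between consecutive incident starts.

    Simplifying assumption: measured start-to-start, needs at least two incidents.
    """
    starts = sorted(r["started"] for r in records)
    gaps = [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(starts, starts[1:])
    ]
    return mean(gaps)


time_to_detect = mean_minutes(incidents, "started", "alerted")            # 4.0
time_to_acknowledge = mean_minutes(incidents, "alerted", "acknowledged")  # 5.0
time_to_resolve = mean_minutes(incidents, "alerted", "resolved")          # 45.0
time_between_failures = mean_hours_between_failures(incidents)            # 240.0
```

Time to resolve is measured here from the alert, matching the "from detection to service restoration" wording above. Teams that measure from the start of the issue would pass `"started"` as the start key instead.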
Roles (6 terms)
The person responsible for all high-level coordination and decision-making during an incident.
The person responsible for all internal and external communication during an incident.
The person responsible for accurately documenting the timeline, actions, and decisions during an incident.
The technical specialist responsible for diagnosing and fixing the specific service or component causing the incident.
A schedule that determines which engineer is responsible for answering alerts during a specific time period.
A discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.
Severity (5 terms)
Critical emergency. The system is completely unusable for a significant portion of users, or data integrity is at risk.
Major incident. Significant functionality is broken or degraded, but a workaround may exist or the impact is partial.
Degraded performance or minor functionality broken for some users. Workarounds may exist.
Minor bug or confusing non-critical issue. No immediate user impact.
Cosmetic issues, typos, or internal-only problems. Zero user impact.
Processes (9 terms)
The process of detecting, responding to, and resolving system incidents or outages.
The complete lifecycle of how organizations prevent, detect, respond to, and learn from incidents.
A meeting to analyze what happened during an incident and identify improvements.
A post-incident analysis that focuses on system and process failures rather than individual blame.
A step-by-step guide for handling specific operational tasks or incidents.
A comprehensive guide containing strategies, procedures, and best practices for handling a broad class of scenarios.
Predefined rules for when and how to escalate incidents to additional resources or management.
A systematic method for identifying the underlying causes of problems or incidents.
A collaboration model that connects people, tools, and scripts into a transparent workflow, usually inside a chat platform such as Slack or Microsoft Teams.
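The idea of predefined escalation rules, described above, can be expressed as data. This is a hedged sketch only: the step thresholds, the target names, and the `targets_to_notify` helper are all hypothetical, and real on-call tools model escalation in their own ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes the alert has gone unacknowledged
    notify: str         # hypothetical target to page at this step


# Hypothetical policy: widen the audience the longer an alert stays unacknowledged.
POLICY = [
    EscalationStep(after_minutes=0, notify="primary on-call"),
    EscalationStep(after_minutes=10, notify="secondary on-call"),
    EscalationStep(after_minutes=30, notify="engineering manager"),
]


def targets_to_notify(minutes_unacknowledged, policy=POLICY):
    """Return every target whose threshold has been reached, in policy order."""
    return [
        step.notify
        for step in policy
        if minutes_unacknowledged >= step.after_minutes
    ]
```

For example, `targets_to_notify(15)` returns the primary and secondary on-call targets, because the 30-minute threshold has not yet been reached.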
Concepts (17 terms)
Desensitization caused by excessive or low-quality alerts, leading to missed critical alerts.
Operational work that tends to be manual, repetitive, and automatable.
A target reliability threshold for a service, typically expressed as a percentage over a time period.
A contractual commitment to customers regarding service performance and availability.
A measurable metric that indicates service performance, used to track SLOs.
The amount of unreliability a service can have before violating its SLO.
The practice of intentionally injecting failures into systems to build resilience.
A scheduled practice session where teams simulate incidents to test response procedures.
A dedicated physical or virtual space where incident responders coordinate during major incidents.
The measure of how well internal states of a system can be inferred from knowledge of its external outputs.
The process of collecting, analyzing, and using data to track the health of applications and infrastructure.
The probability that a system will function correctly under stated conditions for a specified period.
The proportion of time a system is operational and accessible.
The use of technology to perform tasks with reduced human assistance.
A cultural philosophy that combines software development (Dev) and IT operations (Ops) to shorten the systems development life cycle.
A public or private dashboard that communicates the current health of services to users.
The four key metrics that represent the health of a system: Latency, Traffic, Errors, and Saturation.
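The reliability target, error budget, and availability definitions in this category are linked by simple arithmetic. The sketch below assumes a time-based availability target measured in minutes of downtime over a rolling window. The function names and the example figures are illustrative, not taken from any particular tool.

```python
def window_minutes(window_days):
    """Total minutes in the measurement window."""
    return window_days * 24 * 60


def error_budget_minutes(slo_percent, window_days):
    """Downtime allowed in the window before the target is violated."""
    return window_minutes(window_days) * (100 - slo_percent) / 100


def availability_percent(downtime_minutes, window_days):
    """Proportion of the window during which the service was operational."""
    total = window_minutes(window_days)
    return 100 * (total - downtime_minutes) / total


def budget_remaining_minutes(slo_percent, window_days, downtime_minutes):
    """Error budget left after subtracting observed downtime (negative if overspent)."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


# A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(99.9, 30)

# 21.6 minutes of downtime in that window is roughly 99.95% availability
# and leaves about half of the budget unspent.
observed = availability_percent(21.6, 30)
remaining = budget_remaining_minutes(99.9, 30, 21.6)
```

Results are floating-point values, so compare them with a tolerance rather than exact equality. Request-based targets (good requests divided by total requests) follow the same pattern with counts in place of minutes.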