Learn DevOps & SRE Concepts.

Essential incident management and reliability engineering terms explained. From MTTR to SLOs, master the concepts that power elite engineering teams.

Complete Glossary

57 terms across 9 categories

Core Metrics (9 terms)

MTTR

Mean Time To Resolution

The average time it takes to fully resolve an incident from detection to service restoration.

MTTA

Mean Time To Acknowledge

How fast your team responds when something breaks. The clock starts when an alert fires and stops when someone takes ownership.

MTTD

Mean Time To Detect

The average time from when an issue occurs to when an alert fires.
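As a rough illustration of how MTTD, MTTA, and MTTR relate to one incident timeline, the sketch below averages the three intervals over a small set of incidents. The `Incident` fields and the sample timestamps are hypothetical, and "detection" is assumed to mean the moment the alert fires.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical field names; real tools label these timestamps differently.
    occurred_at: datetime
    alert_fired_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_minutes(deltas):
    """Average a series of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def incident_metrics(incidents):
    """Return MTTD, MTTA, and MTTR in minutes for a list of incidents."""
    return {
        # Issue occurs -> alert fires
        "mttd": mean_minutes(i.alert_fired_at - i.occurred_at for i in incidents),
        # Alert fires -> someone takes ownership
        "mtta": mean_minutes(i.acknowledged_at - i.alert_fired_at for i in incidents),
        # Detection (assumed: alert fires) -> service restored
        "mttr": mean_minutes(i.resolved_at - i.alert_fired_at for i in incidents),
    }


incidents = [
    Incident(
        occurred_at=datetime(2024, 1, 1, 10, 0),
        alert_fired_at=datetime(2024, 1, 1, 10, 4),
        acknowledged_at=datetime(2024, 1, 1, 10, 6),
        resolved_at=datetime(2024, 1, 1, 10, 34),
    ),
    Incident(
        occurred_at=datetime(2024, 1, 5, 14, 0),
        alert_fired_at=datetime(2024, 1, 5, 14, 6),
        acknowledged_at=datetime(2024, 1, 5, 14, 10),
        resolved_at=datetime(2024, 1, 5, 15, 36),
    ),
]

metrics = incident_metrics(incidents)
print(metrics)  # {'mttd': 5.0, 'mtta': 3.0, 'mttr': 60.0}
```

Teams that start the MTTR clock at a different point (for example, when the issue first occurs) would change only the subtraction on the `mttr` line.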

MTBF

Mean Time Between Failures

The average time between system failures or incidents.
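One common way to estimate MTBF is to divide operating time by the number of failures in the same window. The sketch below assumes that convention; the function name and the sample figures are illustrative only.

```python
def mtbf_hours(total_hours, downtime_hours, failure_count):
    """Estimate MTBF as operating time divided by the number of failures."""
    if failure_count <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    operating_hours = total_hours - downtime_hours
    return operating_hours / failure_count


# A 30-day window (720 hours) with 3 failures and 6 hours of total downtime.
result = mtbf_hours(total_hours=720, downtime_hours=6, failure_count=3)
print(result)  # 238.0
```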

SLO

Service Level Objective

A target reliability threshold for a service, typically expressed as a percentage over a time period.

SLI

Service Level Indicator

A measurable metric that indicates service performance, used to track SLOs.

Error Budget

The amount of unreliability a service can have before violating its SLO.
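For a time-based availability SLO, the error budget can be worked out as the fraction of the window that the SLO allows to be "bad". The sketch below assumes the SLO is an availability percentage and the budget is spent in minutes of downtime; the function names are hypothetical.

```python
def error_budget_minutes(slo_percent, window_days):
    """Minutes of allowed unavailability for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (100 - slo_percent) / 100 * window_minutes


def budget_remaining_minutes(slo_percent, window_days, downtime_minutes):
    """Budget left after subtracting downtime already incurred in the window."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


budget = error_budget_minutes(99.9, 30)
remaining = budget_remaining_minutes(99.9, 30, downtime_minutes=10)
print(f"{budget:.1f}")     # 43.2
print(f"{remaining:.1f}")  # 33.2
```

A negative remaining budget would mean the SLO has already been violated for that window.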

Downtime

System Downtime

The period of time during which a system or service is unavailable or failing to perform its primary function.

Uptime

System Uptime

The percentage of time that a system is fully operational and available to users.
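As a minimal sketch, uptime can be computed from the length of the measurement window and the downtime recorded within it. The function name and sample numbers are assumptions for illustration.

```python
def uptime_percent(total_minutes, downtime_minutes):
    """Percentage of the window during which the system was operational."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return (total_minutes - downtime_minutes) / total_minutes * 100


# 30 days is 43,200 minutes; 43.2 minutes of downtime is roughly "three nines".
uptime = uptime_percent(total_minutes=43_200, downtime_minutes=43.2)
print(f"{uptime:.2f}%")  # 99.90%
```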

Roles (5 terms)

Incident Commander

Incident Commander (IC)

The person responsible for all high-level coordination and decision-making during an incident.

Communication Lead

Communication Lead (Comms)

The person responsible for all internal and external communication during an incident.

Incident Scribe

Incident Scribe (Scribe)

The person responsible for accurately documenting the timeline, actions, and decisions during an incident.

Subject Matter Expert

Subject Matter Expert (SME)

The technical specialist responsible for diagnosing and fixing the specific service or component causing the incident.

SRE

Site Reliability Engineering

A discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.

Severity (7 terms)

Incident Severity Matrix

Incident Severity Matrix (SEV/P0 Format)

A standardized framework used to classify the impact and urgency of an incident, ensuring the right response team is engaged at the right time.

SEV0

Severity 0 (Critical)

Critical emergency. The system is completely unusable for a significant portion of users, or data integrity is at risk.

SEV1

Severity 1 (High)

Major incident. Significant functionality is broken or degraded, but a workaround may exist or the impact is partial.

SEV2

Severity 2 (Medium)

Degraded performance or minor functionality broken for some users. Workarounds may exist.

SEV3

Severity 3 (Low)

Minor bug or confusing non-critical issue. No immediate user impact.

SEV4

Severity 4 (Trivial)

Cosmetic issues, typos, or internal-only problems. Zero user impact.

Priority

Incident Priority Levels: P0-P4 and SEV0-SEV4

A classification system (P0-P4 or SEV0-SEV4) that determines the urgency and response time required for an incident.

Processes (7 terms)

Incident Response

The process of detecting, responding to, and resolving system incidents or outages.

Post-Incident Review

A meeting to analyze what happened during an incident and identify improvements.

Blameless Postmortem

A post-incident analysis that focuses on system and process failures rather than individual blame.

Runbook

A step-by-step guide for handling specific operational tasks or incidents.

Playbook

A comprehensive guide containing strategies, procedures, and best practices for handling scenarios.

Escalation Policy

Predefined rules for when and how to escalate incidents to additional resources or management.
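An escalation policy is often modeled as an ordered list of steps, each with a delay after which the next group is notified if nobody has acknowledged the alert. The sketch below is one possible representation; the step names and the 15- and 30-minute thresholds are made-up values, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes without acknowledgement before this step triggers
    notify: str         # who gets notified at this step (hypothetical labels)


# Hypothetical three-tier policy.
POLICY = [
    EscalationStep(after_minutes=0, notify="primary on-call"),
    EscalationStep(after_minutes=15, notify="secondary on-call"),
    EscalationStep(after_minutes=30, notify="engineering manager"),
]


def targets_to_notify(policy, minutes_unacknowledged):
    """Return everyone who should have been notified by this point."""
    return [
        step.notify
        for step in policy
        if step.after_minutes <= minutes_unacknowledged
    ]


notified = targets_to_notify(POLICY, minutes_unacknowledged=20)
print(notified)  # ['primary on-call', 'secondary on-call']
```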

Incident Automation

The use of technology to perform incident response tasks with reduced human involvement.

On-Call Management (6 terms)

Root Cause Analysis

A systematic method for identifying the underlying causes of problems or incidents.

On-Call

The practice of designating specific team members to be available to respond to urgent system issues outside of standard working hours.

On-Call Schedule

On-Call Rotations & Schedules

A roster that determines which engineer is responsible for responding to incidents at any given time.

On-Call Responder

The engineer currently designated to receive and act upon system alerts and incidents.

Handoff

On-Call Handoff

The structured process of transferring incident context, active alerts, and duties from one on-call engineer to the next.

Follow-the-Sun

Follow-the-Sun Rotation

A global on-call scheduling model where shifts are assigned to teams in active time zones to avoid night shifts.
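A follow-the-sun rotation can be sketched as a lookup from the current UTC hour to the region whose working day covers it. The three regions and the 8-hour boundaries below are assumptions chosen for illustration; real schedules depend on where the teams actually sit.

```python
from datetime import datetime, timezone

# Hypothetical shift table: (start hour UTC, end hour UTC, region on call).
SHIFTS = [
    (0, 8, "APAC"),
    (8, 16, "EMEA"),
    (16, 24, "Americas"),
]


def on_call_region(now):
    """Return the region on call for a timezone-aware datetime."""
    hour = now.astimezone(timezone.utc).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    raise ValueError(f"no shift covers hour {hour}")


region = on_call_region(datetime(2024, 3, 1, 9, 30, tzinfo=timezone.utc))
print(region)  # EMEA
```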

Monitoring & Observability (5 terms)

SLA

Service Level Agreement

A contractual commitment to customers regarding service performance and availability.

Observability

The measure of how well internal states of a system can be inferred from knowledge of its external outputs.

Monitoring

The process of collecting, analyzing, and using data to track the health of applications and infrastructure.

Availability

The proportion of time a system is operational and accessible.

Four Golden Signals

The Four Golden Signals

The four key metrics that represent the health of a system: Latency, Traffic, Errors, and Saturation.

Incident Response (8 terms)

Incident Management

The complete lifecycle of how organizations prevent, detect, respond to, and learn from incidents.

On-Call Rotation

A schedule that determines which engineer is responsible for answering alerts during a specific time period.

Chaos Engineering

The practice of intentionally injecting failures into systems to build resilience.

Game Day

A scheduled practice session where teams simulate incidents to test response procedures.

War Room

A dedicated physical or virtual space where incident responders coordinate during major incidents.

Triage

Incident Triage

The initial phase of incident response where the severity, impact, and required expertise are determined.

Incident Lifecycle

The Incident Management Lifecycle

The end-to-end journey of an incident from the moment it occurs until the post-incident review is completed.

Resolution

Incident Resolution

The point in the incident lifecycle where the service is restored to full functionality for the customer.

Engineering Practices (5 terms)

Toil

Operational work that tends to be manual, repetitive, and automatable.

Reliability

The probability that a system will function correctly under stated conditions for a specified period.

DevOps

A cultural philosophy that combines software development (Dev) and IT operations (Ops) to shorten the systems development life cycle.

Status Page

A public or private dashboard that communicates the current health of services to users.

ChatOps

A collaboration model that connects people, tools, and scripts into a transparent workflow, usually inside a chat platform such as Slack or Microsoft Teams.

Health & Well-being (5 terms)

Alert Fatigue

Desensitization caused by excessive or low-quality alerts, leading to missed critical alerts.

On-Call Burnout

Physical and emotional exhaustion caused by frequent sleep interruption, excessive alerts, and the stress of maintaining high availability.

Sustainable On-Call

Sustainable On-Call Culture

An on-call practice that prioritizes human health and long-term team viability alongside system reliability.

Fair Rotations

Fair On-Call Rotations

A scheduling principle ensuring that the burden of on-call duties (including weekends and holidays) is distributed equitably across the team.

On-Call Load Distribution

The practice of measuring and equalizing the amount of time and effort each team member spends on on-call duties.
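One simple way to measure load distribution is to total the on-call hours per engineer and look at the gap between the most and least loaded person. The sketch below assumes shifts are available as `(engineer, hours)` pairs; the names and numbers are invented.

```python
from collections import defaultdict


def on_call_hours(shifts):
    """Sum on-call hours per engineer from (engineer, hours) pairs."""
    totals = defaultdict(float)
    for engineer, hours in shifts:
        totals[engineer] += hours
    return dict(totals)


def load_spread(totals):
    """Difference in hours between the most and least loaded engineer."""
    return max(totals.values()) - min(totals.values())


shifts = [("ana", 48), ("ben", 24), ("ana", 24), ("chen", 72)]
totals = on_call_hours(shifts)
spread = load_spread(totals)
print(totals)  # {'ana': 72.0, 'ben': 24.0, 'chen': 72.0}
print(spread)  # 48.0
```

Hours alone miss the difference between a quiet shift and a noisy one, so a fuller measure might also weight pages received or nights interrupted.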
