LEARNING RESOURCES

Learn DevOps & SRE Concepts.

Essential incident management and reliability engineering terms explained. From MTTR to SLOs, master the concepts that power elite engineering teams.

Complete Glossary

41 terms across 5 categories

Concepts (17 terms)

Alert Fatigue

Desensitization of responders caused by excessive or low-quality alerts, which leads to slower responses and missed critical alerts.

Toil

Operational work that is manual, repetitive, and automatable, provides no enduring value, and scales linearly as the service grows.

SLO
Service Level Objective

A target reliability threshold for a service, typically expressed as a percentage over a time period.

SLA
Service Level Agreement

A contractual commitment to customers regarding service performance and availability.

SLI
Service Level Indicator

A quantitative measure of service performance, such as request success rate or latency, used to track progress against SLOs.
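
As a rough illustration, a ratio-style SLI is often computed as good events divided by total events and then compared against an SLO target. The sketch below is one way to express that; the function name `compute_sli`, the request counts, and the 99.9% target are all assumptions made up for the example.

```python
def compute_sli(good_events: int, total_events: int) -> float:
    """Return the fraction of events that were good, between 0.0 and 1.0."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


# Hypothetical numbers, for illustration only.
sli = compute_sli(good_events=999_500, total_events=1_000_000)
slo_target = 0.999  # assumed SLO: 99.9% of requests succeed

print(f"SLI: {sli:.4%}")
print("SLO met" if sli >= slo_target else "SLO missed")
```

Counting events rather than time is a design choice; a time-based SLI (good minutes over total minutes) follows the same ratio.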

Error Budget

The amount of unreliability a service can accumulate over a given period before violating its SLO, equal to 100% minus the SLO target.
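
To make the arithmetic concrete, here is a minimal sketch that converts a time-based SLO target into minutes of allowed downtime. The function name `error_budget_minutes`, the 30-day window, and the downtime figure are hypothetical values chosen for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for a time-based SLO over the window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


# Hypothetical numbers, for illustration only.
budget = error_budget_minutes(slo_target=0.999, window_days=30)
downtime_so_far = 12.0  # assumed minutes of downtime already spent this window
remaining = budget - downtime_so_far

print(f"Budget: {budget:.1f} min, remaining: {remaining:.1f} min")
```

Under these assumptions, a 99.9% target over 30 days allows roughly 43.2 minutes of downtime.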

Chaos Engineering

The practice of intentionally injecting failures into systems to build resilience.

Game Day

A scheduled practice session where teams simulate incidents to test response procedures.

War Room

A dedicated physical or virtual space where incident responders coordinate during major incidents.

Observability

A measure of how well the internal state of a system can be inferred from its external outputs, such as logs, metrics, and traces.

Monitoring

The process of collecting, analyzing, and using data to track the health of applications and infrastructure.

Reliability

The probability that a system will function correctly under stated conditions for a specified period.

Availability

The proportion of time a system is operational and accessible.
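
One common way to turn this definition into a number is uptime divided by total observed time. The sketch below assumes that formulation; the function name `availability` and the 30-day window with 90 minutes of downtime are hypothetical.

```python
def availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Return the fraction of observed time the system was up (0.0 to 1.0)."""
    if uptime_seconds < 0 or downtime_seconds < 0:
        raise ValueError("durations must be non-negative")
    total = uptime_seconds + downtime_seconds
    if total == 0:
        raise ValueError("total observed time must be positive")
    return uptime_seconds / total


# Hypothetical numbers: a 30-day window with 90 minutes of downtime.
window_seconds = 30 * 24 * 60 * 60
downtime = 90 * 60
ratio = availability(window_seconds - downtime, downtime)

print(f"Availability: {ratio:.3%}")
```

Whether planned maintenance counts as downtime is a policy decision that this sketch leaves to the caller.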

Incident Automation

The use of tooling to perform incident response tasks, such as paging responders or posting status updates, with reduced human involvement.

DevOps

A cultural philosophy that combines software development (Dev) and IT operations (Ops) to shorten the systems development life cycle.

Status Page

A public or private dashboard that communicates the current health of services to users.

Four Golden Signals
The Four Golden Signals

The four key metrics that represent the health of a system: Latency, Traffic, Errors, and Saturation.
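
As an informal sketch, the four signals can be summarized from a window of request samples plus one resource reading. Everything named here is an assumption for illustration: the `Request` record, the `golden_signals` function, the choice of 95th-percentile latency, and the use of CPU utilization as the saturation measure.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    latency_ms: float
    ok: bool


def golden_signals(requests: list[Request], window_seconds: float,
                   cpu_utilization: float) -> dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    if len(requests) < 2 or window_seconds <= 0:
        raise ValueError("need at least two requests and a positive window")
    latencies = [r.latency_ms for r in requests]
    failed = sum(1 for r in requests if not r.ok)
    return {
        # Last of 19 cut points is the 95th percentile; "inclusive" keeps
        # the estimate within the observed range.
        "latency_p95_ms": quantiles(latencies, n=20, method="inclusive")[-1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failed / len(requests),
        "saturation": cpu_utilization,  # assumed proxy for saturation
    }


# Hypothetical samples, for illustration only.
samples = [Request(120.0, True), Request(95.0, True),
           Request(310.0, False), Request(101.0, True)]
for name, value in golden_signals(samples, 2.0, 0.6).items():
    print(f"{name}: {value:.3f}")
```

In practice these values usually come from a metrics system rather than raw request lists; the sketch only shows how the four quantities relate to the underlying data.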
