MONITORING & OBSERVABILITY

Service Level Agreement

A contractual commitment to customers regarding service performance and availability.

SLA = Uptime Target (e.g., 99.9%)

The "Contract"

SLA (Service Level Agreement) is a business/legal contract. It says: "If we are not reliable, we will pay you money."

SLA vs. SLO

  • SLO (Internal): "We want to be up 99.9% of the time." (Result: Team unhappy if missed).
  • SLA (External): "We promise to be up 99.5% of the time." (Result: Lawyers involved if missed).

The "Buffer Zone"

Wise teams set their SLO higher than their SLA.

  • SLA: 99.5% (Allows 3.6 hours downtime/month).
  • SLO: 99.9% (Allows 43 mins downtime/month).
  • Gap: This buffer ensures you get alerted and fix issues before you owe customers a refund.
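The downtime figures above follow directly from the percentages. A minimal sketch of the arithmetic, assuming a 30-day month (the function name `allowed_downtime_minutes` is illustrative, not from any library):

```python
def allowed_downtime_minutes(target_percent: float, period_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    total_minutes = period_days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100.0 - target_percent) / 100.0


sla_budget = allowed_downtime_minutes(99.5)  # external promise
slo_budget = allowed_downtime_minutes(99.9)  # internal target
buffer_minutes = sla_budget - slo_budget     # room to react before the SLA is at risk

print(f"SLA 99.5%: {sla_budget / 60:.1f} hours/month")
print(f"SLO 99.9%: {slo_budget:.1f} minutes/month")
print(f"Buffer:    {buffer_minutes:.1f} minutes/month")
```

With these targets, a team that has used up its SLO budget still has close to three hours of downtime left before the SLA itself is breached.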

Example: The Cloud Provider SLA

The AWS EC2 SLA commits to 99.99% monthly uptime.

Impact
If monthly uptime falls below 99.0%, customers are eligible for a 30% service credit.
Resolution
This financial penalty gives AWS a strong incentive to invest heavily in redundancy and region isolation.
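To see what a service credit means in money terms, here is a rough sketch of a credit calculation. The only tier used is the one quoted above (below 99.0% uptime, 30% credit); the function name, the tier-table shape, and the $1,000 monthly bill are hypothetical, and a real SLA will usually define several tiers and its own claim rules.

```python
def service_credit_percent(measured_uptime: float,
                           tiers: list[tuple[float, float]]) -> float:
    """Return the credit percentage owed for a measured monthly uptime.

    tiers holds (uptime_threshold, credit_percent) pairs. The credit of the
    lowest threshold that the measured uptime falls below is applied.
    """
    credit = 0.0
    # Walk from the highest threshold down, so the last match is the lowest one.
    for threshold, tier_credit in sorted(tiers, reverse=True):
        if measured_uptime < threshold:
            credit = tier_credit
    return credit


# Single tier taken from the example above; real agreements define more.
example_tiers = [(99.0, 30.0)]
monthly_bill = 1000.0  # hypothetical bill in dollars

credit_pct = service_credit_percent(98.5, example_tiers)
credit_amount = monthly_bill * credit_pct / 100.0
print(f"Uptime 98.5%: {credit_pct:.0f}% credit, ${credit_amount:.2f} off the bill")
```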

Why SLA Matters

SLAs create legal and financial obligations. Missing SLAs damages revenue and trust.

Internal SLOs should be stricter than customer SLAs to provide buffer.

SLA vs. Other Metrics

  • SLA: Customer promise.
  • SLO: Internal target.
  • SLI: Measured metric.
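The three terms fit together as measurement, target, and promise. The sketch below assumes a request-based availability SLI (successful requests divided by total requests), which is one common choice rather than the only definition, and reuses the 99.9% SLO and 99.5% SLA from earlier. All names are illustrative.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Measured availability as a percentage of successful requests."""
    if total_requests == 0:
        return 100.0  # assumption: no traffic counts as fully available
    return 100.0 * good_requests / total_requests


def evaluate(sli: float, slo: float = 99.9, sla: float = 99.5) -> str:
    """Compare the measured SLI against the internal SLO and external SLA."""
    if sli < sla:
        return "SLA breached"
    if sli < slo:
        return "SLO missed, SLA intact"
    return "within SLO"


for good in (99_950, 99_700, 99_000):
    sli = availability_sli(good, 100_000)
    print(f"SLI {sli:.2f}%: {evaluate(sli)}")
```

The middle case is the buffer zone at work: the team has missed its own target and should act, but no customer commitment has been broken yet.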

Common Pitfalls

Promising 100%
Avoid promising "100%" in a contract unless Legal and Engineering have explicitly accepted the risk.
Internal SLAs
Use SLOs for internal teams. SLAs imply contractual penalties, which usually make more sense for customer commitments than internal reliability goals.
