MONITORING & OBSERVABILITY

The Four Golden Signals

The four key metrics that represent the health of a system: Latency, Traffic, Errors, and Saturation.

"The Vital Signs"

Just as a doctor checks heart rate and blood pressure, an SRE checks the Four Golden Signals.

1. Latency

The time it takes to service a request.

  • Tip: Distinguish between latency for successful requests and latency for failed requests; mixing them can hide what users actually experience.
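
To make the tip concrete, here is a minimal sketch that splits latency by outcome before summarizing it. The request records and the `split_latencies` helper are hypothetical, made up for illustration; a real system would read these values from its own metrics pipeline.

```python
from statistics import median

# Hypothetical request records: (latency in ms, whether the request succeeded).
requests = [
    (120, True), (95, True), (110, True), (130, True),
    (5, False), (4, False),   # fast failures, e.g. rejected immediately
    (3000, False),            # slow failure, e.g. a timeout
]

def split_latencies(records):
    """Return (success_latencies, failure_latencies) as two lists."""
    ok = [ms for ms, succeeded in records if succeeded]
    failed = [ms for ms, succeeded in records if not succeeded]
    return ok, failed

ok, failed = split_latencies(requests)
combined = [ms for ms, _ in requests]

print("median success latency (ms):", median(ok))
print("median failure latency (ms):", median(failed))
print("median combined latency (ms):", median(combined))
```

In this made-up sample the fast failures pull the combined median below the median that successful requests actually see, which is the kind of distortion the tip warns about.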

2. Traffic

A measure of how much demand is being placed on your system.

  • Web: Requests per second (RPS).
  • Audio: Concurrent streams.
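
As a rough illustration of the web case, requests per second can be derived by bucketing request arrival times into whole seconds. The arrival times and the `requests_per_second` helper below are assumptions for the sketch, not part of any particular monitoring tool.

```python
from collections import Counter

def requests_per_second(timestamps):
    """Count requests in each whole-second bucket (timestamps are in seconds)."""
    return Counter(int(ts) for ts in timestamps)

# Hypothetical arrival times, in seconds since the start of the window.
arrivals = [0.1, 0.4, 0.9, 1.2, 1.3, 2.0, 2.5, 2.6, 2.7]

rps = requests_per_second(arrivals)
peak_rps = max(rps.values())
# Averages over seconds that saw at least one request; idle seconds are not counted.
average_rps = len(arrivals) / len(rps)

print("per-second counts:", dict(rps))
print("peak RPS:", peak_rps)
print("average RPS:", average_rps)
```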

3. Errors

The rate of requests that fail.

  • Explicit: HTTP 500s.
  • Implicit: HTTP 200s with "Success: False" body (content errors).
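
The sketch below shows one way an error rate might count both kinds of failure. It assumes a JSON response body with a hypothetical `success` field standing in for the "Success: False" body mentioned above; the field name, the `is_error` helper, and the sample responses are illustrative only.

```python
import json

def is_error(status_code, body):
    """Return True for explicit (5xx) or implicit (failed payload) errors."""
    if status_code >= 500:
        return True  # explicit: the server reported a failure
    if status_code == 200:
        try:
            payload = json.loads(body)
        except ValueError:
            return False  # body is not JSON, so this sketch cannot judge it
        # implicit: transport succeeded but the content says it failed
        return isinstance(payload, dict) and payload.get("success") is False
    return False  # how to treat other codes (e.g. 4xx) is a policy choice

# Hypothetical (status code, body) pairs.
responses = [
    (200, '{"success": true}'),
    (200, '{"success": false}'),
    (500, ""),
    (200, '{"success": true}'),
]

failed = sum(is_error(code, body) for code, body in responses)
error_rate = failed / len(responses)
print("error rate:", error_rate)
```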

4. Saturation

How "full" your service is.

  • CPU usage, Memory, Disk I/O.
  • Many systems degrade before they reach 100% utilization, so watch the most constrained resources and set utilization targets low enough to alert before latency spikes.
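
A utilization target can be expressed as a simple per-resource threshold check. The target values, resource names, and `over_target` helper below are hypothetical; suitable thresholds depend on how each resource behaves under load in a given system.

```python
# Hypothetical utilization targets, as a fraction of capacity.
TARGETS = {"cpu": 0.80, "memory": 0.85, "disk_io": 0.70}

def over_target(utilization, targets=TARGETS):
    """Return the names of resources whose utilization exceeds its target."""
    return sorted(
        name
        for name, used in utilization.items()
        if name in targets and used > targets[name]
    )

# Sample reading: CPU and memory look healthy, but disk I/O is fully used.
sample = {"cpu": 0.35, "memory": 0.40, "disk_io": 1.00}
saturated = over_target(sample)
print("resources over target:", saturated)
```

Checking every constrained resource, not just CPU and memory, is what lets a reading like this one surface a saturated disk.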

Example: The Slow Disk

A service was slow, but CPU and memory usage were low, and no error alerts were firing.

Impact
Latency increased from 100ms to 2s.

Resolution
The team checked "Saturation" and found Disk I/O was at 100%. A logging process was spamming the disk. They throttled the logger, and latency recovered.

Why the Four Golden Signals Matter

Popularized by Google's Site Reliability Engineering book, these signals give you a high-level view of the health of a user-facing system.

Monitoring these four signals gives broad coverage for user-facing incidents, especially when alerts focus on symptoms and imminent saturation.

Common Pitfalls

Averages
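Don't rely on "Average Latency" alone. Monitor a high percentile such as "p99 Latency". Averages hide outliers: a small fraction of very slow requests barely moves the mean, yet those are the requests users notice.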
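
A small worked example shows how far apart the two numbers can be. The latency sample is invented, and `percentile` is a hypothetical helper using the nearest-rank method; monitoring systems often estimate percentiles differently (for example from histogram buckets), so treat this as a sketch of the idea only.

```python
import math
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical sample: 98 requests at 100 ms and 2 requests at 2000 ms.
latencies_ms = [100] * 98 + [2000] * 2

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

print("average (ms):", avg)
print("p50 (ms):", p50)
print("p99 (ms):", p99)
```

Here the average stays close to the typical request while p99 lands on the slow outliers, so an alert on the average alone would understate what the slowest requests experience.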

How to Use the Four Golden Signals

  • Latency: Time it takes to service a request.
  • Traffic: Demand on your system (e.g., RPS).
  • Errors: Rate of failed requests.
  • Saturation: How "full" your service is.
