The Four Golden Signals
The four key metrics that represent the health of a system: Latency, Traffic, Errors, and Saturation.
"The Vital Signs"
Just as a doctor checks heart rate and blood pressure, an SRE checks the Four Golden Signals.
1. Latency
The time it takes to service a request.
- Tip: Track the latency of successful requests separately from the latency of failed requests. A failed request may return very quickly or very slowly, and either case distorts the overall number.
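The tip above can be illustrated with a minimal sketch. It assumes hypothetical request records of the form (status code, latency in ms) and treats any 5xx status as a failure; the sample values and the function name are invented for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical request records: (HTTP status code, latency in milliseconds).
requests = [
    (200, 95), (200, 110), (200, 105),
    (500, 3),      # a failure that returns almost immediately
    (500, 5000),   # a failure that returns only after a long wait
]

def latency_by_outcome(records):
    """Group latencies into 'success' and 'error' buckets by status code."""
    buckets = defaultdict(list)
    for status, latency_ms in records:
        outcome = "error" if status >= 500 else "success"
        buckets[outcome].append(latency_ms)
    return dict(buckets)

buckets = latency_by_outcome(requests)
for outcome, values in sorted(buckets.items()):
    print(outcome, "median:", median(values), "max:", max(values))
```

Mixing both buckets into one number would make the service look slower or faster than what successful users actually experience.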
2. Traffic
A measure of how much demand is being placed on your system.
- Web: Requests per second (RPS).
- Audio: Concurrent streams.
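For a web service, requests per second can be derived from request arrival times. The following is a rough sketch, assuming hypothetical timestamps measured in seconds and fixed one-second windows; a real monitoring system would typically compute this for you.

```python
from collections import Counter

# Hypothetical request arrival times, in seconds since an arbitrary start.
arrival_times = [0.1, 0.4, 0.9, 1.2, 1.3, 1.8, 1.9, 3.5]

def requests_per_second(timestamps):
    """Count requests in each one-second window, keyed by the window start."""
    return Counter(int(t) for t in timestamps)

rps = requests_per_second(arrival_times)
peak_rps = max(rps.values())

for second in sorted(rps):
    print(f"second {second}: {rps[second]} requests")
print("peak RPS:", peak_rps)
```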
3. Errors
The rate of requests that fail.
- Explicit: HTTP 500s.
- Implicit: HTTP 200s with "Success: False" body (content errors).
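Counting both kinds of failure might look like the sketch below. It assumes, purely for illustration, that response bodies are JSON objects carrying a "Success" field; the function name and sample responses are hypothetical.

```python
import json

def is_failed(status_code, body):
    """Return True for explicit (5xx) and implicit (content) failures."""
    if status_code >= 500:
        return True  # explicit failure: the server reported an error
    try:
        payload = json.loads(body)
    except ValueError:
        return False  # non-JSON bodies are treated as successes in this sketch
    # Implicit failure: HTTP 200, but the body says the operation failed.
    return isinstance(payload, dict) and payload.get("Success") is False

# Hypothetical (status code, body) pairs.
responses = [
    (200, '{"Success": true}'),
    (200, '{"Success": false}'),
    (500, ""),
    (200, "OK"),
]

failed = sum(is_failed(status, body) for status, body in responses)
error_rate = failed / len(responses)
print(f"error rate: {error_rate:.0%}")
```

A monitor that only counts status codes would report one failure here instead of two.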
4. Saturation
How "full" your service is.
- CPU usage, Memory, Disk I/O.
- As saturation approaches 100%, performance degrades rapidly (latency spikes), often before the limit is actually reached.
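Because latency tends to spike as a resource nears its limit, it can help to flag resources before they are completely full. The sketch below assumes hypothetical utilization readings expressed as fractions of capacity and an illustrative 80% warning threshold; neither comes from a real system.

```python
# Hypothetical utilization readings as fractions of capacity (0.0 to 1.0).
readings = {"cpu": 0.35, "memory": 0.40, "disk_io": 1.00}

# Illustrative warning threshold; an assumption, not a recommended value.
WARN_AT = 0.80

def saturated_resources(utilization, warn_at=WARN_AT):
    """Return the resources at or above the warning threshold, fullest first."""
    hot = [(name, value) for name, value in utilization.items() if value >= warn_at]
    return sorted(hot, key=lambda item: item[1], reverse=True)

for name, value in saturated_resources(readings):
    print(f"{name} is at {value:.0%} of capacity")
```

Checking every resource, not only CPU and memory, is the point: the fullest resource is the one most likely to be limiting the service.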
Example: The Slow Disk
"A service was slow, but CPU and Memory were low. No errors were firing."
Impact
Latency increased from 100ms to 2s.
Resolution
The team checked "Saturation" and found Disk I/O was at 100%. A logging process was spamming the disk. They throttled the logger, and latency recovered.
Why the Four Golden Signals Matter
Popularized by Google's Site Reliability Engineering book, these signals give you a high-level view of any system's health.
Monitoring these four signals is often enough to detect most user-facing incidents.
Common Pitfalls
Averages
Don't rely on average latency. Monitor a high percentile such as p99 latency instead, because averages hide outliers.
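A small worked example shows how far the two numbers can diverge. This is a sketch using made-up latencies and a simple nearest-rank percentile; monitoring systems may compute percentiles differently.

```python
from statistics import mean

# Hypothetical sample: 98 fast requests and 2 very slow ones.
latencies_ms = [100] * 98 + [5000] * 2

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value covering pct% of samples."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

print("average:", avg, "ms")
print("p50:", p50, "ms")
print("p99:", p99, "ms")
```

With this sample the average is 198 ms, which looks healthy, while the p99 of 5000 ms reveals that some users wait five seconds.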
How to Use the Four Golden Signals
⏱️ Latency: Time it takes to service a request.
🚦 Traffic: Demand on your system (e.g., RPS).
❌ Errors: Rate of requests that fail.
Frequently Asked Questions
Which signal is most important?
Errors. If users are seeing errors, nothing else matters. Fix that first.