Site Reliability Engineering
A discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.
"Hope is not a Strategy"
Site Reliability Engineering (SRE) is what happens when you ask a software engineer to design an operations function. Coined by Google, it replaces the traditional "SysAdmin" model with a focus on automation, reliability, and scale.
Core Principles
- Embrace Risk: Targeting 100% uptime is prohibitively expensive and leaves no room to ship changes. Use Error Budgets to balance speed and stability.
- Eliminate Toil: If a human has to push a button 10 times a day, write a script to do it.
- Monitor Distributed Systems: Focus on symptoms (latency, errors) rather than potential causes (CPU usage).
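As a rough illustration of the Error Budget idea above, the budget can be treated as the number of requests an SLO allows to fail in a given window. The following is a minimal sketch, not a standard API: the class name `ErrorBudget`, its fields, and the request-count framing are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: request-based error budget for one SLO window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served during the window
    failed_requests: int  # requests that counted as failures (errors, timeouts)

    def __post_init__(self) -> None:
        # A target of exactly 1.0 would leave no budget at all.
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")

    @property
    def allowed_failures(self) -> float:
        # The share of traffic the SLO permits to fail, as a request count.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 = budget untouched, 0.0 = fully spent, negative = SLO missed.
        if self.total_requests == 0:
            return 1.0
        return 1.0 - self.failed_requests / self.allowed_failures


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"allowed failures: {budget.allowed_failures:.0f}")
print(f"budget remaining: {budget.remaining_fraction:.0%}")
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget for further releases.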
SRE vs. DevOps
- DevOps is the culture/philosophy ("Break down silos").
- SRE is the implementation ("Here is how we measure reliability").
- Analogy: DevOps is "Chemistry"; SRE is "Chemical Engineering."
Example: The Manual Launch
"A team manually deployed code to 500 servers. It took 6 hours and 3 people."
Impact
Deployments were so painful that they happened only once a month, so finished features sat unreleased for weeks.
Resolution
An SRE wrote a deployment pipeline. Deploys now take 10 minutes and happen 20 times a day automatically.
Why SRE Matters
SRE bridges the gap between development and operations by giving both sides shared, measurable reliability targets.
SRE practices such as SLOs, Error Budgets, and Blameless Postmortems underpin much of modern incident management.
Common Pitfalls
Rebranding Ops
Renaming your Ops team to "SRE" without changing *how* they work (still doing manual tickets).
Ignoring Error Budgets
Tracking SLOs but ignoring them when they burn down.
Gatekeeping
SREs becoming the "Department of No" instead of enabling developers.
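One way to avoid the "Ignoring Error Budgets" pitfall is to turn the budget into an explicit release rule. The sketch below uses the common burn-rate definition (observed error rate divided by the error rate the SLO allows); the function names and the three-way "ship / slow down / freeze" policy are hypothetical choices for illustration, not a prescribed standard.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Speed at which the budget is being spent; 1.0 means exactly on budget."""
    allowed_error_rate = 1.0 - slo_target
    if allowed_error_rate <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / allowed_error_rate


def release_decision(remaining_fraction: float, rate: float) -> str:
    """Hypothetical policy: thresholds here are assumptions, tune per service."""
    if remaining_fraction <= 0:
        return "freeze"      # budget spent: prioritize reliability work
    if rate > 1.0:
        return "slow down"   # spending faster than the SLO window can sustain
    return "ship"


rate = burn_rate(error_rate=0.002, slo_target=0.999)
print(release_decision(remaining_fraction=0.5, rate=rate))
```

The point of the design is that the decision is mechanical: when the budget burns down, the response is agreed in advance rather than negotiated during an incident.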
How to Use SRE
- Reduce Toil: Automate repetitive work to free up engineering time.
- Start with SLOs: Define reliability goals before building solutions.
- Blameless Culture: Focus on system failure, not human error.
Frequently Asked Questions
Do I need a dedicated SRE team?
Not initially. You can practice SRE principles (SLOs, automation) within your product teams.
Is SRE just a new name for Ops?
No. Traditional Ops closes tickets. SREs write code to close tickets forever.
What is Toil?
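Manual, repetitive, automatable work that grows with the service and leaves no lasting improvement behind. SREs aim to keep toil below 50% of their time so the rest can go to engineering work.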
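To make the 50% figure actionable, a team can measure toil from its own time logs. This is a minimal sketch under assumed inputs: the `(engineer, hours, is_toil)` tuple format and the function names are invented for this example, and how work gets labeled as toil is left to the team.

```python
from collections import defaultdict

TOIL_CAP = 0.50  # the 50% ceiling discussed above


def toil_fraction(entries):
    """Return each engineer's share of logged hours that was toil.

    entries: iterable of (engineer, hours, is_toil) tuples (assumed format).
    """
    total_hours = defaultdict(float)
    toil_hours = defaultdict(float)
    for engineer, hours, is_toil in entries:
        total_hours[engineer] += hours
        if is_toil:
            toil_hours[engineer] += hours
    return {
        name: toil_hours[name] / total_hours[name]
        for name in total_hours
        if total_hours[name] > 0
    }


def over_cap(entries, cap=TOIL_CAP):
    """Names of engineers whose toil share exceeds the cap, sorted."""
    return sorted(name for name, share in toil_fraction(entries).items() if share > cap)


week = [
    ("ana", 30, True), ("ana", 10, False),
    ("raj", 10, True), ("raj", 30, False),
]
print(toil_fraction(week))
print(over_cap(week))
```

Anyone who lands over the cap is a signal to schedule automation work, not a performance judgment.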