
Subject Matter Expert (SME)

The technical specialist responsible for diagnosing and fixing the specific service or component causing the incident.

The Hands on the Keyboard

If the Incident Commander is the conductor, the Subject Matter Expert (SME) is the soloist. They are the engineer who knows exactly why the Redis queue is backing up or why the load balancer is throwing 502s.

Responsibilities

  • Diagnosis: finding the root cause.
  • Mitigation: stopping the bleeding (e.g., adding capacity, rolling back).
  • Communication: telling the IC what they are doing before they do it.

How to be a Great SME

The best SMEs don't just fix things; they communicate.

  • Bad SME: Goes silent for 20 minutes, then says "Fixed it."
  • Good SME: "I suspect a bad migration. I am going to verify the schema version. (5 mins later) Verified. I am requesting permission to roll back."

One Area of Focus

An SME should focus on one thing. If the incident spans multiple services (e.g., Database + Frontend), you need multiple SMEs. The IC coordinates them; the SMEs fix the systems.

Example: The Hero Trap

"A senior engineer tried to fix a complex cascading failure alone. They worked for 12 hours straight without sleep."

Impact
Fatigue led to a typo that deleted the backups. The outage was extended by 24 hours.
Resolution
The team implemented a rule: SMEs must rotate every 4 hours during SEV1s. No heroes allowed.
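The rotation rule is simple arithmetic, but writing the shifts down at the start of a SEV1, before anyone is tired, makes it easier to enforce. Below is a minimal sketch in Python. It assumes a fixed 4-hour cap and a list of at least two SMEs who know the affected service. `rotation_schedule` and `MAX_SHIFT` are hypothetical names, not part of any incident-management tool.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Assumed policy from the rule above: no SME works more than 4 hours straight during a SEV1.
MAX_SHIFT = timedelta(hours=4)


def rotation_schedule(
    incident_start: datetime, incident_end: datetime, smes: List[str]
) -> List[Tuple[str, datetime, datetime]]:
    """Split the incident into back-to-back shifts of at most MAX_SHIFT, cycling through smes."""
    if len(smes) < 2:
        # With a single SME, the "rotation" would hand the keyboard back to the same tired person.
        raise ValueError("rotation needs at least two SMEs")
    shifts = []
    start = incident_start
    i = 0
    while start < incident_end:
        # The last shift is cut short if the incident ends before the 4-hour cap.
        end = min(start + MAX_SHIFT, incident_end)
        shifts.append((smes[i % len(smes)], start, end))
        start = end
        i += 1
    return shifts
```

For a 10-hour SEV1 that starts at 09:00 with `["alice", "bob"]`, this returns three shifts: alice from 09:00 to 13:00, bob from 13:00 to 17:00, and alice again from 17:00 to 19:00. Refusing to schedule a single SME is a deliberate choice. If only one person knows the system, the IC needs to find backup rather than rely on one hero.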

Why the Subject Matter Expert Matters

SMEs have the deep context that the IC lacks.

They are the only ones allowed to touch production during an incident.

They translate "it's broken" into "the connection pool is exhausted".

Common Pitfalls

Rabbit Holes
Spending 2 hours investigating a "weird log" that is unrelated to the outage.
Silent Fixing
Fixing the issue without telling anyone, so no one knows *why* it recovered.
Panic Typing
Running commands without checking the runbook first.

How to Work as a Subject Matter Expert

  • Think Aloud: Narrate your actions so the Scribe can record them.
  • Ask for Permission: Before running a destructive command, clear it with the IC.
  • Hand Off: If you are stuck for 30 mins, swap with another SME.
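These three habits can be sketched as a tiny action log. This is a hypothetical illustration in Python, not a real incident-management API. `SMEActionLog`, `narrate`, `request_destructive`, and `should_hand_off` are invented names. IC approval is modeled as a plain boolean, and "progress" means whatever the SME explicitly marks as progress.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# The 30-minute hand-off guideline; treat it as a tunable assumption.
STUCK_LIMIT = timedelta(minutes=30)


@dataclass
class SMEActionLog:
    sme: str
    entries: List[Tuple[datetime, str]] = field(default_factory=list)
    last_progress: Optional[datetime] = None

    def narrate(self, now: datetime, message: str, progress: bool = False) -> None:
        """Think aloud: record what you are doing so the Scribe can copy it into the timeline."""
        self.entries.append((now, f"{self.sme}: {message}"))
        # The first narration starts the clock; after that, only explicit progress resets it.
        if progress or self.last_progress is None:
            self.last_progress = now

    def request_destructive(self, now: datetime, command: str, ic_approved: bool) -> bool:
        """Ask for permission: log the request, and report whether the IC allowed it."""
        self.narrate(now, f"requesting IC approval to run: {command}")
        if not ic_approved:
            self.narrate(now, "IC declined; not running it")
            return False
        self.narrate(now, f"IC approved; running: {command}")
        return True

    def should_hand_off(self, now: datetime) -> bool:
        """Hand off: True once STUCK_LIMIT has passed without recorded progress."""
        return self.last_progress is not None and now - self.last_progress >= STUCK_LIMIT
```

The log only records and advises. It does not run anything itself, so the actual command and the final decision to swap SMEs stay with the people on the call.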

Frequently Asked Questions

Can the IC override the SME?
Yes. The IC decides *strategy* (e.g., "prioritize data safety over uptime"), but they usually defer to the SME on *tactics* ("how to query the DB").
Do I need to be a Senior Engineer to be an SME?
No. You just need to know the service better than anyone else on the call.
What if I don't know the answer?
Say "I don't know, I need help." The IC will find you backup. Hiding ignorance causes extended outages.
