Operations Under Load

Overview

You rotate through incident commander and note-taker roles while instructors inject node cordons and flaky endpoints. The goal is confident language during outages, not heroics.

What is included

Rotating incident roles with timed injects
Live traffic replay against sample microservices
Post-incident review template aligned to the quality standards external reviewers expect
Capacity signal worksheet (latency, saturation, errors)
Pair debugging on kubelet logs
Warm handoff script for daytime crews
Quiet-room option for reflection after heavy drills

Outcomes

Facilitate a fifteen-minute stabilization huddle
Choose between surge upgrade paths with tradeoffs spelled out
Capture evidence an external reviewer can follow

Lead instructor for this track

Marcus Webb

SRE practice coach; collects retro formats from aviation and theater crews.

FAQ

Will we destroy shared infrastructure?

Faults are scoped to disposable namespaces. If a drill escapes its namespace, we snapshot state and rebuild. Participants never owe infra repair hours.

Can we bring our runbooks?

Please do. We annotate them together so the language matches your internal quality standards.

Limitations?

We do not simulate simultaneous control plane loss across multiple regions; that scenario is reserved for private workshops.

Recent learner notes

I liked that the injects felt petty-real—slow image pulls, not cartoonish fire drills.

— Sora