Signals, Traces, and Noise Control

Overview

Monitoring and troubleshooting week teaches you to choose a small signal set, propagate trace context responsibly, and narrate graphs during incidents without drowning teammates in panels.

What is included

Prometheus scrape hygiene exercises
Trace sampling tradeoff worksheet
Log volume budgeting with sidecar pitfalls called out
Dashboard critique studio
Live tail pairing on structured logs
Runbook snippet library for common kube-state signals
Optional night lab for night-shift engineers

Outcomes

Propose three golden signals for a sample service
Demonstrate trace correlation across two services
Trim redundant alerts from a starter rules file

Lead instructor for this track

Elias Romero

Ex-observability vendor educator; now allergic to chart sprawl.

FAQ

Which stacks are installed?

Prometheus, Grafana, and OpenTelemetry collectors; versions are pinned in each cohort announcement.

Can we bring proprietary agents?

Not into shared clusters; talk to us about a private lab build if you need that fidelity.

What is out of scope?

We do not tune enterprise appliances; the focus stays on Kubernetes-native telemetry.

Recent learner notes

Finally someone said aloud that half our dashboards were vanity.

— Noah, On-call lead