Overview
This course anchors you in real cluster operations: etcd awareness, static pod recovery, networking faults, and scheduler edge cases. Each module pairs concise reference notes with timed labs on shared infrastructure that mirrors production constraints. You will document every remediation so the habits transfer directly to on-call work.
What you work through
- Live breaker-box scenarios with guided recovery checklists
- NetworkPolicy and Service debugging across dual-stack clusters
- Upgrade rehearsal paths with cordon, drain, and rollback drills
- Storage class triage for RWO, RWX, and snapshot restore flows
- Observability hooks: kubelet logs, events, and audit signal triage
- Performance tuning for API server flags and etcd latency
- Exam-style timeboxes with annotated model answers
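As a rough illustration of the cordon and drain portion of an upgrade rehearsal, the sketch below only builds the command sequence for one node and does not run anything. The function name, the node name worker-1, and the timeout value are hypothetical choices for this example, not part of the course material. The actual upgrade or rollback step depends on your tooling and is left as a comment.

```python
import shlex


def upgrade_rehearsal_commands(node: str, drain_timeout: str = "120s") -> list[str]:
    """Return the kubectl commands for one node's maintenance window, in order.

    Hypothetical helper for illustration: it only formats strings and never
    contacts a cluster.
    """
    quoted = shlex.quote(node)  # guard against odd characters in the node name
    return [
        # 1. Stop new pods from being scheduled onto the node.
        f"kubectl cordon {quoted}",
        # 2. Evict running pods; DaemonSet pods are skipped, emptyDir data is discarded.
        f"kubectl drain {quoted} --ignore-daemonsets --delete-emptydir-data "
        f"--timeout={drain_timeout}",
        # (The upgrade or rollback itself happens here and is tool-specific.)
        # 3. Return the node to service once it is healthy again.
        f"kubectl uncordon {quoted}",
    ]


if __name__ == "__main__":
    for command in upgrade_rehearsal_commands("worker-1"):
        print(command)
```

Keeping the sequence as data rather than executing it directly makes it easy to paste into a runbook or review before a drill.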
Outcomes
- Complete a full cluster rescue path without reference cards
- Explain trade-offs for etcd backup frequency in regulated teams
- Ship a personal runbook covering upgrades and failure modes
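The etcd backup trade-off named in the outcomes can be made concrete with a little arithmetic: shorter intervals shrink the worst-case window of lost writes but multiply the number of snapshots you store. The sketch below is a minimal model under stated assumptions. The BackupPlan class, the snapshot size, and the retention period are hypothetical values chosen for illustration, and the model assumes the interval divides a day evenly.

```python
from dataclasses import dataclass


@dataclass
class BackupPlan:
    """Hypothetical model of an etcd snapshot schedule (illustration only)."""

    interval_minutes: int    # time between snapshots
    snapshot_size_mb: float  # assumed average snapshot size
    retention_days: int      # how long snapshots are kept

    def worst_case_loss_minutes(self) -> int:
        # A failure just before the next snapshot loses one full interval of writes.
        return self.interval_minutes

    def snapshots_retained(self) -> int:
        # Snapshots per day times retention; assumes the interval divides 1440 evenly.
        return (24 * 60 // self.interval_minutes) * self.retention_days

    def storage_mb(self) -> float:
        return self.snapshots_retained() * self.snapshot_size_mb


if __name__ == "__main__":
    for plan in (BackupPlan(30, 200.0, 7), BackupPlan(360, 200.0, 7)):
        print(
            plan.interval_minutes,
            plan.worst_case_loss_minutes(),
            plan.snapshots_retained(),
            plan.storage_mb(),
        )
```

With these assumed numbers, moving from a six-hour to a thirty-minute interval cuts the worst-case loss window twelvefold and raises retained storage by the same factor, which is the kind of trade-off a regulated team has to justify.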
Facilitator
Haneul Park
Lead Kubernetes Instructor with a platform engineering background in regulated finance clusters.
Participant notes
The static pod recovery lab finally made kubelet behavior click. I still keep the breaker-box checklist in our incident channel.
Clear pacing, though the storage module expects you to self-drive a bit. Mentor annotations on my kubectl traces were the standout.
Course questions
Yes. Each learner receives an isolated namespace with cluster-admin scoped tasks. We do not grant unmanaged cluster-admin on shared etcd-backed environments.
We cover the published domains, but we add resilience drills that go beyond the minimum. Some topics appear in greater depth than the exam strictly requires.
Exam registration fees, travel for any optional in-person reviews, and third-party monitoring tools beyond the bundled lab stack.