
Skill reference: sre-runbook

The sre-runbook skill authors one document genre: an SRE operational runbook — a tactical, step-by-step procedure for one specific alert. This reference describes what that document type is, how the skill produces one, when it earns its place, and the provenance and sources behind it.

Property            Value
Authors             A tactical step-by-step runbook for one alert
Purpose group       Operations
MIF conceptType     procedural
Target MIF level    2
Primary source      Google SRE Book — Table of Contents

A runbook is the document an on-call responder opens while the pager is going off. Its scope is deliberately narrow: ONE named alert or failure condition — a latency SLO burn, a queue backlog, replica lag — and the concrete steps to detect, diagnose, and remediate it under pressure. The Google SRE practice of on-call response (described in the chapter on Being On-Call) is built on exactly this idea: a responder under stress should follow a known, low-judgement procedure rather than improvise, because reliable recovery comes from rehearsed steps, not heroics. A good runbook therefore names the alert and its symptoms, states the triage checks in order, gives the remediation actions, and says when to escalate.
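The four parts named above (alert and symptoms, ordered triage checks, remediation actions, escalation criteria) can be sketched as a small data structure. This is only an illustration of the shape, not the skill's actual schema: the class names `Runbook` and `TriageCheck`, the field names, and the helper `missing_sections` are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TriageCheck:
    """One ordered diagnostic step: what to inspect and what to do next."""
    check: str        # what the responder looks at
    if_abnormal: str  # the next action when the check fails


@dataclass
class Runbook:
    """Hypothetical model of a runbook's required parts."""
    alert: str                                               # the one alert covered
    symptoms: list[str] = field(default_factory=list)
    triage: list[TriageCheck] = field(default_factory=list)  # followed in order
    remediation: list[str] = field(default_factory=list)
    escalate_when: str = ""                                  # explicit escalation criterion


def missing_sections(rb: Runbook) -> list[str]:
    """Return the names of required sections that are still empty."""
    required = {
        "alert": rb.alert,
        "symptoms": rb.symptoms,
        "triage": rb.triage,
        "remediation": rb.remediation,
        "escalate_when": rb.escalate_when,
    }
    return [name for name, value in required.items() if not value]


draft = Runbook(
    alert="Replica lag",
    symptoms=["Replica falls behind the primary"],
    triage=[TriageCheck("Confirm lag on the dashboard", "Go to remediation")],
)
print(missing_sections(draft))  # ['remediation', 'escalate_when']
```

A draft that still lacks remediation steps or an escalation criterion is not yet a runbook in the sense described here, which is what the check makes visible.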

A runbook is not a strategic incident plan — coordinating a class of incidents across roles and phases is the job of a playbook. It is not a teaching lesson; a learner who needs to understand the system wants a tutorial, not a procedure to execute at 3am. Its value is precisely that it assumes competence and optimises for speed under load.

sre-runbook is a genre skill: it carries the runbook pattern as durable instructions plus exemplars, and writes the artifact over a MIF floor so the result is at once an actionable procedure and a machine-conformant unit.

  • Pattern, made operational. The skill encodes the tactical shape — one alert, ordered detect/diagnose/remediate steps, explicit escalation criteria — and refuses anti-triggered work (a class of incidents belongs in a playbook; a learning lesson belongs in a tutorial).
  • Written for the worst moment. Steps are concrete and verifiable so a tired responder can follow them exactly; each diagnostic step leads to a decision and the next action rather than open-ended investigation.
  • Exemplars set the bar. Like every genre in the suite, it ships good-l1.md (the MIF Level-1 floor), good.md (the target level, here Level 2), bad.md (a counter-example), and evals/evals.json. The check-exemplars gate proves that good-l1.md validates at L1 and good.md at its target level.
  • MIF projection. The document is authored with MIF frontmatter (via the shared mif-frontmatter substrate) and a conceptType of procedural, reflecting that a runbook is a sequence of performed steps. mif-validate proves that the Markdown-to-JSON-LD round trip is lossless before the document is considered done.
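To make the idea of a lossless round trip concrete, here is a minimal sketch of the property being checked: convert the document to a structured form, convert it back, and require the result to equal the input exactly. It is not how mif-validate works internally. The frontmatter layout, the `level` key, and the functions `parse_document`, `render_document`, and `round_trip_is_lossless` are hypothetical, and a plain JSON object stands in for the real JSON-LD projection.

```python
import json


def parse_document(text: str) -> dict:
    """Split '---'-delimited frontmatter from the body into a plain dict."""
    _, header, body = text.split("---\n", 2)
    meta = {}
    for line in header.splitlines():
        key, _, value = line.partition(": ")
        meta[key] = value
    return {"meta": meta, "body": body}


def render_document(doc: dict) -> str:
    """Inverse of parse_document: rebuild the Markdown text."""
    header = "".join(f"{key}: {value}\n" for key, value in doc["meta"].items())
    return f"---\n{header}---\n{doc['body']}"


def round_trip_is_lossless(text: str) -> bool:
    """Markdown -> JSON -> Markdown must reproduce the input exactly."""
    as_json = json.dumps(parse_document(text))
    return render_document(json.loads(as_json)) == text


# Illustrative document; only conceptType: procedural comes from this reference.
source = (
    "---\n"
    "conceptType: procedural\n"
    "level: 2\n"
    "---\n"
    "# Replica lag\n"
    "\n"
    "1. Confirm the lag.\n"
)
print(round_trip_is_lossless(source))
```

The comparison is deliberately strict: any detail the structured form fails to carry (here, for instance, a frontmatter line written without a space after the colon) makes the rebuilt text differ from the input, so the check fails.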

Reach for sre-runbook when you have a specific alert that pages humans and you want the response to be fast, consistent, and independent of who happens to be on call. Its value compounds over time: every incident is a chance to refine the runbook so the next responder is faster, which is the core of the on-call discipline.

Do not use it for cross-team coordination of a major incident — that is a playbook, which works at a higher altitude across roles and phases. Do not use it to teach the system or to record a decision. The cost of a runbook is freshness: a procedure that drifts from reality is worse than none, so it must be revised whenever the system or the alert changes.
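The freshness rule can be stated as a simple comparison. The sketch below assumes each runbook records when it was last reviewed and that the date of the most recent change to the system or alert is known; the function name `is_stale` and the dates are illustrative only.

```python
from datetime import date


def is_stale(last_reviewed: date, last_changed: date) -> bool:
    """A runbook is stale if the system or alert changed after its last review."""
    return last_changed > last_reviewed


# Reviewed in January, alert threshold changed in March: revise before relying on it.
print(is_stale(last_reviewed=date(2024, 1, 10), last_changed=date(2024, 3, 2)))  # True
```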

A runbook titled “Checkout latency SLO burn” names the alert and its firing threshold, then walks through ordered steps: confirm the burn on the dashboard, check the dependency health panel, identify whether the cause is the database or a downstream service, apply the matching remediation (shed load, fail over the replica), and verify that the SLO recovers. It closes with the condition that escalates to a playbook-coordinated incident if remediation does not hold.
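The checkout example follows a fixed sequence with one branch (the cause) and one exit (escalation). A minimal sketch of that flow is shown below. The pairing of each cause with a remediation is an assumption made for illustration, since the example lists the two remediations without saying which cause each one matches; the names `Cause`, `REMEDIATION`, and `respond` are likewise hypothetical.

```python
from enum import Enum


class Cause(Enum):
    DATABASE = "database"
    DOWNSTREAM = "downstream service"


# Assumed pairing of causes with the two remediations the example names.
REMEDIATION = {
    Cause.DATABASE: "fail over the replica",
    Cause.DOWNSTREAM: "shed load",
}


def respond(cause: Cause, slo_recovered: bool) -> list[str]:
    """Return the ordered actions for one firing of the alert."""
    steps = [
        "confirm the burn on the dashboard",
        "check the dependency health panel",
        f"cause identified: {cause.value}",
        f"remediate: {REMEDIATION[cause]}",
        "verify the SLO recovers",
    ]
    if not slo_recovered:
        # Remediation did not hold: hand off to the coordinated process.
        steps.append("escalate to a playbook-coordinated incident")
    return steps


for step in respond(Cause.DATABASE, slo_recovered=False):
    print(step)
```

Each diagnostic step ends in a decision and a next action, so the responder never reaches a point where the procedure asks for open-ended investigation.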