Skill reference: sre-runbook

Skill reference: `sre-runbook`

The sre-runbook skill authors one document genre: an SRE operational runbook — a tactical, step-by-step procedure for one specific alert. This reference describes what that document type is, how the skill produces one, when it earns its place, and the provenance and sources behind it.

Property	Value
Authors	A tactical step-by-step runbook for one alert
Purpose group	Operations
MIF `conceptType`	`procedural`
Target MIF level	2
Primary source	Google SRE Book — Table of Contents

What this document type is

A runbook is the document an on-call responder opens while the pager is going off. Its scope is deliberately narrow: ONE named alert or failure condition — a latency SLO burn, a queue backlog, replica lag — and the concrete steps to detect, diagnose, and remediate it under pressure. The Google SRE practice of on-call response (described in the chapter on Being On-Call) is built on exactly this idea: a responder under stress should follow a known, low-judgement procedure rather than improvise, because reliable recovery comes from rehearsed steps, not heroics. A good runbook therefore names the alert and its symptoms, states the triage checks in order, gives the remediation actions, and says when to escalate.

A runbook is not a strategic incident plan — coordinating a class of incidents across roles and phases is the job of a playbook. It is not a teaching lesson; a learner who needs to understand the system wants a tutorial, not a procedure to execute at 3am. Its value is precisely that it assumes competence and optimises for speed under load.

How the skill produces one

sre-runbook is a genre skill: it carries the runbook pattern as durable instructions plus exemplars, and writes the artifact over a MIF floor so the result is at once an actionable procedure and a machine-conformant unit.

Pattern, made operational. The skill encodes the tactical shape — one alert, ordered detect/diagnose/remediate steps, explicit escalation criteria — and refuses anti-triggered work (a class of incidents belongs in a playbook; a learning lesson belongs in a tutorial).
Written for the worst moment. Steps are concrete and verifiable so a tired responder can follow them exactly; each diagnostic step leads to a decision and the next action rather than open-ended investigation.
Exemplars set the bar. Like every genre in the suite it ships good-l1.md (the MIF Level-1 floor), good.md (the target level — Level 2 here), bad.md (a counter-example), and evals/evals.json. The check-exemplars gate proves good-l1.md validates at L1 and good.md at its target level.
MIF projection. The document is authored with MIF frontmatter (via the shared mif-frontmatter substrate) and a conceptType of procedural, reflecting that a runbook is a sequence of performed steps. mif-validate proves the Markdown to JSON-LD round-trip is lossless before the document is considered done.

When it is beneficial

Reach for sre-runbook when you have a specific alert that pages humans and you want the response to be fast, consistent, and independent of who happens to be on call. Its value compounds over time: every incident is a chance to refine the runbook so the next responder is faster, which is the core of the on-call discipline.

Do not use it for cross-team coordination of a major incident — that is a playbook, which works at a higher altitude across roles and phases. Do not use it to teach the system or to record a decision. The cost of a runbook is freshness: a procedure that drifts from reality is worse than none, so it must be revised whenever the system or the alert changes.

Example

A runbook titled “Checkout latency SLO burn” names the alert and its firing threshold, then walks ordered steps — confirm the burn on the dashboard, check the dependency health panel, identify whether the cause is the database or a downstream service, apply the matching remediation (shed load, fail over the replica), and verify the SLO recovers — closing with the condition that escalates to a playbook-coordinated incident if remediation does not hold.

Provenance & citations

Genre source — Google SRE on-call practice: the runbook genre follows the on-call response discipline described in the Google SRE Book, https://sre.google/sre-book/table-of-contents/ (see the chapter on Being On-Call).
Skill provenance: authored by the sre-runbook skill in the mif-docs plugin, https://github.com/modeled-information-format/mif-docs-plugin; the skill’s exemplars and evals/ define and verify the pattern.
MIF conformance: the document projects to canonical JSON-LD under the MIF specification, https://mif-spec.dev, and is proven lossless by mif-validate.
Index: this skill is one entry in the skills by purpose catalog; its sibling operations genre is the strategic playbook.