Skill reference: sre-runbook
The `sre-runbook` skill authors one document genre: an SRE operational
runbook — a tactical, step-by-step procedure for one specific alert. This
reference describes what that document type is, how the skill produces one, when
it earns its place, and the provenance and sources behind it.
| Property | Value |
|---|---|
| Authors | A tactical step-by-step runbook for one alert |
| Purpose group | Operations |
| MIF conceptType | procedural |
| Target MIF level | 2 |
| Primary source | Google SRE Book — Table of Contents |
What this document type is
A runbook is the document an on-call responder opens while the pager is going off. Its scope is deliberately narrow: ONE named alert or failure condition (a latency SLO burn, a queue backlog, replica lag) and the concrete steps to detect, diagnose, and remediate it under pressure. The Google SRE practice of on-call response (described in the chapter on Being On-Call) is built on exactly this idea: a responder under stress should follow a known, low-judgement procedure rather than improvise, because reliable recovery comes from rehearsed steps, not heroics. A good runbook therefore names the alert and its symptoms, states the triage checks in order, gives the remediation actions, and says when to escalate.
A runbook is not a strategic incident plan — coordinating a class of incidents across roles and phases is the job of a playbook. It is not a teaching lesson; a learner who needs to understand the system wants a tutorial, not a procedure to execute at 3am. Its value is precisely that it assumes competence and optimises for speed under load.
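The parts a good runbook carries can be sketched as a small data structure. This is a minimal illustration, not the skill's actual schema: the `Runbook` class, its field names, and the `missing_parts` helper are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """Hypothetical in-memory shape of a runbook; not the skill's real schema."""
    alert: str                                               # the ONE alert covered
    symptoms: list[str] = field(default_factory=list)
    triage_checks: list[str] = field(default_factory=list)   # kept in order
    remediations: list[str] = field(default_factory=list)
    escalation: str = ""                                     # when to hand off


def missing_parts(rb: Runbook) -> list[str]:
    """Return the names of the required parts that are empty."""
    parts = {
        "alert": rb.alert.strip(),
        "symptoms": rb.symptoms,
        "triage_checks": rb.triage_checks,
        "remediations": rb.remediations,
        "escalation": rb.escalation.strip(),
    }
    return [name for name, value in parts.items() if not value]
```

A draft that names an alert but nothing else would report every other part as missing, which is a quick way to spot a runbook that is not yet usable under pressure.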
How the skill produces one
`sre-runbook` is a genre skill: it carries the runbook pattern as durable
instructions plus exemplars, and writes the artifact over a MIF floor so the
result is at once an actionable procedure and a machine-conformant unit.
- Pattern, made operational. The skill encodes the tactical shape (one alert, ordered detect/diagnose/remediate steps, explicit escalation criteria) and refuses anti-triggered work (a class of incidents belongs in a playbook; a learning lesson belongs in a tutorial).
- Written for the worst moment. Steps are concrete and verifiable so a tired responder can follow them exactly; each diagnostic step leads to a decision and the next action rather than open-ended investigation.
- Exemplars set the bar. Like every genre in the suite, it ships `good-l1.md` (the MIF Level-1 floor), `good.md` (the target level, Level 2 here), `bad.md` (a counter-example), and `evals/evals.json`. The `check-exemplars` gate proves `good-l1.md` validates at L1 and `good.md` at its target level.
- MIF projection. The document is authored with MIF frontmatter (via the shared `mif-frontmatter` substrate) and a `conceptType` of `procedural`, reflecting that a runbook is a sequence of performed steps. `mif-validate` proves the Markdown to JSON-LD round-trip is lossless before the document is considered done.
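As a rough illustration of the exemplar set listed above, the sketch below checks only that the four named files are present. It does not reproduce the `check-exemplars` gate or any MIF validation; the `missing_exemplars` helper and the assumption that the files sit directly under one skill directory are hypothetical.

```python
from pathlib import Path

# File names come from the list above; the directory layout is an assumption.
EXPECTED_EXEMPLARS = ("good-l1.md", "good.md", "bad.md", "evals/evals.json")


def missing_exemplars(skill_dir: Path) -> list[str]:
    """List the expected exemplar files that are absent under skill_dir."""
    return [rel for rel in EXPECTED_EXEMPLARS if not (skill_dir / rel).is_file()]
```

An empty result means every expected file exists; whether each one validates at its level is a separate question for the real gate.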
When it is beneficial
Reach for `sre-runbook` when you have a specific alert that pages humans and
you want the response to be fast, consistent, and independent of who happens to be
on call. Its value compounds over time: every incident is a chance to refine the
runbook so the next responder is faster, which is the core of the on-call
discipline.
Do not use it for cross-team coordination of a major incident — that is a playbook, which works at a higher altitude across roles and phases. Do not use it to teach the system or to record a decision. The cost of a runbook is freshness: a procedure that drifts from reality is worse than none, so it must be revised whenever the system or the alert changes.
Example
A runbook titled “Checkout latency SLO burn” names the alert and its firing threshold, then walks ordered steps: confirm the burn on the dashboard, check the dependency health panel, identify whether the cause is the database or a downstream service, apply the matching remediation (shed load, fail over the replica), and verify the SLO recovers. It closes with the condition that escalates to a playbook-coordinated incident if remediation does not hold.
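The way each step in this example leads to exactly one next action can be sketched as a small decision function. This is an illustration under assumptions: the `next_action` function is hypothetical, and the pairing of each cause with a remediation is assumed, since the example names both remediations without saying which cause each one answers.

```python
from typing import Optional

# Assumed pairing; the example lists both actions but not which cause each fits.
REMEDIATION_BY_CAUSE = {
    "database": "fail over the replica",
    "downstream service": "shed load",
}

ESCALATE = "escalate to a playbook-coordinated incident"


def next_action(
    burn_confirmed: bool,
    cause: Optional[str],
    slo_recovered: Optional[bool],
) -> str:
    """Return the single next action for the current state of the response."""
    if not burn_confirmed:
        return "confirm the burn on the dashboard"
    if cause is None:
        return "check the dependency health panel"
    if cause not in REMEDIATION_BY_CAUSE:
        return ESCALATE  # cause is outside this runbook's scope
    if slo_recovered is None:
        return REMEDIATION_BY_CAUSE[cause]  # remediation not yet applied
    return "verify complete" if slo_recovered else ESCALATE
```

Passing `None` for a value that is not yet known keeps the responder on the earliest unanswered step, which mirrors the ordered, low-judgement shape of the procedure.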
Provenance & citations
- Genre source (Google SRE on-call practice): the runbook genre follows the on-call response discipline described in the Google SRE Book, https://sre.google/sre-book/table-of-contents/ (see the chapter on Being On-Call).
- Skill provenance: authored by the `sre-runbook` skill in the mif-docs plugin, https://github.com/modeled-information-format/mif-docs-plugin; the skill’s exemplars and `evals/` define and verify the pattern.
- MIF conformance: the document projects to canonical JSON-LD under the MIF specification, https://mif-spec.dev, and is proven lossless by `mif-validate`.
- Index: this skill is one entry in the skills by purpose catalog; its sibling operations genre is the strategic playbook.