Operational playbooks for PAGE alerts (24/7 escalation, master spec §II.6).
Goal of each runbook: within the 15-minute ack window, the on-call engineer knows what to do without needing to know the architecture.
Contents
| File | Alert symptom | Severity |
|---|---|---|
| health_down.md | /health returns non-200 OR no response | PAGE |
| database_unreachable.md | Neon Postgres connection errors | PAGE |
| liveblocks_outage.md | Liveblocks API errors > 5% OR connection refused | PAGE |
| webhook_signature_spike.md | 401 responses on /webhooks/github > 10/min (potential attack) | PAGE |
| url_import_ssrf_spike.md | SSRF guard rejections > 10/min | INVESTIGATE |
| url_import_deployment.md | (operational guide, not an alert) | OPERATIONAL |
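
To make the per-minute thresholds in the table concrete, here is a minimal sketch of a sliding-window check for a rule such as "> 10/min". The class name `RateThreshold` and its parameters are hypothetical; the real alert rules presumably live in the monitoring stack, and this only illustrates the arithmetic behind them.

```python
from collections import deque


class RateThreshold:
    """Sliding-window counter (hypothetical helper, not part of the alerting stack).

    Fires when more than `limit` events fall inside the last `window_s` seconds.
    """

    def __init__(self, limit=10, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._events = deque()  # event timestamps in seconds, oldest first

    def record(self, ts):
        """Record an event at time `ts`; return True if the threshold is exceeded."""
        self._events.append(ts)
        # Drop events that have aged out of the window.
        while self._events and self._events[0] <= ts - self.window_s:
            self._events.popleft()
        return len(self._events) > self.limit
```

With the defaults, the eleventh event inside one minute is the first to trip the check, which matches the strict "> 10/min" wording in the table.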
Quarterly review
Per master spec §II.9, every quarter:
- Walk through each runbook
- Verify that links and URLs are current
- Remove resolved gotchas
- Add new failure modes encountered
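
The link-verification step can be partly automated. The sketch below only collects the URLs found in a runbook's text so that a reviewer (or a separate checker) can visit them; the function name `extract_urls` is hypothetical, and the regular expression is a deliberately simple assumption rather than a full URL grammar.

```python
import re

# Simple pattern: scheme, then everything up to whitespace or a closing bracket.
URL_RE = re.compile(r"https?://[^\s)>\]]+")


def extract_urls(markdown_text):
    """Return the unique URLs in `markdown_text`, in order of first appearance."""
    seen = {}
    for match in URL_RE.finditer(markdown_text):
        # Trim sentence punctuation that often trails a bare URL.
        url = match.group(0).rstrip(".,;:")
        seen.setdefault(url, None)
    return list(seen)
```

Running it over every file listed in the table gives a checklist of links to confirm during the review.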
Template for a new runbook
# Runbook: <alert name>
## Symptoms
- What the on-call engineer sees (precise wording)
- Where (Grafana panel / Sentry / etc.)
## Severity & escalation
- PAGE 24/7 / NOTIFY / TICKET
- Ack window: 15 min
- Escalate if not resolved within: 30 min → engineering lead
## Immediate actions (< 5 min)
1. ...
2. ...
## Diagnosis (5-20 min)
- Check X: if Y, go to the Z runbook
- Check W: ...
## Recovery
- Steps to restore
- Verification: what success looks like
## Aftermath
- Post-mortem trigger: if downtime > X min
- Backfill needed: ...
## Known false positives
- ...
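
A new runbook can be checked against the template above with a small script. The sketch below is one possible approach, not an existing tool: `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, and headings are matched by prefix so that time hints such as "(< 5 min)" may vary.

```python
# Section names taken from the template's "## " headings.
REQUIRED_SECTIONS = [
    "Symptoms",
    "Severity & escalation",
    "Immediate actions",
    "Diagnosis",
    "Recovery",
    "Aftermath",
    "Known false positives",
]


def missing_sections(runbook_text):
    """Return the template sections that have no matching '## ' heading."""
    headings = [
        line[3:].strip()
        for line in runbook_text.splitlines()
        if line.startswith("## ")
    ]
    return [
        name
        for name in REQUIRED_SECTIONS
        if not any(heading.startswith(name) for heading in headings)
    ]
```

An empty result means every section from the template is present; it says nothing about whether the content under each heading is adequate.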