Runbook: Liveblocks outage

Symptoms

  • Sentry: spike in LiveblocksError or "WebSocket connection failed" errors from the frontend
  • Multi-user workflow co-editing does not work (cursors do not sync, broadcasts are dropped)
  • POST /api/v1/liveblocks/auth returns 500 OR 502
  • Liveblocks dashboard (https://liveblocks.io) shows a status issue or an elevated failure rate

Severity & escalation

  • PAGE 24/7. Real-time collab is down, but a degraded mode is possible: REST + versions still work (Phase 12 MD edit), just without presence/cursors
  • Ack window: 15 min
  • Escalate at 30 min → engineering lead
  • Long outage (>2h): communicate degraded mode to users via a UI banner
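
The timings above can be sketched as a small helper, for example to drive reminders from an incident bot. This is only an illustration: `escalationSteps` and the step names are hypothetical, and only the thresholds (15 min, 30 min, 2 h) come from this runbook.

```typescript
// Hypothetical helper: maps minutes since the page fired to the steps
// that are due by then. Thresholds are taken from this runbook.
type Step = "ack" | "escalate-to-engineering-lead" | "show-degraded-banner";

function escalationSteps(minutesSincePage: number, acked: boolean): Step[] {
  const due: Step[] = [];
  // Ack window is 15 min: an un-acked page at or past that point is overdue.
  if (!acked && minutesSincePage >= 15) due.push("ack");
  // Escalate to the engineering lead at 30 min.
  if (minutesSincePage >= 30) due.push("escalate-to-engineering-lead");
  // Long outage (> 2 h): tell users about degraded mode via the UI banner.
  if (minutesSincePage > 120) due.push("show-degraded-banner");
  return due;
}
```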

Immediate actions (< 5 min)

  1. Check Liveblocks status: https://status.liveblocks.io/ (or the dashboard top bar)
  2. Check our auth endpoint: curl -X POST arno-api.vadimpianof.workers.dev/api/v1/liveblocks/auth ... with a valid JWT. Does it return an error?
  3. Check Liveblocks dashboard:
  4. Verify secret: wrangler secret list --config wrangler.toml | grep LIVEBLOCKS (expect LIVEBLOCKS_SECRET_KEY present)
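
As a rough aid, the results of these checks can be mapped onto the diagnosis branches in the next section. The sketch below is an assumption about how one might order the checks, not an existing tool: `Observations`, `triage`, and the branch labels are hypothetical names.

```typescript
// Hypothetical triage helper. Inputs mirror the immediate checks above;
// branch labels mirror the Diagnosis section of this runbook.
interface Observations {
  statusPageIncident: boolean; // status.liveblocks.io reports an incident
  authStatus: number;          // HTTP status from POST /api/v1/liveblocks/auth
  secretPresent: boolean;      // LIVEBLOCKS_SECRET_KEY appears in `wrangler secret list`
}

type Branch =
  | "A: Liveblocks-side outage"
  | "B: our auth endpoint broken"
  | "C: rate limit hit"
  | "none: auth looks healthy";

function triage(o: Observations): Branch {
  // Assumed ordering: an upstream incident explains everything else.
  if (o.statusPageIncident) return "A: Liveblocks-side outage";
  // 429 on auth points at the plan limit rather than at our code.
  if (o.authStatus === 429) return "C: rate limit hit";
  // Missing secret or a 5xx (500/502) from our endpoint is on our side.
  if (!o.secretPresent || o.authStatus >= 500) return "B: our auth endpoint broken";
  return "none: auth looks healthy";
}
```

If the result is "none" while users still report problems, the known false positives at the end of this runbook are worth checking before paging further.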

Diagnosis (5-20 min)

Branch A: Liveblocks-side outage

  • Status page shows an incident → wait
  • Switch the frontend to the degraded-mode banner:
    • "Real-time collab is temporarily unavailable. Changes are saved and become visible after reload."
    • MD editor (Phase 12) continues to work over REST
  • Workflow canvas: read-only mode (users cannot edit nodes/edges while Liveblocks is down, because the workflow's primary storage lives there)
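
One way to keep these three behaviours consistent is to derive them from a single flag. The following is a minimal sketch under that assumption; `CollabUiState` and `collabUiState` are hypothetical names and do not describe the actual frontend code. The banner string is an English rendering of the banner text in this runbook.

```typescript
// Hypothetical UI state derived from Liveblocks availability.
interface CollabUiState {
  banner: string | null;     // degraded-mode banner text, or null when healthy
  mdEditorEnabled: boolean;  // MD editor (Phase 12) works over REST either way
  canvasReadOnly: boolean;   // workflow canvas cannot be edited while Liveblocks is down
  presenceEnabled: boolean;  // cursors/presence need a live connection
}

const DEGRADED_BANNER =
  "Real-time collab is temporarily unavailable. Changes are saved and become visible after reload.";

function collabUiState(liveblocksUp: boolean): CollabUiState {
  if (liveblocksUp) {
    return { banner: null, mdEditorEnabled: true, canvasReadOnly: false, presenceEnabled: true };
  }
  // Degraded mode: keep REST-backed editing, freeze the canvas, drop presence.
  return { banner: DEGRADED_BANNER, mdEditorEnabled: true, canvasReadOnly: true, presenceEnabled: false };
}
```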

Branch B: Our auth endpoint broken

  • Tail wrangler and find the exception in liveblocks.ts::POST /api/v1/liveblocks/auth
  • Common causes:
    • LIVEBLOCKS_SECRET_KEY env var missing or rotated incorrectly
    • Project ID hardcoded mismatch (see master spec §I.2.2: project:${id} room naming)
    • User ownership check failed silently
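
To make these three causes fail loudly rather than silently, the handler could run a pre-flight check along the following lines. This is a sketch, not the contents of liveblocks.ts: `AuthRequestContext`, `checkAuthPreconditions`, and the chosen status codes are assumptions, and the `sk_` prefix check is inferred from the `sk_dev_...` example in the Recovery table.

```typescript
// Hypothetical pre-flight check for the auth handler.
interface AuthRequestContext {
  secretKey: string | undefined; // LIVEBLOCKS_SECRET_KEY from the Worker env
  projectId: string;             // project the user is trying to open
  requestedRoom: string;         // room id sent by the client
  userOwnsProject: boolean;      // result of the ownership lookup
}

type AuthCheck = { ok: true } | { ok: false; status: number; reason: string };

function checkAuthPreconditions(ctx: AuthRequestContext): AuthCheck {
  // Cause 1: secret missing or rotated incorrectly (assumed "sk_" prefix).
  if (!ctx.secretKey || !ctx.secretKey.startsWith("sk_")) {
    return { ok: false, status: 500, reason: "LIVEBLOCKS_SECRET_KEY missing or malformed" };
  }
  // Cause 2: room name must follow the project:${id} convention.
  const expectedRoom = `project:${ctx.projectId}`;
  if (ctx.requestedRoom !== expectedRoom) {
    return { ok: false, status: 400, reason: `room mismatch: expected ${expectedRoom}, got ${ctx.requestedRoom}` };
  }
  // Cause 3: surface ownership failures with an explicit reason.
  if (!ctx.userOwnsProject) {
    return { ok: false, status: 403, reason: "user does not own project" };
  }
  return { ok: true };
}
```

Logging `reason` on every non-ok result should make the exception easier to spot when tailing wrangler.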

Branch C: Rate limit hit

  • Liveblocks Free: 100 connections/mo concurrent. If we crossed it → 429 on auth
  • Check dashboard → Usage
  • Mitigation: upgrade to Liveblocks Pro ($99/mo per the cost ladder), or apply startup credits (parking, master spec §III.6)

Recovery

Issue                    Action
Liveblocks-side outage   Banner degraded mode; wait. MD editor still works.
Secret missing/wrong     echo -n "sk_dev_..." | wrangler secret put LIVEBLOCKS_SECRET_KEY --config wrangler.toml
Rate limit               Apply Liveblocks Startup program / upgrade Pro
Auth endpoint code bug   Rollback last deploy → fix → redeploy

Verification

  • POST /api/v1/liveblocks/auth with a valid JWT returns 200 with a session token
  • Browser test: open project → cursor from a 2nd window is visible
  • Sentry LiveblocksError rate < 0.1%
  • Liveblocks dashboard "Healthy" status

Aftermath

  • Post-mortem trigger: downtime > 30 min (collab harder to recover from than DB read-only)
  • Document: degraded mode duration, which users were impacted
  • If the rate limit was hit: review usage trends, plan an upgrade

Known false positives

  • Liveblocks Yjs Storage REST API throttling on bulk fetch: periodic 429s for individual room queries, not an actual outage. Do not PAGE. Add backoff in our code.
  • WebSocket reconnect storms after a client-side network blip: these appear as many errors but self-heal
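
For the first item, the backoff could look roughly like the sketch below: exponential backoff with full jitter, retrying only on 429. It is a generic pattern, not existing project code. `backoffDelayMs`, `withBackoff`, and the option names are hypothetical, and the caller supplies the actual request function.

```typescript
// Hypothetical retry wrapper: exponential backoff with full jitter for 429s.
interface RetryOptions {
  maxAttempts: number;                   // total attempts, including the first
  baseDelayMs: number;                   // cap for the first retry delay
  maxDelayMs: number;                    // upper bound for any single delay
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  random?: () => number;                 // injectable for tests, returns [0, 1)
}

// Delay before retry number `attempt` (1-based): random * min(max, base * 2^(attempt-1)).
function backoffDelayMs(attempt: number, baseDelayMs: number, maxDelayMs: number, random: () => number): number {
  const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
  return random() * cap;
}

async function withBackoff<T extends { status: number }>(
  request: () => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  const random = opts.random ?? Math.random;
  let response = await request();
  // Retry only while throttled and while attempts remain.
  for (let attempt = 1; attempt < opts.maxAttempts && response.status === 429; attempt++) {
    await sleep(backoffDelayMs(attempt, opts.baseDelayMs, opts.maxDelayMs, random));
    response = await request();
  }
  return response; // may still be a 429 if every attempt was throttled
}
```

A bulk fetch would wrap each per-room request in `withBackoff`, so that occasional 429s are absorbed by retries and only persistent throttling surfaces as an error in Sentry.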