Runbook: Database (Neon Postgres) unreachable

Symptoms

  • Sentry: spike in errors with error_type: NeonConnectionError OR PostgresError
  • API endpoints return 500 with code: "internal" and a message mentioning the DB
  • /ready endpoint reports degraded: true with hard dep "postgres"
  • Grafana: pg_up == 0 OR Postgres connection latency > 5s
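
The /ready symptom can be checked from a terminal. A minimal sketch, assuming the endpoint returns JSON with a top-level "degraded" boolean (the exact response shape is not specified in this runbook); ready_is_degraded and API_BASE are hypothetical names:

```shell
#!/usr/bin/env bash
# Return 0 (true) when a /ready response body reports degraded: true.
# Assumption: the body is JSON containing "degraded": true|false.
ready_is_degraded() {
  printf '%s' "$1" | grep -Eq '"degraded"[[:space:]]*:[[:space:]]*true'
}

# Hypothetical usage (API_BASE is a placeholder for the API origin):
#   body="$(curl -fsS --max-time 5 "$API_BASE/ready")"
#   if ready_is_degraded "$body"; then echo "postgres degraded"; fi
```

The check uses grep rather than a JSON parser so it runs on a bare shell; it only looks at the degraded flag and does not inspect which dependency is failing.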

Severity & escalation

  • PAGE 24/7: writes are fully unavailable. The read fallback (snapshots in KV) stays fresh for 60s after the last refresh, then serves stale data
  • Ack window: 15 min
  • Escalate after 30 min → engineering lead
  • For a Neon-side outage with no ETA → consider PITR restore to secondary region (Phase 16+ post-MVP)

Immediate actions (< 5 min)

  1. Reproduce: wrangler tail should show a stream of DB errors
    cd apps/api && npx wrangler tail --config wrangler.toml --format=pretty
  2. Check Neon status: https://neon.tech/status
  3. Open Neon dashboard: https://console.neon.tech → project arno-prod (Frankfurt)
    • If it shows Suspended (free tier auto-suspend) → open the project to wake it up. Resume takes ~10s
    • If it shows Error / Down → wait OR contact Neon support
  4. Verify connection from CLI:
    psql 'postgresql://neondb_owner:***@ep-dry-block-al36bkvg.c-3.eu-central-1.aws.neon.tech/neondb?sslmode=require' -c 'SELECT 1'
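
Step 4 can be wrapped so that the connection attempt fails fast instead of hanging. A sketch, assuming the connection string is exported as DATABASE_URL in the operator's shell (it has to come from wherever the team stores it, since wrangler only lists secret names); check_db is a hypothetical helper. PGCONNECT_TIMEOUT is the standard libpq connect timeout in seconds:

```shell
#!/usr/bin/env bash
# Run SELECT 1 against the database with a bounded connect timeout.
# Assumption: DATABASE_URL is already exported in the current shell.
check_db() {
  if [ -z "${DATABASE_URL:-}" ]; then
    echo "DATABASE_URL is not set" >&2
    return 2
  fi
  # -A: unaligned output, -t: tuples only, so a healthy DB prints just "1".
  PGCONNECT_TIMEOUT="${PGCONNECT_TIMEOUT:-5}" psql "$DATABASE_URL" -At -c 'SELECT 1'
}
```

Reading the URL from the environment also keeps the password out of shell history.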

Diagnosis (5-20 min)

Branch A: Neon suspended (free tier)

  • This is expected behaviour after an idle period > 5 min on the free tier
  • Mitigation: upgrade to Neon Launch ($19/mo) once DAU > 50 OR the service is business critical (per cost ladder)
  • Open the project in the dashboard → resume → verify /ready is green

Branch B: Neon up but Workers can't connect

  • Check the format of the DATABASE_URL wrangler secret: it should be postgresql://user:pass@host/db?sslmode=require
  • Check the Neon IP allowlist: Workers exit IPs are unstable, so Neon must accept any IP (the default for the serverless tier)
  • Check the rate limit: Neon free allows 100 concurrent connections; we use the Drizzle HTTP driver (stateless, no pool)
  • Run a direct query via tools/migrate/ (Pool driver) → if it works → HTTP driver issue (unlikely)
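
The DATABASE_URL format check in the first bullet can be done mechanically before the secret is re-uploaded. A rough sketch; database_url_looks_valid is a hypothetical helper, and the pattern follows the format stated above, which is stricter than libpq itself (libpq also accepts the postgres:// scheme):

```shell
#!/usr/bin/env bash
# Return 0 when the URL matches postgresql://user:pass@host[:port]/db?...sslmode=require
# Assumption: user, password, host and db contain no unescaped ':', '@' or '/'.
database_url_looks_valid() {
  printf '%s\n' "$1" | grep -Eq \
    '^postgresql://[^:@/]+:[^@/]+@[^/@:]+(:[0-9]+)?/[^/?]+\?(.*&)?sslmode=require(&.*)?$'
}

# Hypothetical usage, with the candidate URL stored in new-url.txt:
#   database_url_looks_valid "$(cat new-url.txt)" || echo "malformed DATABASE_URL" >&2
```

This only validates the shape of the string; it does not prove that the credentials or host are correct.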

Branch C: Neon-side outage

  • Status page shows an incident → wait
  • Inform users via the status page if downtime > 5 min
  • Consider PITR restore: Neon → branch → PITR to point before outage

Recovery

Issue                  | Action
Suspended (free tier)  | Resume in dashboard; wake takes ~10s
Bad DATABASE_URL       | wrangler secret put DATABASE_URL --config wrangler.toml < new-url.txt
Region outage          | Wait for Neon, OR, in an emergency, create a branch in another region (PITR-based) and repoint DATABASE_URL
Rate limit             | Investigate runaway queries (Sentry traces) → kill stuck connections via Neon dashboard

Verification

  • /ready endpoint returns degraded: false
  • API endpoints with writes (POST /api/v1/projects) succeed
  • Sentry error rate returns to baseline ≤ 5 min after recovery
  • Grafana: pg_up == 1 sustained for 5 min
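
The "sustained for 5 min" criterion can be approximated from a terminal with repeated probes. A sketch only; stays_healthy and the probe command are hypothetical, and 11 checks spaced 30s apart is one assumed way to span 5 minutes:

```shell
#!/usr/bin/env bash
# stays_healthy CHECKS INTERVAL CMD [ARGS...]
# Run CMD up to CHECKS times, sleeping INTERVAL seconds between runs.
# Return 0 only if every run succeeds; return 1 on the first failure.
stays_healthy() {
  local checks="$1" interval="$2" i=0
  shift 2
  while [ "$i" -lt "$checks" ]; do
    "$@" || return 1
    i=$((i + 1))
    # Sleep between probes, but not after the final one.
    if [ "$i" -lt "$checks" ]; then
      sleep "$interval"
    fi
  done
  return 0
}

# Hypothetical usage, where probe_ready is any command that exits 0 when
# /ready reports degraded: false:
#   stays_healthy 11 30 probe_ready && echo "stable for 5 min"
```

This complements the Grafana check rather than replacing it, since it observes /ready from the operator's machine and not pg_up itself.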

Aftermath

  • Post-mortem trigger: downtime ≥ 10 min
  • Document: timeline, root cause, was PITR needed?
  • If free tier suspends happen often (> 1/week) → upgrade to Launch
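
The post-mortem threshold is simple arithmetic on the incident timeline. A small sketch, assuming the outage start and end are recorded as Unix epoch seconds (for example via date +%s); needs_postmortem is a hypothetical helper:

```shell
#!/usr/bin/env bash
# needs_postmortem START_EPOCH END_EPOCH
# Return 0 when downtime is at least 10 minutes (600 seconds).
needs_postmortem() {
  local start="$1" end="$2"
  [ $((end - start)) -ge 600 ]
}

# Hypothetical usage:
#   needs_postmortem "$outage_start" "$outage_end" && echo "post-mortem required"
```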

Known false positives

  • PITR backup window (nightly, ~03:00 UTC): short read latency spike, not an unreachable database. Do not PAGE
  • Test queries from ARNO CI: may briefly increase the connection count; do not PAGE