Runbook: URL-import SSRF guard rejection spike

Symptoms

  • Grafana alert: rate(url_import_ssrf_rejections_total[5m]) > 10/min
  • Sentry: SSRFError exceptions spike above the 7-day baseline average
  • API logs: many 400 responses from the extractor with reason safe_fetch_failed

Severity & escalation

  • INVESTIGATE (not PAGE: SSRF rejections mean the guard is behaving correctly)
  • Ack window: 4 hours during business hours, next day when off-hours
  • Escalate if sustained > 1 hour OR the pattern indicates infrastructure compromise

Immediate actions (< 5 min)

  1. Check rejection reasons distribution:

    # Query Sentry / log aggregator for SSRFError reasons
    # Common reasons: scheme_not_allowed, no_safe_ips, dns_resolve_failed
  2. Sample failing URLs (anonymized):

    • If all targets are in private IP ranges → normal attack/scan traffic from external sources
    • If valid domains → DNS issue OR config drift
  3. Cross-reference with user_ids:

    • Single user spamming? → anti-abuse (see user_id rate limit)
    • Distributed across users? → external scan against extraction endpoint
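
A minimal sketch of steps 1 and 3, assuming rejection records have already been exported from Sentry or the log aggregator. The `Rejection` shape, the `summarizeRejections` helper, and the 0.8 single-user cutoff are illustrative assumptions, not part of the ARNO codebase.

```typescript
// Hypothetical shape of one exported rejection record.
interface Rejection {
  reason: string; // e.g. "scheme_not_allowed", "no_safe_ips", "dns_resolve_failed"
  userId: string;
}

interface TriageSummary {
  byReason: Record<string, number>;
  topUserId: string | null;
  topUserShare: number; // fraction of all rejections caused by the noisiest user
  pattern: "single_user" | "distributed" | "no_data";
}

// singleUserShare is an assumed cutoff, not a documented value.
function summarizeRejections(rows: Rejection[], singleUserShare = 0.8): TriageSummary {
  const byReason: Record<string, number> = {};
  const byUser: Record<string, number> = {};
  for (const row of rows) {
    byReason[row.reason] = (byReason[row.reason] ?? 0) + 1;
    byUser[row.userId] = (byUser[row.userId] ?? 0) + 1;
  }

  // Find the user responsible for the most rejections.
  let topUserId: string | null = null;
  let topCount = 0;
  for (const userId of Object.keys(byUser)) {
    if (byUser[userId] > topCount) {
      topUserId = userId;
      topCount = byUser[userId];
    }
  }

  const topUserShare = rows.length === 0 ? 0 : topCount / rows.length;
  const pattern: TriageSummary["pattern"] =
    rows.length === 0 ? "no_data" : topUserShare >= singleUserShare ? "single_user" : "distributed";
  return { byReason, topUserId, topUserShare, pattern };
}
```

A "single_user" result points toward the anti-abuse path; "distributed" suggests an external scan against the extraction endpoint.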

Diagnosis (5-20 min)

Branch A: All rejections target private IP ranges

This is normal SSRF guard behavior. Causes:

  • External scan attempting SSRF against ARNO
  • User entered a URL that resolves to 127.0.0.1 (typo, dev URL)
  • DNS rebinding attempt

Action: check whether rate limits on user_id are already in place. If it is a single user, consider an anti-abuse review.

Branch B: Rejections for valid public domains

Possible causes:

  • DNS resolver issues (Cloudflare DNS proxy outage?)
  • BLOCKED_NETWORKS list misconfigured (e.g. a public range added by accident)
  • IPv6 false positives (a new IPv6 prefix not handled correctly in the block list)

Action:

  1. Verify resolution from Worker context:
    wrangler tail | grep "dns_resolve_failed"
  2. Compare with public DNS resolution:
    dig +short example.com  # substitute the failing hostname
  3. Check the BLOCKED_NETWORKS list in packages/url-import-extractor/src/safe-fetch.ts and verify that no public range was added recently
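
Step 3 can be partly automated with a canary check: run a few well-known public addresses against the block list and flag any that match. This is a sketch under assumptions. The real format of BLOCKED_NETWORKS in safe-fetch.ts is not shown in this runbook, so the code assumes IPv4 CIDR strings, and the helper names (`ipv4ToInt`, `inCidr`, `findOverblockedCanaries`) are hypothetical.

```typescript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip: string): number {
  const parts = ip.split(".");
  if (parts.length !== 4 || !parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255)) {
    throw new Error(`invalid IPv4 address: ${ip}`);
  }
  const [a, b, c, d] = parts.map(Number);
  return ((a << 24) | (b << 16) | (c << 8) | d) >>> 0;
}

// True when ip falls inside the IPv4 CIDR block, e.g. inCidr("10.1.2.3", "10.0.0.0/8").
function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split("/");
  if (bitsText === undefined || !/^\d{1,2}$/.test(bitsText) || Number(bitsText) > 32) {
    throw new Error(`invalid CIDR: ${cidr}`);
  }
  const bits = Number(bitsText);
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// Public resolver addresses used as canaries; a correct block list must not match them.
const PUBLIC_CANARIES = ["1.1.1.1", "8.8.8.8", "9.9.9.9"];

// Return every (canary, cidr) pair where a public address is caught by the block list.
function findOverblockedCanaries(
  blocked: string[],
  canaries: string[] = PUBLIC_CANARIES,
): { ip: string; cidr: string }[] {
  const hits: { ip: string; cidr: string }[] = [];
  for (const ip of canaries) {
    for (const cidr of blocked) {
      if (inCidr(ip, cidr)) hits.push({ ip, cidr });
    }
  }
  return hits;
}

// Assumed example list (standard private, loopback, and link-local IPv4 ranges).
const exampleBlocked = ["10.0.0.0/8", "127.0.0.0/8", "169.254.0.0/16", "172.16.0.0/12", "192.168.0.0/16"];
console.log(findOverblockedCanaries(exampleBlocked)); // no canaries should match this list
```

Any non-empty result names the entry that is too broad, which is the candidate for the hotfix.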

Branch C: dns_resolve_failed bursts

Indicates external DNS layer issue:

  • Cloudflare DNS outage
  • Network egress from Workers blocked
  • Specific TLD not resolvable

Action: check the Cloudflare status page. If it is a CF issue, wait and document the outage.

Recovery

If valid SSRF protection (Branch A):

  • No action — system working as designed
  • Optionally: send an anti-abuse notification to the offending user
  • Verify rate limits on user_id are tight enough

If false positives (Branch B):

  • Hotfix the BLOCKED_NETWORKS list in safe-fetch.ts
  • Deploy backend update
  • Re-test failing URLs

If DNS layer issue (Branch C):

  • Cannot fix at the ARNO layer: wait for CF/upstream resolution
  • Communicate to users (status page) if sustained

Aftermath

  • Post-mortem if sustained > 30 min OR it caused customer-visible failures
  • Backfill SSRFError metrics in Grafana if any were missed
  • Tune the alert threshold if too noisy (currently 10/min; may need a higher floor for high-traffic periods)
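
One possible way to reason about threshold tuning is to scale the threshold with traffic while keeping the current 10/min as a floor. The sketch below is illustrative only: the 2% rejection ratio and the `effectiveThreshold` / `shouldAlert` helpers are assumptions, and the real alert lives in Grafana rather than in application code.

```typescript
interface ThresholdConfig {
  floorPerMin: number;       // absolute minimum threshold (10/min is the current alert value)
  maxRejectionRatio: number; // assumed tolerable share of requests that get rejected
}

// Threshold grows with traffic but never drops below the floor.
function effectiveThreshold(requestsPerMin: number, cfg: ThresholdConfig): number {
  return Math.max(cfg.floorPerMin, requestsPerMin * cfg.maxRejectionRatio);
}

function shouldAlert(rejectionsPerMin: number, requestsPerMin: number, cfg: ThresholdConfig): boolean {
  return rejectionsPerMin > effectiveThreshold(requestsPerMin, cfg);
}

const assumedConfig: ThresholdConfig = { floorPerMin: 10, maxRejectionRatio: 0.02 };

// Quiet period: 12 rejections/min at 200 requests/min exceeds the 10/min floor.
console.log(shouldAlert(12, 200, assumedConfig));
// Busy period: the same 12 rejections/min at 2000 requests/min stays under the scaled threshold.
console.log(shouldAlert(12, 2000, assumedConfig));
```

With these assumed values, the alert keeps its current sensitivity at low traffic and becomes less noisy during peaks.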

Known false positives

  • Tranco-listed domain with a regional CDN: DNS resolves to a geo-specific IP that occasionally falls in a filtered range. Document the specific case if it recurs.
  • IPv6 addresses in shared CGNAT (carrier-grade NAT) may look private. Verify the prefix is handled correctly in BLOCKED_NETWORKS.
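
To verify an IPv6 prefix by hand, it can help to test a suspect address against each blocked prefix. The following is a minimal sketch, not the logic in safe-fetch.ts: `parseIpv6` and `inIpv6Cidr` are hypothetical helpers, and the parser deliberately skips embedded IPv4 forms (such as ::ffff:1.2.3.4) and zone IDs.

```typescript
// Expand an IPv6 string into eight 16-bit groups, handling "::" compression.
function parseIpv6(ip: string): number[] {
  const halves = ip.toLowerCase().split("::");
  if (halves.length > 2) throw new Error(`invalid IPv6 address: ${ip}`);
  const toGroups = (s: string): number[] =>
    s === ""
      ? []
      : s.split(":").map((g) => {
          if (!/^[0-9a-f]{1,4}$/.test(g)) throw new Error(`invalid IPv6 group "${g}" in ${ip}`);
          return parseInt(g, 16);
        });
  const head = toGroups(halves[0]);
  const tail = halves.length === 2 ? toGroups(halves[1]) : [];
  const missing = 8 - head.length - tail.length;
  // Without "::" all eight groups must be present; with "::" at least one group is elided.
  if (halves.length === 1 ? missing !== 0 : missing < 1) {
    throw new Error(`invalid IPv6 address: ${ip}`);
  }
  const zeros: number[] = [];
  for (let i = 0; i < missing; i++) zeros.push(0);
  return head.concat(zeros, tail);
}

// True when ip falls inside the IPv6 prefix, e.g. inIpv6Cidr("fd00::1", "fc00::/7").
function inIpv6Cidr(ip: string, cidr: string): boolean {
  const slash = cidr.lastIndexOf("/");
  const bitsText = slash < 0 ? "" : cidr.slice(slash + 1);
  if (!/^\d{1,3}$/.test(bitsText) || Number(bitsText) > 128) {
    throw new Error(`invalid IPv6 CIDR: ${cidr}`);
  }
  const addr = parseIpv6(ip);
  const base = parseIpv6(cidr.slice(0, slash));
  let remaining = Number(bitsText);
  // Compare group by group, masking only the bits covered by the prefix length.
  for (let i = 0; i < 8 && remaining > 0; i++) {
    const take = Math.min(16, remaining);
    const mask = (0xffff << (16 - take)) & 0xffff;
    if ((addr[i] & mask) !== (base[i] & mask)) return false;
    remaining -= take;
  }
  return true;
}

// Unique-local address: expected to be caught by fc00::/7.
console.log(inIpv6Cidr("fd12:3456::1", "fc00::/7"));
// Public resolver address: must not be caught by fc00::/7.
console.log(inIpv6Cidr("2606:4700:4700::1111", "fc00::/7"));
```

If a public address matches a blocked prefix here, the prefix length in the list is likely too short.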

References