- Date: 2026-05-22
- Status: Accepted
- Feature: URL-import (data flywheel — ADR 0013)
- Affects: url_import_spec.md § XI.3, § X.1
Context
The data flywheel (ADR 0013) requires a shadow dataset: a log of production extractions for quarterly LoRA training (ADR 0014).
The UX/legal question: how do we obtain consent for data collection?
Initial proposal (rejected): registration checkbox
☑ I own the rights to this URL
☐ Use my import to improve ARNO (anonymously) ← default OFF
Problems:
- Default OFF → opt-in rate ~15% (industry baseline for secondary checkboxes)
- Flywheel runs 5-10× slower → Q1 LoRA training uses 1-2k examples instead of 10k
- Accuracy below baseline → no LoRA deployment → § 0.8 requirement ("the longer we run, the less we pay") broken
- Data flywheel multiplier dies
User's counter-proposal:
"Why can't we do this in the background without showing it to the user? It's just a feature of the system."
Decision
ToS-based disclosure (industry pattern Linear / Figma / Notion / Sentry):
- At registration: the user accepts the Terms of Service, which disclose shadow data collection
- No registration checkbox — clean UX, less friction
- Settings → Privacy has opt-out toggle (default ON)
- Privacy Policy footer discloses Tranco attribution + third-party services
ToS clause text
By using ARNO ("Service"), you acknowledge that we collect anonymized
usage data including:
- Anonymized URLs (cryptographically hashed via SHA-256)
- Component metadata (without personally identifiable information)
- Extraction outcomes and any corrections you make
This data is used solely to improve Service accuracy and quality. We
do not sell or share this data with third parties.
You may opt out at any time via Settings → Privacy → "Anonymous data
contribution". Past contributions can be deleted via Settings →
Privacy → "Delete past contributions".
For European users (GDPR), this processing relies on legitimate
interest in service improvement (Art. 6(1)(f)). You have the right
to object at any time without affecting Service availability.
Full draft → url_import_tos_clause_draft.md. Legal review pending (~$500-1000 with a privacy lawyer).
Why this works
Legal foundation
- GDPR Art. 6(1)(f) legitimate interest: service improvement is a recognized purpose
- GDPR Art. 13/14: disclosure satisfied via ToS + Privacy Policy
- GDPR Art. 17/21: "Delete past contributions" covers the right to erasure (Art. 17); the opt-out covers the right to object (Art. 21)
- CCPA: disclosure at collection point + opt-out mechanism
Industry pattern
| Company | Approach |
|---|---|
| Linear | Usage analytics ToS-disclosed, opt-out in settings |
| Figma | Anonymized usage data, ToS-based, opt-out |
| Notion | Telemetry ToS-disclosed, settings toggle |
| Sentry | Performance data ToS-disclosed |
| GitHub Copilot | Code suggestions, ToS-disclosed, settings opt-out |
Industry baseline — default ON, ToS-disclosed, settings opt-out.
Opt-out rate realistic
Industry data:
- Default OFF + checkbox: ~15% opt-in
- Default ON + settings opt-out: ~85-95% effective participation (5-15% opt-out)
ToS-based disclosure achieves ~95-100% effective participation (only ~0-5% of users bother to find the settings opt-out).
Flywheel mathematics:
- 15% participation → 1.5k examples/quarter → insufficient for LoRA
- 95% participation → 9.5k examples/quarter → meets LoRA training threshold
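The arithmetic behind these two figures can be reproduced with a short sketch. It assumes a base of roughly 10,000 eligible extractions per quarter, which is inferred from the numbers above (1.5k at 15%) and is not a measured value; `examplesPerQuarter` is an illustrative name, not part of the codebase.

```javascript
// Assumption: ~10,000 eligible extractions per quarter, back-computed from
// the figures in this ADR (1.5k examples at 15% participation).
const ELIGIBLE_PER_QUARTER = 10_000;

// Expected number of logged examples for a given participation rate (0..1).
function examplesPerQuarter(eligibleExtractions, participationRate) {
  if (participationRate < 0 || participationRate > 1) {
    throw new RangeError("participationRate must be between 0 and 1");
  }
  return Math.round(eligibleExtractions * participationRate);
}

const checkboxOptIn = examplesPerQuarter(ELIGIBLE_PER_QUARTER, 0.15);
const tosBased = examplesPerQuarter(ELIGIBLE_PER_QUARTER, 0.95);

console.log(checkboxOptIn);                         // 1500
console.log(tosBased);                              // 9500
console.log((tosBased / checkboxOptIn).toFixed(1)); // "6.3"
```

The resulting ratio of about 6.3× falls inside the 5-10× slowdown cited in Context.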
UX implementation
Registration UI (clean)
URL: [_______________________]
☑ I own the rights to this URL
License: ⊙ Owned ○ Licensed ○ Public domain
⚠️ [warning for Tranco top 15k commercial sites]
[By accepting the Terms of Service, you agree to the use of
anonymized data. See the Privacy Policy for details.]
[Import]
No registration checkbox is shown for shadow data; that is covered by the ToS.
Settings → Privacy
Privacy Settings
☑ Anonymous data contribution (default ON)
Help improve ARNO by allowing anonymized usage data.
[Learn more] [Delete past contributions]
The settings opt-out applies to ALL data collection (shadow + analytics).
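One way to make a single toggle cover every pipeline is to route all collection through one gate. The sketch below is illustrative only: `contributeData` matches the flag used in the backend snippet, while `createDefaultPrivacySettings`, `canCollect`, and `makeCollector` are hypothetical names, and the in-memory `sink` stands in for real storage.

```javascript
// Default ON is written explicitly at account creation (hypothetical helper),
// so the gate itself can fail closed when settings are missing.
function createDefaultPrivacySettings() {
  return { contributeData: true };
}

// Single source of truth for "may we collect anything from this user?"
function canCollect(user) {
  return user?.settings?.contributeData === true;
}

// Wraps any pipeline (shadow dataset, analytics, ...) behind the same gate.
function makeCollector(pipeline, sink) {
  return (user, payload) => {
    if (!canCollect(user)) return false; // opted out: nothing is recorded
    sink.push({ pipeline, payload });
    return true;
  };
}

const sink = [];
const logShadow = makeCollector("shadow", sink);
const trackAnalytics = makeCollector("analytics", sink);

const user = { settings: createDefaultPrivacySettings() };
logShadow(user, { event: "import" });      // recorded
trackAnalytics(user, { event: "import" }); // recorded

user.settings.contributeData = false;      // user flips the toggle
logShadow(user, { event: "import" });      // skipped
trackAnalytics(user, { event: "import" }); // skipped

console.log(sink.length); // 2
```

Keeping the check in one function means a new collection pipeline cannot bypass the toggle by accident.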
Implementation backend
    // `b2` (storage client) and `sha256` (hex digest helper) are assumed to be
    // defined elsewhere in the backend.
    async function logToShadowDataset(extraction, user) {
      // Respect the Settings → Privacy opt-out.
      if (!user.settings.contributeData) return;

      // "Gold" = the user corrected the extraction.
      const isGold = extraction.user_corrections != null;
      // Keep 100% of gold examples and a 10% random sample of uncorrected ones.
      const shouldSample = isGold || Math.random() < 0.10;
      if (!shouldSample) return;

      await b2.upload({
        url_hash: sha256(extraction.url), // raw URL is never stored
        capture_matrix: extraction.screenshots,
        component_spec: extraction.spec,
        tsx: extraction.tsx,
        user_corrections: extraction.user_corrections ?? null,
        mode: extraction.mode,
        cost_usd: extraction.cost,
        latency_ms: extraction.latency,
      });
    }

Anonymization technical
URL → SHA-256 hash (raw URL not stored). Component metadata stripped of:
- User IDs
- Email addresses
- Auth tokens
- Personal data in URLs (query params containing PII)
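A minimal sketch of these two steps, assuming a Node.js backend. All helper names (`stripUrlPii`, `hashUrl`, `stripMetadata`) and the metadata field names are hypothetical. Dropping the entire query string, fragment, and userinfo is a deliberately conservative assumption; the real pipeline may keep non-PII parameters.

```javascript
const { createHash } = require("node:crypto");

// Remove the parts of a URL most likely to carry personal data:
// userinfo (user:password@), query string, and fragment.
function stripUrlPii(rawUrl) {
  const u = new URL(rawUrl);
  u.username = "";
  u.password = "";
  u.search = "";
  u.hash = "";
  return u.toString();
}

// SHA-256 over the sanitized URL, hex-encoded; the raw URL is not kept.
function hashUrl(rawUrl) {
  return createHash("sha256").update(stripUrlPii(rawUrl), "utf8").digest("hex");
}

// Hypothetical field names for the identifiers listed above.
const PII_KEYS = new Set(["userId", "email", "authToken"]);

// Return a copy of the metadata without the denylisted keys.
function stripMetadata(metadata) {
  return Object.fromEntries(
    Object.entries(metadata).filter(([key]) => !PII_KEYS.has(key))
  );
}

console.log(stripUrlPii("https://user:pw@example.com/pricing?email=a%40b.c#top"));
// https://example.com/pricing
console.log(stripMetadata({ userId: 7, email: "a@b.c", componentCount: 3 }));
// { componentCount: 3 }
```

With this design, two imports of the same page that differ only in query parameters map to the same `url_hash`.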
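Goal: meet the GDPR "anonymized" standard (irreversible, no re-identification possible). A plain SHA-256 of a public URL can still be matched by hashing candidate URLs, so this claim should be confirmed in the pending legal review.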
Consequences
Pros:
- Flywheel mathematics work: sufficient data for quarterly LoRA
- Cleaner registration UX — less friction
- Industry-standard approach — proven legal pattern
- § 0.8 requirement satisfied
Cons:
- Requires ToS/Privacy Policy legal review (~$500-1000)
- Settings opt-out path must be prominent (UX work)
- Discovery: some users will find the opt-out only after months
Risks
| Risk | Mitigation |
|---|---|
| Privacy backlash if discovered | Industry-standard disclosure, prominent settings toggle, transparent docs |
| GDPR fine if found insufficient | Pre-launch legal review, anonymization meets standard, opt-out within 30 days |
| EU AI Act (Aug 2026) changes requirements | Re-review legal clauses post-effective date, update if needed |
Alternatives rejected
A. Registration checkbox default OFF
- ❌ Flywheel mathematics broken (15% opt-in insufficient)
- ❌ Breaks § 0.8 hard requirement
B. Registration checkbox default ON
- ❌ Marginal benefit over ToS approach
- ❌ Adds registration friction
- ❌ Dark pattern accusation risk
C. Hidden collection (no disclosure)
- ❌ GDPR violation
- ❌ Reputational disaster if discovered
- ❌ ILLEGAL: not to be considered
D. Hidden collection + ToS clause only
- ❌ Settings opt-out missing = no right-to-object
- ❌ GDPR Art. 21 violation
Cross-references
- Main spec § XI.3
- Main spec § X.1
- ToS draft — full clause text
- ADR 0013 — flywheel cost reduction requirement
- ADR 0014 — what training pipeline does with collected data