URL-import
Legal brief

Purpose: structured Q&A pack for a privacy lawyer / GDPR specialist. Estimated cost: $500-1000 for a one-time review (per § XV blocker #5). Status: awaiting external review. Internal due diligence complete.

TL;DR for the lawyer

ARNO plans to launch a URL-import feature: at registration, a new user provides a URL → ARNO extracts React components from the public page → the user edits them in the editor → optional push to git.

Three legal concerns:

  1. Shadow data collection for AI model training (anonymized)
  2. Copyright attestation for user-imported URLs
  3. GDPR compliance for extracted data + retention

Pre-drafted materials:

We seek validation that the approach is GDPR/CCPA-compliant, plus identification of gaps.


Q1. GDPR legitimate interest basis for shadow data collection

Pattern: industry standard (Linear, Figma, Notion, GitHub Copilot, Cursor): anonymized data collection is ON by default, disclosed in the ToS, with opt-out in Settings.

ARNO implementation:

  • URLs hashed via SHA-256 (irreversible)
  • Component metadata stripped of PII (user IDs, email, auth tokens, PII in query params)
  • Stored separately from user-identifiable account data
  • Disclosure: ToS + Privacy Policy section at registration
  • Opt-out: Settings → Privacy → "Anonymous data contribution" toggle
  • Right to erasure: "Delete past contributions" button → cascading delete
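
The hashing and query-parameter stripping described in the list above can be sketched as follows, which may help the lawyer see exactly what is and is not stored. This is a minimal illustration, not ARNO's actual code: the function names and the PII_PARAMS deny-list are hypothetical, and the real set of stripped parameters is not specified in this brief.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list of query parameters treated as PII (assumption).
PII_PARAMS = {"email", "user_id", "token", "auth", "session"}


def strip_pii(url: str) -> str:
    """Drop deny-listed query parameters and the fragment from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in PII_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def url_hash(url: str) -> str:
    """SHA-256 hex digest of the cleaned URL (one-way, but deterministic)."""
    return hashlib.sha256(strip_pii(url).encode("utf-8")).hexdigest()


stored = url_hash("https://example.com/pricing?email=a%40b.c&plan=pro#top")

# Anyone who already knows a candidate URL can recompute the same digest
# and match it against a stored record.
guess = url_hash("https://example.com/pricing?plan=pro")
matches = stored == guess
```

Because the digest is deterministic, `matches` is true here. That is the re-identification scenario raised in question 2 below: the hash cannot be reversed, but it can be confirmed by anyone who guesses the input.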

Specific questions:

  1. Is GDPR Art. 6(1)(f) legitimate interest basis appropriate? Or do we need Art. 6(1)(a) explicit consent (= registration checkbox)?

  2. Does our "anonymization" amount to true anonymization (GDPR Recital 26), or only to pseudonymization under Art. 4(5)? We hash URLs (one-way), and the hashed URLs are linked in shadow_dataset records. Records do not contain user_id or PII. But an attacker who had the original URL could rehash it and match the record.

  3. Balancing test documentation: what specific evidence shows that the legitimate interest outweighs data subject rights?

  4. EU AI Act (effective Aug 2026): does the training data provenance disclosure requirement apply here? Our model improves component extraction; it is not general-purpose AI.


Q2. Copyright attestation for user-imported URLs

Pattern: when importing a URL, the user attests that the content is (a) owned, (b) licensed, or (c) public domain.

ARNO implementation:

  • Mandatory checkbox at URL submit
  • Required license_type selection
  • Additional confirmation for Tranco top 15k commercial domains
  • Attestation stored 7 years (legal audit)
  • ARNO is not liable for third-party copyright claims

Specific questions:

  1. Is user attestation sufficient legal protection? Or do we need DMCA registered agent + safe harbor compliance?

  2. For "fair use" of extracted UI patterns (e.g. a small business imports apple.com by mistake):

    • Is design pattern extraction "transformative" under fair use?
    • Or does it constitute a derivative work requiring permission?

  3. Indemnification clause language. Current draft:

    "You agree to indemnify ARNO against any third-party claims related to content you import."

    Sufficient? Or need stronger language?

  4. Cross-jurisdictional issues: a user in the EU imports a US-hosted site. Whose law governs?


Q3. Domain reputation system (Tranco-based)

Pattern: Tranco top 15k domains flagged as "commercial" → require extra user confirmation.

ARNO implementation:

  • Daily refresh Bloom filter (memory ~10KB)
  • 1% false positive rate (Bloom filter inherent)
  • Attribution: "Domain classification powered by Tranco (tranco-list.eu), licensed under CC-BY 4.0" in the Privacy Policy footer
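
The memory and false-positive figures above can be cross-checked against the standard Bloom filter sizing formulas. The sketch below is only a sanity check under textbook assumptions (optimal hash count, independent hashes); the function names are hypothetical and it does not measure ARNO's actual filter.

```python
import math


def bloom_size(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Return (bits, hash_count) for an optimally sized Bloom filter."""
    bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


def bloom_fp_rate(n_items: int, bits: int, hashes: int) -> float:
    """Expected false-positive rate for a filter of the given shape."""
    return (1 - math.exp(-hashes * n_items / bits)) ** hashes


# 15k Tranco domains at a 1% target false-positive rate.
bits, hashes = bloom_size(15_000, 0.01)
size_kb = bits / 8 / 1000

# The same 15k domains squeezed into a 10 KB filter (80,000 bits, 4 hashes).
fp_at_10kb = bloom_fp_rate(15_000, 80_000, 4)
```

Under these formulas, 15,000 entries at 1% need roughly 18 KB, while a 10 KB filter would give a false-positive rate of several percent. It may be worth confirming which of the two figures is the real one before it is quoted in legal text, since the false-positive rate bears directly on question 2 below.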

Specific questions:

  1. Tranco CC-BY 4.0 attribution: is placement in the Privacy Policy footer sufficient? Or does it need to be more prominent (e.g. on each ARNO dashboard page)?

  2. Defamation risk: ARNO flags mywebsite.com as "commercial" → a false positive embarrasses the user. Liability?

  3. GDPR Art. 22 automated decision-making: does the warning constitute a "decision" affecting the user?


Q4. GDPR retention matrix

ARNO implementation (from spec § XI.4):

| Data | Retention |
|---|---|
| Component files (TSX, types, tokens) | User lifetime |
| Original screenshots | 90 days |
| → 90-365 days | Hash + DINOv2 binary (not recoverable) |
| → 365+ days | Metadata only |
| Staging area (active) | < 90 days since last_activity |
| Staging area (notified) | 90-120 days; email "30 days until deletion" |
| Staging area (deleted) | 120 days (physical deletion) |
| Shadow url_hash | 2 years |
| Shadow corrections | Anonymized after 90 days |
| User attestation | 7 years |
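
The staging-area rows of the matrix can be read as a simple state rule on idle time. The sketch below is one possible reading, not the spec: the function and constant names are hypothetical, and the treatment of the exact 90-day and 120-day boundaries is an assumption the matrix does not settle.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the retention matrix; boundary handling is assumed.
NOTIFY_AFTER = timedelta(days=90)
DELETE_AFTER = timedelta(days=120)


def staging_state(last_activity: datetime, now: datetime) -> str:
    """Classify a staging area by time elapsed since last_activity."""
    idle = now - last_activity
    if idle < NOTIFY_AFTER:
        return "active"
    if idle < DELETE_AFTER:
        return "notified"  # the "30 days until deletion" email applies
    return "deleted"  # due for physical deletion


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
state_fresh = staging_state(now - timedelta(days=10), now)
state_warned = staging_state(now - timedelta(days=100), now)
state_expired = staging_state(now - timedelta(days=130), now)
```

Writing the rule out this way makes the boundary question explicit. Whether day 90 counts as "active" or "notified" affects what the Privacy Policy can promise about the deletion notice.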

Specific questions:

  1. Right-to-erasure (Art. 17) timeline: our current promise is "30 days". Is this acceptable, or is a stricter timeline needed?

  2. Is hashed data "personal data" under GDPR? Some interpretations: a hash of an input known to the user is pseudonymization, not anonymization. Position needed.

  3. "Legitimate business need" for 7-year attestation retention: sufficient? Or is a shorter timeframe needed?

  4. Cascade deletion across B2 + DB: saga pattern with retries. Is the 30-day deletion window achievable? Or is a stricter SLA needed?


Q5. Third-party processors disclosure

Services receiving user data:

| Service | What | Why |
|---|---|---|
| Google (Gemini API) | Anonymized component metadata | Text/vision analysis |
| Cerebras | TSX code generation requests | Code generation |
| Modal Labs | Hosted compute (analysis workload) | GPU inference |
| Backblaze B2 | Staging area files + shadow data | Storage |
| Hyperbrowser | Target URL (when anti-bot detected) | Browser automation |
| Liveblocks | ARNO editor real-time state | Collaboration |

Specific questions:

  1. GDPR Art. 28 data processing agreements (DPAs): are they needed for each vendor? We have standard DPAs available from each vendor. Is it sufficient to reference them?

  2. Schrems II / EU-US Data Privacy Framework: all of these services are US-based. For EU users, do we need SCCs (Standard Contractual Clauses) or DPF certification?

  3. Sub-processor disclosure: must we list each vendor publicly in the Privacy Policy?

  4. Data localization claims: can we offer "EU data residency"? Currently no; all processing is in the US. Impact for EU customers?


Q6. EU AI Act compliance (effective Aug 2026)

ARNO's AI usage:

  • Production: Gemini Flash-Lite for vision/text analysis (third-party)
  • Future (Q2+): self-hosted Qwen3-VL-32B + LoRA fine-tuned on user corrections

Specific questions:

  1. Is ARNO's URL-import an "AI system" under the AI Act? We extract structured data; we do not make decisions about people.

  2. Training data provenance — Qwen3-VL-235B-A22B teacher labels. Apache 2.0 license. We don't train on Anthropic/OpenAI outputs. Compliance with Art. 53 (training data disclosure)?

  3. High-risk category check — Annex III categories. We don't impact:

    • Biometric ID
    • Critical infrastructure
    • Education / employment
    • Law enforcement
    • Migration
    • Justice

    Compliance: low-risk system, basic transparency only?


Q7. California (CCPA / CPRA)

Specific questions:

  1. Notice at Collection — current ToS clause sufficient for "categories of personal information" disclosure?

  2. "Sale" of personal information: we don't sell, but does sharing anonymized data with the Gemini API count?

  3. "Sensitive Personal Information" under CPRA — do we collect any? (probably no — but verify)

  4. GPC (Global Privacy Control) signal honoring — required to implement?


Q8. Children's data (COPPA, GDPR-K)

ARNO does not target minors. The user attests to being an adult during registration.

Question: do we need an explicit COPPA / GDPR-K disclaimer statement, or is a general ToS clause enough?


Q9. Liability limitations

Specific questions:

  1. "As-is" disclaimer language — current ToS draft. Enforceable in all jurisdictions?

  2. Limitation of liability cap: typical $X or 12-month fees? Recommendation for ARNO (small-business target market)?

  3. Class action waiver / arbitration clauses — necessary for V1 launch?


Q10. Privacy Policy structure recommendation

Current draft has 5 clauses (Data Collection, URL Import, Tranco Attribution, Retention, Third-party processors).

Question: is this structure adequate, or are additional sections needed?

Standard GDPR Privacy Policy checklist (from regulators):

  • Identity + contact details of controller
  • DPO contact (if applicable)
  • Purposes of processing + legal basis
  • Recipients of personal data
  • International transfers + safeguards
  • Retention periods
  • Data subject rights (access, rectification, erasure, etc)
  • Right to withdraw consent
  • Right to lodge complaint with supervisory authority
  • Whether disclosure is a statutory or contractual requirement
  • Automated decision-making + profiling
  • Categories of personal data (if not collected from data subject)

Which currently covered, which missing?


Recommended deliverables from lawyer

  1. Validated ToS clauses (1-5): markup or rewrite
  2. Validated Privacy Policy — full document for production
  3. Risk assessment — biggest legal exposures + mitigation priority
  4. Standard contract templates: DPA template for potential enterprise customers
  5. 30-day post-launch follow-up — track regulatory changes (EU AI Act effective date specifically)

ARNO-side preparation for the legal call

The lawyer should review BEFORE the call:

Vadim should be available for:

  • Specific scenarios discussion (edge cases)
  • Stack decisions justification (why Tranco, why default-ON, etc)
  • Cost-benefit trade-offs (what compliance costs are acceptable for V1)

Cross-references