Legal brief

Purpose: structured Q&A pack for a privacy lawyer / GDPR specialist. Estimated cost: $500-1000 for a one-time review (per § XV blocker #5). Status: awaiting external review. Internal due diligence complete.

TL;DR for the lawyer

ARNO is about to launch a URL-import feature: at registration, a new user provides a URL → ARNO extracts React components from the public page → the user edits them in the editor → optional push to git.

Three legal concerns:

Shadow data collection for AI model training (anonymized)
Copyright attestation for user-imported URLs
GDPR compliance for extracted data + retention

Pre-drafted materials:

url_import_tos_clause_draft.md — proposed ToS/Privacy clauses (5 sections)
url_import_spec.md § XI.3-XI.4 — technical implementation

We seek validation that the approach is GDPR/CCPA-compliant, plus identification of any gaps.

Q1. GDPR legitimate interest basis for shadow data collection

Pattern: industry standard (Linear, Figma, Notion, GitHub Copilot, Cursor): anonymized data collection is ON by default, disclosed in the ToS, with an opt-out in Settings.

ARNO implementation:

URLs hashed via SHA-256 (irreversible)
Component metadata stripped of PII (user IDs, email, auth tokens, PII in query params)
Stored separately from user-identifiable account data
Disclosure: ToS + Privacy Policy section at registration
Opt-out: Settings → Privacy → "Anonymous data contribution" toggle
Right to erasure: "Delete past contributions" button → cascading delete
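
To make the hashing and PII-stripping steps above concrete for the review, here is a minimal sketch of how a URL might be cleaned and hashed before it enters the shadow dataset. The deny-list `PII_PARAMS` and the function names are assumptions for illustration only; the actual rules live in url_import_spec.md and may differ.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list: the real set of PII-bearing parameters is not given in this brief.
PII_PARAMS = {"email", "token", "user_id", "session", "auth"}


def strip_pii(url: str) -> str:
    """Drop query parameters on the deny-list and discard the fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in PII_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def url_hash(url: str) -> str:
    """SHA-256 hex digest of the cleaned URL (unsalted, as described above)."""
    return hashlib.sha256(strip_pii(url).encode("utf-8")).hexdigest()
```

With this sketch, two URLs that differ only in a stripped parameter produce the same hash, so the stripped value itself never reaches the stored record.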

Specific questions:

Is GDPR Art. 6(1)(f) legitimate interest basis appropriate? Or do we need Art. 6(1)(a) explicit consent (= registration checkbox)?
Is our "anonymization" sufficient to count as anonymous data, or is it only pseudonymization under GDPR Art. 4(5)? We hash URLs (one-way), but the hashed URLs are linked in shadow_dataset records. Records do not contain user_id or PII. But an attacker who has the original URL could re-hash it and match the record.
Balancing test documentation: what specific evidence shows that the legitimate interest outweighs data subjects' rights?
EU AI Act (effective Aug 2026): does the training data provenance disclosure requirement apply here? Our model improves component extraction; it is not general-purpose AI.
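
The re-hash concern raised in the anonymization question can be shown in a few lines. This is an illustrative sketch only: the record layout and the candidate URLs are made up, and it assumes the hash is plain unsalted SHA-256 as stated above.

```python
import hashlib


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# A record as it might sit in shadow_dataset: no user_id, only the hash (field name assumed).
stored_record = {"url_hash": sha256_hex("https://example.com/pricing")}

# Anyone holding a list of candidate URLs can re-hash each one and compare.
candidates = ["https://example.org/", "https://example.com/pricing"]
matches = [url for url in candidates if sha256_hex(url) == stored_record["url_hash"]]
```

The hash cannot be inverted, but it can be confirmed against a guess, which is why the anonymization vs. pseudonymization question matters.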

Q2. Copyright attestation for user-imported URLs

Pattern: when importing a URL, the user confirms that the content is (a) Owned, (b) Licensed, or (c) Public domain.

ARNO implementation:

Mandatory checkbox on URL submit
Required license_type selection
Additional confirmation for Tranco top 15k commercial domains
Attestation stored 7 years (legal audit)
ARNO is not liable for third-party copyright claims
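
For discussion, the attestation flow above could be modelled roughly as follows. This is a sketch under assumptions: the field names, the `LicenseType` values, and the `can_import` helper are hypothetical and not taken from the spec; only the three license options, the extra confirmation for top domains, and the 7-year retention come from this brief.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

RETENTION_YEARS = 7  # from the brief: attestation stored 7 years


class LicenseType(Enum):
    OWNED = "owned"
    LICENSED = "licensed"
    PUBLIC_DOMAIN = "public_domain"


def add_years(moment: datetime, years: int) -> datetime:
    try:
        return moment.replace(year=moment.year + years)
    except ValueError:  # 29 February landing in a non-leap year
        return moment.replace(year=moment.year + years, day=28)


@dataclass(frozen=True)
class Attestation:
    url_hash: str
    license_type: LicenseType
    attested_at: datetime
    extra_confirmation: bool  # set when the domain is in the Tranco top 15k

    @property
    def retain_until(self) -> datetime:
        return add_years(self.attested_at, RETENTION_YEARS)


def can_import(checked: bool, license_type: Optional[LicenseType],
               in_tranco_top: bool, extra_confirmed: bool) -> bool:
    """Checkbox and license type are mandatory; top domains need a second confirmation."""
    if not checked or license_type is None:
        return False
    return extra_confirmed if in_tranco_top else True
```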

Specific questions:

Is user attestation sufficient legal protection? Or do we need DMCA registered agent + safe harbor compliance?
For "fair use" of extracted UI patterns (e.g. a small business imports apple.com by mistake):
- Is design pattern extraction "transformative" under fair use?
- Or does it constitute derivative work requiring permission?
Indemnification clause language — current draft:

"You agree to indemnify ARNO against any third-party claims related to content you import." Sufficient? Or need stronger language?
Cross-jurisdictional issues: a user in the EU imports a US-hosted site. Whose law governs?

Q3. Domain reputation system (Tranco-based)

Pattern: Tranco top 15k domains flagged as "commercial" → require extra user confirmation.

ARNO implementation:

Daily refresh Bloom filter (memory ~10KB)
1% false positive rate (Bloom filter inherent)
Attribution: "Domain classification powered by Tranco (tranco-list.eu), licensed under CC-BY 4.0" in the Privacy Policy footer
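
As a sanity check on the Bloom filter figures above, the textbook sizing formulas can be evaluated directly. This sketch assumes a standard Bloom filter with an optimal number of hash functions; the actual implementation may use a different variant.

```python
import math


def bloom_parameters(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Return (bits, hash_count) for a standard Bloom filter.

    bits   = -n * ln(p) / (ln 2)^2
    hashes = (bits / n) * ln 2
    """
    bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


bits, hashes = bloom_parameters(15_000, 0.01)
size_kib = bits / 8 / 1024  # roughly 17.5 KiB for 15k entries at 1%
```

Under these formulas, 15k entries at a 1% false positive rate need roughly 17.5 KiB, and a ~10KB filter holding 15k entries would sit closer to a 7% false positive rate. The memory or false-positive figure may be worth re-checking against the spec before it is quoted to users.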

Specific questions:

Tranco CC-BY 4.0 attribution — placement in Privacy Policy footer sufficient? Or needs more prominent (e.g. each ARNO dashboard page)?
Defamation risk — ARNO flags mywebsite.com as "commercial" → false positive embarrasses user. Liability?
GDPR Art. 22 automated decision-making — does the warning constitute "decision" affecting user?

Q4. GDPR retention matrix

ARNO implementation (from spec § XI.4):

Data	Retention
Component files (TSX, types, tokens)	User lifetime
Original screenshots	90 days
Original screenshots, days 90-365	Hash + DINOv2 binary only (not recoverable)
Original screenshots, 365+ days	Metadata only
Staging area (active)	< 90 days since last_activity
Staging area (notified)	90-120 days, email "30 days until deletion"
Staging area (deleted)	120 days (physical deletion)
Shadow url_hash	2 years
Shadow corrections	Anonymized after 90 days
User attestation	7 years
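
The staged rows of the matrix can be read as a simple age-to-stage mapping. The sketch below is one possible reading: the stage names are invented, and the treatment of the exact boundary days (90, 120, 365) is an assumption, since the matrix does not say which side each boundary falls on.

```python
def screenshot_stage(age_days: int) -> str:
    """What is still held for a screenshot of the given age."""
    if age_days < 90:
        return "original"
    if age_days < 365:
        return "hash_and_dinov2_binary"
    return "metadata_only"


def staging_stage(days_since_last_activity: int) -> str:
    """State of a staging area, measured from last_activity."""
    if days_since_last_activity < 90:
        return "active"
    if days_since_last_activity < 120:
        return "notified"  # the "30 days until deletion" email goes out at day 90
    return "deleted"
```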

Specific questions:

Right-to-erasure (Art. 17) timeline: the current promise is "30 days". Is this acceptable, or must it be stricter?
Hashed data as "personal data" under GDPR? Some interpretations: a hash of a known input is pseudonymization, not anonymization. Position needed.
"Legitimate business need" for 7-year attestation retention: sufficient? Or do we need a shorter timeframe?
Cascade deletion across B2 + DB: saga pattern with retries. Is the 30-day deletion window achievable? Or do we need a stricter SLA?
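
For context on the cascade deletion question, here is a minimal sketch of a retry-based deletion sequence. It is a simplified, forward-recovery reading of the saga pattern (steps are assumed idempotent and are retried rather than compensated) and is not necessarily how the spec implements it. The step names and helper functions are hypothetical.

```python
import time
from typing import Callable


def run_with_retries(step: Callable[[], None], attempts: int = 3, delay_s: float = 0.0) -> bool:
    """Run one deletion step, retrying on any exception with a linear back-off."""
    for attempt in range(1, attempts + 1):
        try:
            step()
            return True
        except Exception:
            if attempt < attempts:
                time.sleep(delay_s * attempt)
    return False


def cascade_delete(steps: dict[str, Callable[[], None]]) -> list[str]:
    """Run every step in order; return the names of steps that still failed.

    Anything returned here would have to be re-queued so that the whole
    cascade finishes inside the promised deletion window.
    """
    pending = []
    for name, step in steps.items():
        if not run_with_retries(step):
            pending.append(name)
    return pending
```

A call such as `cascade_delete({"b2_files": delete_b2, "db_rows": delete_db})`, where both callables are placeholders, returns an empty list once every store has confirmed deletion.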

Q5. Third-party processors disclosure

Services receiving user data:

Service	What	Why
Google (Gemini API)	Anonymized component metadata	Text/vision analysis
Cerebras	TSX code generation requests	Code generation
Modal Labs	Hosted compute (analysis workload)	GPU inference
Backblaze B2	Staging area files + shadow data	Storage
Hyperbrowser	Target URL (when anti-bot detected)	Browser automation
Liveblocks	ARNO editor real-time state	Collaboration

Specific questions:

GDPR Art. 28 data processing agreements (DPAs): needed for each vendor? We have standard DPAs available from each vendor. Sufficient to reference?
Schrems II / EU-US Data Privacy Framework: all of these services are US-based. For EU users, do we need SCCs (Standard Contractual Clauses) or DPF certification?
Sub-processor disclosure: do we have to list each vendor publicly in the Privacy Policy?
Data localization claims: can we offer "EU data residency"? Currently no: all processing is in the US. Impact for EU customers?

Q6. EU AI Act compliance (effective Aug 2026)

ARNO's AI usage:

Production: Gemini Flash-Lite for vision/text analysis (third-party)
Future (Q2+): self-hosted Qwen3-VL-32B + LoRA fine-tuned on user corrections

Specific questions:

Is ARNO's URL-import "AI system" under AI Act? We extract structured data — not making decisions about people.
Training data provenance — Qwen3-VL-235B-A22B teacher labels. Apache 2.0 license. We don't train on Anthropic/OpenAI outputs. Compliance with Art. 53 (training data disclosure)?
High-risk category check — Annex III categories. We don't impact:
- Biometric ID
- Critical infrastructure
- Education / employment
- Law enforcement
- Migration
- Justice
Compliance: low-risk system, basic transparency only?

Q7. California (CCPA / CPRA)

Specific questions:

Notice at Collection: is the current ToS clause sufficient for the "categories of personal information" disclosure?
"Sale" of personal information: we don't sell, but does anonymized data sharing with the Gemini API count?
"Sensitive Personal Information" under CPRA — do we collect any? (probably no — but verify)
GPC (Global Privacy Control) signal honoring — required to implement?

Q8. Children's data (COPPA, GDPR-K)

ARNO does not target minors. The user attests to being an adult during registration.

Question: do we need an explicit COPPA / GDPR-K disclaimer statement, or is a general ToS clause enough?

Q9. Liability limitations

Specific questions:

"As-is" disclaimer language — current ToS draft. Enforceable in all jurisdictions?
Limitation of liability cap: typically $X or 12-month fees? Recommendation for ARNO (small-business target market)?
Class action waiver / arbitration clauses — necessary for V1 launch?

Q10. Privacy Policy structure recommendation

Current draft has 5 clauses (Data Collection, URL Import, Tranco Attribution, Retention, Third-party processors).

Question: is the structure adequate, or are additional sections needed?

Standard GDPR Privacy Policy checklist (from regulators):

Which items are currently covered, and which are missing?

Recommended deliverables from lawyer

Validated ToS clauses (1-5): markup or rewrite
Validated Privacy Policy — full document for production
Risk assessment — biggest legal exposures + mitigation priority
Standard contract templates: DPA template for potential enterprise customers
30-day post-launch follow-up — track regulatory changes (EU AI Act effective date specifically)

ARNO-side preparation for the legal call

The lawyer should review BEFORE the call:

url_import_spec.md — full technical spec (especially § XI.3-XI.4)
url_import_tos_clause_draft.md — draft clauses
This document — Q&A pack

Vadim should be available for:

Specific scenarios discussion (edge cases)
Stack decisions justification (why Tranco, why default-ON, etc)
Cost-benefit trade-offs (which compliance costs are acceptable for V1)

Cross-references

url_import_spec.md § XV — blocker tracking
url_import_tos_clause_draft.md — pre-drafted clauses
ADR 0017 — shadow data approach rationale
ADR 0016 — staging area (retention implications)
