ADRs
ADR 0014 — Legal-clean distillation: Qwen teacher, не Anthropic/OpenAI

Context

Cost reduction multiplier 2 (ADR 0013) — data flywheel через quarterly LoRA training. Это требует teacher модели для labelling unlabelled production extractions.

Юзерский вопрос:

"А можно используя платный ресурс на нём обучать свой — что если 100 000 долларов прогнанные через платный у нас будет свой бесплатный?"

Naive answer: использовать Claude API ($X) для generating labels → training data → in-house модель → $0 long-term.

Critical issue (discovered после research):

Anthropic Usage Policy (Feb 2026) — explicit prohibition:

You may not use outputs from Anthropic services to:
  (a) train, fine-tune, or develop AI/ML models that compete
      with our services
  (b) extract embeddings or representations for downstream
      model training

OpenAI ToS — similar restrictions (analogous clauses).

Если использовать Claude или OpenAI outputs для training → ToS violation → legal exposure + service ban risk.

Decision

Teacher selection ограничена Apache 2.0 / MIT моделями (commercial use + redistribution + derivative works разрешены):

TeacherLicenseQuality vs Claude 4.6Cost per 1M out
Qwen3-VL-235B-A22B-InstructApache 2.0~92%$0.50
DeepSeek-V3DeepSeek License (commercial OK)~88%$0.27
Llama 3.3 70BLlama Community License~85%$0.59

Primary teacher: Qwen3-VL-235B (best quality + cleanest Apache 2.0). Backup: DeepSeek-V3 (cheaper, slightly less quality, commercial-friendly license).

Pipeline

Quarterly training:
  1. Collect ~10k production extractions с user corrections (gold labels)
  2. ~50k uncorrected extractions → Qwen3-VL-235B teacher labels them
  3. Split 90/10 train/holdout
  4. LoRA fine-tune Qwen3-VL-32B (Apache 2.0 student)
     - Rank 16, alpha 32, lr 1e-4
     - A100 на Modal, ~$110-300/run
  5. Validate ≥ pareto criteria (см § X.4)
  6. Deploy через 5% A/B → ramp

Все используемые модели Apache 2.0 → output redistributable, modifiable, без ToS restrictions на downstream training.

Why Qwen primary

  1. Best quality среди Apache 2.0 models (Qwen3 series)
  2. Same architecture family как student (Qwen3-VL-32B) → smoother distillation
  3. Active development — Alibaba commits to open-source releases
  4. Multilingual — handles RU/EN/CN sites uniformly
  5. Already adopted на Together AI, Replicate, Modal — easy infra

Why exclude Claude/OpenAI

  1. ToS violation — direct prohibition (Feb 2026)
  2. Conflict of interest — ARNO не должен зависеть от competitors as teacher
  3. Legal exposure — even hidden distillation discoverable through canary prompts; ban + lawsuit risk

Anthropic ToS detail

Quote из ADR 0015 (detailed analysis):

"You may not use outputs from Anthropic services to train, fine-tune,
or develop AI/ML models that compete with our services, or extract
embeddings or representations for downstream model training."

Это explicitly excludes:

  • Generating training labels via Claude (даже indirect через synthetic data)
  • Embedding extraction для downstream training
  • Fine-tuning student model on Claude responses

Pipeline исключает Claude entirely (см ADR 0015).

Consequences

Pros:

  • Pipeline fully legally compliant
  • No vendor lock-in to commercial AI providers
  • Student model Apache 2.0 → can be open-sourced eventually if ARNO chooses
  • Multilingual quality preserved (Qwen strong on EN/RU/CN)

Cons:

  • Slightly lower teacher quality vs Claude 4.6 (92% vs 100% baseline)
    • Mitigated: user corrections (gold labels) > teacher labels weight в training
  • Active migration if Qwen license changes
    • Mitigated: 3 teacher options, can switch

Risks

RiskMitigation
Qwen license changes (unlikely but possible)DeepSeek backup, Llama secondary
Quality gap teacher → student affects deploymentPareto criteria + escape valve (§ X.4)
New teacher model emerges better than QwenQuarterly evaluation, switch if pareto-improvement

Alternatives rejected

A. Use Claude API anyway (hope ToS not enforced)

  • ❌ ToS violation = service termination risk
  • ❌ Legal exposure on training data provenance audit
  • ❌ Conflict of interest (ARNO using competitor's product)

B. No distillation, pay Gemini forever

  • ❌ Breaks ADR 0013 cost reduction requirement
  • ❌ Vendor lock-in to Google

C. Train from scratch (no teacher)

  • ❌ 100×+ training cost
  • ❌ Need massive labelled dataset upfront

Cross-references