Written by Susan Miller

Precision Contract English for ML Training Data: Carve‑Outs for Model Training and Rights in Derivative Models

Worried your “training only” data license quietly slides into product deployment risk? This lesson shows you how to draft and negotiate precise carve‑outs that separate model training from productization, and how to allocate data, model, and output rights with US/UK nuance. You’ll get surgical explanations, side‑by‑side clause models, and real‑world examples, followed by targeted checks and short exercises to confirm comprehension. Expect clear, defensible language you can drop into your next ML data deal with confidence.

Concept framing and definitions

In machine‑learning data licensing, a “carve‑out” is a specific exception that clarifies what the licensee may not do, even if the general grant of rights sounds broad. In Precision Contract English for ML training data, the most important carve‑outs separate two activities: (1) model training and (2) productization (use of trained models in commercial products or outputs offered to third parties). A training‑use grant permits ingesting, copying, and transforming the dataset to tune or train models. A productization carve‑out withholds permission to deploy the trained model or its outputs in a commercial service unless the contract expressly extends the license. This separation is crucial because data owners often accept the lower risk of internal research but want control or compensation when a model or its outputs reach the market.

Equally important is to distinguish data rights from model rights. Data rights concern the licensed dataset itself—its copyright, database rights, contractual controls, and privacy constraints. Model rights concern the trained parameters, weights, and artifacts produced by training. Even when training is allowed, the licensor may restrict the creation of derivative models or claim ownership or veto rights over them. A third layer is output rights: who may use, sell, or license the text, images, code, or predictions generated by the trained model. A careful contract should draw clean boundaries among these three layers—dataset, model, and outputs—because they trigger different legal regimes (copyright, database rights, trade secrets, privacy/data protection, and often open‑source or platform terms).

A practical way to structure clauses is to adopt a baseline clause architecture that tracks the lifecycle of the ML workflow:

  • Grant of rights: Specify the license scope (copy, store, transform, analyze), territory, term, and permitted users (internal affiliates, contractors). Name training activities expressly (e.g., “train, fine‑tune, evaluate”).
  • Restrictions / carve‑outs: Withhold productization, distribution of the dataset, weight sharing, and competitive uses by default. Add privacy, security, and compliance constraints.
  • Ownership of models and outputs: State who owns trained models, fine‑tuned derivatives, and inference outputs; address sui generis database rights; reserve the licensor’s IP and moral rights.
  • Warranties: Tailor to data provenance, rights clearance, and lawful processing; clarify exclusions for statistical accuracy.
  • Indemnities: Allocate risk for IP infringement, privacy violations, and scraping controversies; distinguish first‑party and third‑party claims.
  • Enforcement: Provide audit rights, injunctive relief, suspension/termination triggers, model deletion or quarantine obligations, and post‑termination handling of checkpoints and embeddings.

When parties use the phrase “carve‑outs for model training,” they are typically trying to allow ingestion and experimentation while preventing silent scope creep into product deployment, model weight distribution, or output commercialization. Because ML systems blur the line between data and model, a precision contract must state whether training creates a derivative work of the dataset (often contested) and whether derivative model rights vest in the licensee, licensor, or both. The contract must also make clear whether outputs that resemble the dataset are restricted, how memorization is handled, and whether dataset‑specific filters or guardrails are required.

US vs UK drafting contrasts

US drafting often relies on broad grants with explicit, conspicuous exclusions and disclaimers. A US license may state generously that “Licensee may use the Data to train internal models” and then insert boldface or ALL CAPS carve‑outs for deployment, benchmarking publication, or redistribution. US contracts frequently include “AS IS” disclaimers, express waivers of implied warranties to the fullest extent permitted, and limitation‑of‑liability caps with carve‑outs for indemnity or willful misconduct. The style favors enumerated verbs—“copy, store, process, transform, train, fine‑tune, evaluate”—followed by a sharp list of prohibited acts—“no productization, no weight sharing, no re‑identification.” The goal is to create clarity through explicit waivers and the narrowing force of disclaimers and limitations.

UK drafting tends to begin with narrower grants and relies on careful management of implied terms under statute and common law. Rather than an expansive permission followed by loud exclusions, a UK clause may grant only what is needed—“use of the Data solely for internal model training and evaluation”—and then extend by exception. UK law’s reasonableness tests (e.g., under the Unfair Contract Terms Act) influence how disclaimers and limits are drafted. The language is often more conservative about excluding implied terms of quality or fitness and will justify limitations with “reasonable” standards. Verb choices favor restrained constructions like “may use for the Permitted Purpose,” “shall not deploy,” and “save as expressly stated.” The drafting also foregrounds data protection compliance (UK GDPR) with defined terms and layered obligations for DPIAs, anonymization, and international transfers.

On ownership, US forms commonly assert that trained models and their parameters are owned by the licensee unless expressly assigned, coupled with assurances that training does not transfer rights in the source data. UK forms may avoid blanket ownership declarations and instead tie ownership to the parties’ pre‑existing IPR (intellectual property rights), stating that no assignment occurs absent express wording, and clarifying that any new IPR in trained models belongs to the licensee subject to restrictions, or to the licensor where the data is particularly proprietary or bespoke. In both jurisdictions, you should watch how implied licenses, estoppel, and post‑termination residual knowledge are handled; the US will often include explicit residuals clauses, while UK drafters may pare these back or condition them on confidentiality carve‑outs.

Warranties and indemnities also show jurisdictional differences. US contracts routinely include specific indemnities for third‑party IP claims and privacy violations, backed by duty‑to‑defend language and control‑of‑defense mechanics. UK contracts often express indemnities against “Losses” with exclusions and proportionality, sometimes avoiding the US‑style defense obligations in favor of cooperation and consent mechanisms. The UK approach will frequently reference reasonableness, mitigation duties, and statutory compliance (including Data Protection Legislation definitions). Still, both systems need precision around provenance, scraping risk, and personal data treatment to avoid ambiguity traps.

Model clauses and rewrites (paired US/UK approaches)

A. Training‑use license with productization carve‑outs

  • US emphasis: A broad internal training license with conspicuous restrictions. Typical features include a well‑defined “Permitted Purpose,” explicit verbs for technical acts (copy, cache, embed, tokenize), and a bright‑line prohibition on production deployment, weight release, or use of fine‑tunes in customer‑facing tools without a separate commercial license. Disclaimers will be direct: no accuracy or fitness warranties, and no promise that training will not memorize or reproduce dataset elements; instead, obligations are placed on the licensee to implement guardrails.
  • UK emphasis: A narrower grant that strictly limits use to internal model development and evaluation for the Permitted Purpose. Productization is not merely prohibited; it is outside the grant. The clause will also reference UK GDPR compliance, lawful basis, and data minimisation, and will ensure that any international transfers or processors are governed by appropriate safeguards. Disclaimers rely on carefully drafted reasonableness language, and any exclusion of implied terms is measured and justified.

B. Ownership of derivative models and outputs

  • US emphasis: Clear statement that the licensor retains all rights in the Data, while the licensee owns models, fine‑tuned derivatives, and outputs, subject to restrictions (e.g., no disclosing model weights trained solely on Licensor Data; no use of outputs to reconstitute the Data). Include a carve‑out that training does not assign or license any exclusive rights in the Data, and that use of outputs must not infringe third‑party rights or violate content policies.
  • UK emphasis: Ownership follows existing IPR, with explicit confirmation that no assignment occurs absent express wording. The trained model’s IPR may vest in the licensee but is conditioned on non‑use beyond the Permitted Purpose without further agreement. Outputs are licensed for internal evaluation unless and until a commercialisation schedule is executed, and moral rights and database rights are reserved to the licensor. Reasonable steps must be taken to avoid reproducing protected expressions from the Data.

C. Warranties and indemnities tailored to training data

  • US emphasis: Licensor warrants that it has sufficient rights to license the Data for the specified training purpose and that it has complied with applicable notice/consent obligations where personal data exists, with liability for breach typically subject to a cap. Licensee warrants compliance with law, implementation of privacy safeguards, and adherence to technical restrictions (no re‑identification, no scraping of outputs to bypass terms). Indemnities cover third‑party IP infringement, data protection claims, and scraping‑related trespass/unfair competition allegations, often with duty‑to‑defend terms and exclusions where the licensee combines the Data with other sources or uses it beyond the licensed scope.
  • UK emphasis: Licensor offers limited warranties, framed as being to a reasonable standard and subject to statutory controls. Licensee gives warranties about compliance with Data Protection Legislation, carrying out DPIAs where necessary, and applying appropriate technical and organisational measures. Indemnities are crafted with reasonableness, proportionality, and mitigation obligations, with cooperation clauses and approval rights for settlements. Caps and carve‑outs align with UK reasonableness tests.

D. Enforcement mechanisms

  • US emphasis: Robust audit rights, rapid suspension for breach, and injunctive relief acknowledging irreparable harm if weights or datasets leak. Termination triggers include productization without permission, redistribution of weights, or re‑identification attempts. Post‑termination, models trained solely on the Data may need deletion or quarantine, with attestation. Survival clauses protect confidentiality, IP restrictions, and audit for a defined period.
  • UK emphasis: Similar tools with greater focus on proportionality and data protection compliance. Audits are subject to reasonable notice and confidentiality. Injunctions are available but framed alongside commitments to alternative dispute resolution. Termination includes mandatory deletion or anonymisation, with secure destruction certificates. The contract accommodates statutory regulator requests and lawful suspensions to meet compliance obligations.

Variations by data scenario add further precision. For open‑source or openly licensed data, clauses must harmonise with the source license; a US drafter may include compatibility statements and no‑copyleft triggers for weights, while a UK drafter may define the scope to avoid creating derivative databases under sui generis rights. For privacy‑laden data, both jurisdictions will elevate anonymisation standards, purpose limitation, and data subject rights handling. For scraped data, risk allocation around trespass, anti‑circumvention, and terms‑of‑service breaches becomes central, with indemnity carve‑outs if the licensee knowingly seeds models with disputed sources.

Diagnostic checklist and micro‑exercises

A workable diagnostic checklist helps you spot ambiguity traps and keep “Precision Contract English for ML Training Data: Carve‑Outs for Model Training and Rights in Derivative Models” at the forefront:

  • Scope clarity: Does the grant name the Permitted Purpose and technical verbs needed for training? Are productization, deployment, benchmarking publication, and weight sharing expressly outside scope?
  • Separation of layers: Are data rights, model rights, and output rights defined and treated distinctly? Is any claim that a trained model is a derivative work of the Data addressed explicitly?
  • Ownership and use‑of‑results: Who owns trained models, fine‑tunes, checkpoints, and embeddings? What rights exist to outputs, and are there safeguards against reproducing protected content?
  • Compliance matrix: Are privacy/data protection, IP, database rights, consumer protection, and scraping/ToS issues addressed? Are cross‑border transfers and processor obligations clear?
  • Warranties calibrated to reality: Are provenance, notice/consent, and lawful basis warranted by the licensor to the level the licensor can support? Are accuracy and fitness expressly not warranted?
  • Indemnity alignment: Do indemnities match the risk profile (IP, privacy, scraping) and exclude out‑of‑scope or licensee‑caused combinations? Are caps and defense controls jurisdiction‑appropriate?
  • Enforcement leverage: Are audit, suspension, and injunctive relief present? Are deletion/quarantine obligations and attestations defined for post‑termination handling of models and weights?
  • Ambiguity traps: Avoid vague phrases like “commercial use” without definition, undefined “derivative work,” or silent permission for outputs. Clarify memorisation controls and safeguards.

Applying this checklist ensures your clauses resist scope creep and maintain clean boundaries. It keeps the analysis grounded in the central theme: carve‑outs for model training versus productization and the allocation of rights in derivative models and outputs, with US/UK nuances carefully reflected. By systematically checking scope, ownership, compliance, and enforcement, you transform high‑risk generalities into precise, reliable contract language suitable for modern ML workflows.

Key Takeaways

  • Separate training from productization: licenses may allow internal training (copy, transform, fine-tune, evaluate) but withhold deployment, weight sharing, and output commercialization unless expressly granted.
  • Distinguish rights layers: data rights (dataset), model rights (trained parameters/derivatives), and output rights (generated content) must be defined and allocated separately.
  • Draft with jurisdictional nuance: US favors broad grants with explicit carve-outs and strong disclaimers/limits; UK favors narrow grants, reasonableness standards, and explicit UK GDPR compliance.
  • Lock in risk controls: clarify ownership of models/outputs, tailor warranties and indemnities to provenance and privacy risks, and include audit, suspension, and deletion/quarantine obligations for enforcement.

Example Sentences

  • The training-use grant allows us to copy, tokenize, and fine-tune models on the Dataset, but the productization carve-out prohibits deploying those models in our customer portal.
  • Our UK-form clause limits use to the Permitted Purpose—internal model training and evaluation—save as expressly stated, with outputs licensed only for internal testing.
  • Under the US form, the licensor retains all rights in the Data, while we own the derivative model’s weights, subject to a bright-line ban on weight sharing.
  • The agreement separates data rights, model rights, and output rights, and it states that training does not assign any exclusive rights in the Dataset.
  • Post-termination, any checkpoints trained solely on Licensor Data must be quarantined or deleted, with a written attestation within ten business days.

Example Dialogue

Alex: Legal sent back the data license—what does the productization carve-out actually block?

Ben: We can ingest, copy, and train internally, but we can’t deploy the trained model or sell its outputs without a separate commercialization addendum.

Alex: Do we at least own the fine-tuned weights?

Ben: Yes, under the US draft we own the derivative model, but there’s a no-weight-sharing restriction and strict output safeguards.

Alex: And in the UK version?

Ben: The grant is narrower—use only for the Permitted Purpose—and any commercialisation requires a new schedule, plus UK GDPR duties like DPIAs and lawful basis are spelled out.

Exercises

Multiple Choice

1. Which clause best reflects a productization carve-out in a training-use license?

  • Licensee may copy, tokenize, and fine-tune the Dataset for internal evaluation only; deployment of trained models in customer-facing products is prohibited absent a separate commercial license.
  • Licensee may use the Dataset for any lawful purpose, including selling outputs to third parties, provided proper attribution is given.
  • Licensee owns the Dataset and may redistribute it with trained weights to accelerate ecosystem adoption.
Show Answer & Explanation

Correct Answer: Licensee may copy, tokenize, and fine-tune the Dataset for internal evaluation only; deployment of trained models in customer-facing products is prohibited absent a separate commercial license.

Explanation: A productization carve-out allows internal training activities but withholds permission to deploy trained models or commercialize outputs without a separate agreement.

2. In distinguishing rights layers, which statement is most accurate?

  • Data rights and model rights are the same because trained weights are just copies of the Dataset.
  • Data rights govern the Dataset; model rights govern trained parameters and artifacts; output rights govern use of generated text, images, code, or predictions.
  • Output rights always belong to the licensor if the data was proprietary.
Show Answer & Explanation

Correct Answer: Data rights govern the Dataset; model rights govern trained parameters and artifacts; output rights govern use of generated text, images, code, or predictions.

Explanation: The lesson emphasizes clean boundaries among data rights, model rights, and output rights, as each triggers different legal regimes and restrictions.

Fill in the Blanks

Under the US drafting style, licenses often feature broad grants with explicit ___ for deployment, weight sharing, and redistribution.

Show Answer & Explanation

Correct Answer: carve-outs

Explanation: US forms frequently rely on expansive permissions followed by conspicuous carve-outs that withhold productization and other activities.

The UK approach limits use to the Permitted Purpose and foregrounds compliance with ___, including DPIAs and lawful international transfers.

Show Answer & Explanation

Correct Answer: UK GDPR

Explanation: UK drafting emphasizes statutory data protection compliance, specifically referencing UK GDPR and related obligations.

Error Correction

Incorrect: The license lets us productize by default as long as we only trained internally.

Show Correction & Explanation

Correct Sentence: The license permits internal training, but productization is prohibited unless expressly licensed in a separate agreement.

Explanation: A training-use grant does not imply productization rights; productization is carved out and requires explicit extension.

Incorrect: Because we trained the model, we automatically own all rights in the Dataset and may redistribute it.

Show Correction & Explanation

Correct Sentence: Training does not transfer rights in the Dataset; we may own the trained model subject to restrictions, but the licensor retains all rights in the Data and redistribution is prohibited.

Explanation: The lesson separates data rights from model rights. Ownership of trained models does not confer ownership of or redistribution rights in the Dataset.