TL;DR

Thorsten Meyer AI’s latest Control Series article argues that data has become the AI industry’s hardest chokepoint as public web text nears practical limits. The piece cites Epoch AI projections, Anthropic’s authors settlement and growing controls over expert, enterprise and sovereign datasets.

Thorsten Meyer AI’s AI Dispatch has published Part 3 of its Control Series, identifying data as the next AI chokepoint now that public-web training material is nearing practical limits and high-value datasets are moving behind contracts, lawsuits and government control. The piece matters because it shifts the AI power debate from who can rent GPUs to who owns data that no rival can easily copy.

The analysis cites Epoch AI’s estimate that the public internet contains about 300 trillion tokens of high-quality text, with frontier model training sets already approaching that level. Epoch’s projection, as described in the source material, places full use of public human text between 2026 and 2032, with a median around 2028. That timing is a forecast, not a confirmed endpoint.

The article says synthetic data has become a partial response. It points to Nvidia’s $320 million purchase of Gretel and Microsoft’s use of hundreds of billions of synthetic tokens as examples of how labs are trying to stretch training supply. But the source also warns that machine-generated training material can compound errors in domains where answers are hard to verify, increasing the value of fresh human and expert-made data.

The legal shift is central to the piece. AI Dispatch cites Anthropic’s $1.5 billion settlement with authors over pirated books, described in the source material as covering about 500,000 works at roughly $3,000 each. The settlement addresses past piracy claims and requires destruction of the pirated files, but it does not settle future training rules or model-output disputes. The New York Times case against OpenAI is still moving through discovery, while some publishers have turned to licensing deals.
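The settlement figures above can be cross-checked with simple arithmetic. The sketch below uses only the approximate numbers reported in the source material (about 500,000 works at roughly $3,000 each); the variable names are illustrative, and the real per-work payout may differ once fees and final claim counts are known.

```python
# Approximate figures as reported in the source material.
works_covered = 500_000          # about 500,000 works
payment_per_work_usd = 3_000     # roughly $3,000 per work

# Multiply to see whether the per-work figure matches the headline total.
implied_total_usd = works_covered * payment_per_work_usd

print(f"Implied total: ${implied_total_usd / 1e9:.1f} billion")
```

The product matches the reported $1.5 billion headline figure, so the per-work and total numbers are consistent with each other.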

AI Dispatch · The Control Series · Part 3

Chokepoint 03: Data

Data: The One Thing You Can’t Rent

The free part of “all human knowledge” is running out. As compute and models commoditize, the corpus you can’t replicate becomes the moat, so data is being fenced, priced, and, in places, treated as a national asset.

Data layers, from scarcest to most abundant (scarcity and value rise toward the top):

Sovereign / real-world (Avengers combat data, FSD, ISR): can’t be bought

Expert-authored (PhDs, lawyers, surgeons define “good”): the new gold

Licensed content (paywalled, deal-only, now priced): fenced

Public web text (scraped for free, exhausting ~2028): commoditizing

Key figures:

~300T: public text tokens, projected to be used up 2026–2032

$1.5B: Anthropic authors settlement, scraping era ends

$14.3B: Meta for 49% of Scale, triggered an exodus

“Keep the model”: Ukraine’s condition, data as sovereign asset

The take

Data was supposed to be the abundant input. It’s the scarce one. It’s also the chokepoint you can actually own, so guard your proprietary data, and don’t hand it to a provider who can become your competitor (the lesson everyone fled Scale to learn). Nations: license it like Ukraine. Keep the model, keep the leverage.

Sources: Epoch AI; PBS; Intl AI Safety Report 2026; NPR; Authors Guild; Wolters Kluwer; TechCrunch; TIME; CNBC; Ukraine MoD (2024–Jun 2026). Token estimates are projections; valuations as reported.

thorstenmeyerai.com · 03 / 06

Data Ownership Reshapes AI Power

The report’s core point is that compute can be rented, while exclusive data cannot be recreated from the open web. If frontier models become closer in capability and chips become cheaper to access, unique training material may become the harder advantage to match.

That shift affects companies, creators and governments. Large AI labs may be better placed to pay for licensed archives, expert feedback and private corpora. Startups could face higher entry costs. Enterprises also face a strategic risk: proprietary customer, product or workflow data can help a vendor improve systems that may later compete with the data owner.

From Scraped Web To Private Sets

The new article follows Part 2 of the Control Series, which focused on compute and said H100 rental rates had fallen 60% to 75% from their peak. Part 3 argues that falling compute costs make data scarcity more visible.

The source separates data into layers: public web text, licensed content, expert-authored material and sovereign or real-world data. It cites Meta’s reported $14.3 billion deal for a 49% stake in Scale AI as an example of how data supply chains can unsettle customers. It also cites Ukraine’s reported condition that battlefield data arrangements keep the model under Ukrainian control, framing some data as a national asset rather than a normal commercial input.
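As a rough illustration of the scale of the Meta deal, the reported price and stake imply a company valuation. The sketch below assumes a simple pro-rata calculation with no control premium or other deal terms, which is an assumption rather than something the source states; the variable names are illustrative.

```python
# Figures as reported in the source material.
stake_price_usd = 14.3e9   # reported $14.3 billion
stake_fraction = 0.49      # reported 49% stake

# Assumption: price scales linearly with ownership (no premium or side terms).
implied_valuation_usd = stake_price_usd / stake_fraction

print(f"Implied valuation: ${implied_valuation_usd / 1e9:.1f} billion")
```

Under that assumption, the deal implies a valuation of a little over $29 billion for Scale AI.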

“Data was supposed to be the abundant input. It’s the scarce one.”
— AI Dispatch

Data Limits Still Need Proof

Several points remain unsettled. The date when public text becomes fully used is a projection, and better algorithms may change how much data labs need. The long-term usefulness and risk of synthetic data also remain disputed.

Copyright law is still developing. Anthropic’s settlement covers past piracy claims, not future training practices or model outputs. Ongoing cases and private licensing deals may set the next boundaries.

Court Cases And Data Deals

The next milestones are likely to come from litigation, publisher licensing contracts, expert-data markets and enterprise AI terms. Companies using AI vendors will watch whether their private data can be used to train provider models. Governments may also write tighter rules for defense, health and other sensitive datasets.

Key Questions

What is the main news in this analysis?

AI Dispatch published a new Control Series article arguing that data has become the key AI chokepoint as public web text nears limits and valuable datasets move behind legal, commercial and state controls.

Is public web training data already exhausted?

No confirmed exhaustion date is given. The article cites Epoch AI projections that public human text may be fully used between 2026 and 2032, with a median around 2028.

Why did the Anthropic settlement matter?

The $1.5 billion settlement showed that training-data disputes can carry large costs. It also reinforced a split between legally acquired material and pirated files, while leaving future training and output questions unresolved.

Why can’t companies simply rent better data?

Some data is exclusive: enterprise records, expert work, operational logs, defense information and other private datasets. A rival can rent similar compute, but it cannot rent data that only one organization owns.

Source: Thorsten Meyer AI

Author

The Blissful Studio Team
