Licensing deals between AI companies and large publishers may be bad for pretty much everyone, especially everyone who does not directly receive a check from them.

Although the initial copyright lawsuits from large content companies like Getty Images and music labels are still very much ongoing (with new ones being filed regularly), recently we’ve also seen a series of licensing deals between large content owners and AI companies.

Setting aside the wisdom of the deal for any individual content company, I worry that these licensing deals represent a bad outcome for just about everyone else. Most of the companies entering into these agreements wield a relatively large amount of cultural power (which can be leveraged into public pressure) while controlling a relatively small corpus of works (relative to the volume of works required to train a model), backed up with enough legal power to pose a plausible threat to an AI company. That puts them in a position to demand compensation out of proportion to their actual contribution to any given model.

The deals that flow from this dynamic allow a small number of companies to claim a disproportionate amount of compensation for their relatively modest contributions to a training dataset. In doing so, the licenses establish a precedent that may undermine the fair use defense for unlicensed training of models, making it harder for smaller competitors to enter the AI market.

This might be a positive development if these deals also increased the likelihood that everyone who created data used to train models would receive significant compensation.* However, these deals likely marginally decrease the likelihood of that outcome by allowing the media companies signing them to soak up most of the available licensing dollars before the vast majority of people and companies whose data appears in the training datasets are ever involved. The likely outcome is one similar to Spotify, where large record labels and a handful of high-profile artists receive significant compensation, while everyone else receives fractions of pennies (or no pennies at all).

Licensing Dollar Roll Up

It is easy for anyone who wants to be paid a licensing fee by AI model trainers to see these deals as a positive development. They may set a precedent that training data must be licensed, and establish a market rate for data that applies to everyone else.

However, at this stage there does not appear to be any reason to see these deals as setting a standard for anything other than large (or large-ish) media companies and rightsholders. These deals do not set benchmarks for independent artists, or for anyone without the existing cultural and legal clout to demand them. After all, the terms of these deals aren’t even public.

It may be better to understand these deals as the large media companies and rightsholders jumping to the front of the line in order to soak up as much available licensing money as possible. Their incentive is to maximize the percentage of the licensing pool that they receive - not to set a standard on behalf of everyone else, or to grow the pie for others. In fact, every dollar of value that someone outside of the deal can claim is a dollar the large media companies cannot include in their own deal with the AI companies.

The result is that the large media companies leverage “creators should be paid” rhetoric to roll up all of the available licensing dollars, while making it marginally harder for anyone else to be paid for being part of the training data.

Which seems bad! As a bonus, these deals may undermine the fair use defense that allows the models to be created in the first place.

Blocking Competition

The copyright lawsuits over data used to train models all turn on whether or not the training is covered by fair use. If the act of training models on data is fair use, the trainers do not need permission from the data rightsholders (I think this is both the better reading of the law and the better policy outcome). If the act of training is not fair use, the trainers will need permission from every rightsholder of every bit of data they use to train their models.

Determining fair use involves applying a four-factor test, one factor of which is the effect of the use on the potential market for the data. I’m confident that the AI companies’ lawyers are crafting these agreements with an eye towards avoiding establishing a market for AI training data (available public information on the deals suggests that they are framed in terms of making it easier to access the data through APIs or other bulk data transfers, not as licenses to the data itself). Nonetheless, the existence of these deals probably does marginally increase the likelihood that courts would decide there is a functioning market for licensing training data.

If that were the case, and courts found that the majority of the other fair use factors pushed against a finding of fair use, that would mean that only companies with enough money to license training data at scale could train new AI models. I think this would probably be a bad policy outcome because it could effectively block new market entrants in AI. And working out the licensing process would be somewhere between complicated and impossible.

All of which makes these deals bad for pretty much everyone. They are bad for any creators who are not being directly paid by them, bad for anyone who would welcome new competition in AI, and bad for anyone who generally thinks that non-consumptive uses of information on the internet should be protected by fair use.

*I currently believe that compensating everyone who created data used to train models is a bad idea, but I understand why it is an attractive option to many people.

Hero Image: A nun frightened by a ghost playing a guitar; page 65 from the “Images of Spain” Album (F)
