Licensing Deals Between AI Companies and Large Publishers are Probably Bad

Licensing deals between AI companies and large publishers may be bad for pretty much everyone, especially everyone who does not directly receive a check from them.

Although the initial copyright lawsuits from large content companies like Getty Images and music labels are still very much ongoing (with new ones being filed regularly), recently we’ve also seen a series of licensing deals between large content owners and AI companies.

Setting aside the wisdom of the deal for any individual content company, I worry that these licensing deals represent a bad outcome for just about everyone else. Most of the companies entering into these agreements combine a relatively large amount of cultural power (which can be leveraged into public pressure) with a relatively small corpus of works (relative to the number of works required to train a model), backed up by enough legal power to qualify as a plausible threat to an AI company. That puts them in a position to demand compensation that is out of proportion to their actual contribution to any given model.

The deals that flow from this dynamic allow a small number of companies to claim a disproportionate amount of compensation for their relatively modest contributions to a training dataset. In doing so, the licenses establish a precedent that may undermine the fair use defense for unlicensed training of models, making it harder for smaller competitors to enter the AI market.

This might be a positive development if these deals also increased the likelihood that everyone who created data used to train models would receive significant compensation.* However, these deals likely marginally decrease the likelihood of that outcome by allowing the media companies signing these deals to soak up most of the available licensing dollars before the vast majority of people and companies who created data in the training datasets are involved. The most likely outcome could be one similar to Spotify, where large record labels and a handful of high-profile artists receive significant compensation, while everyone else receives fractions of pennies (or no pennies).

Licensing Dollar Roll Up

It is easy for anyone who wants to be paid a licensing fee by AI model trainers to see these deals as a positive development. They may set a precedent that data must be licensed, and a market rate for data that applies to everyone else.

However, at this stage there does not appear to be any reason to see these deals as setting a standard for anything other than large (or large-ish) media companies and rightsholders. These deals do not set benchmarks for independent artists, or for anyone without the existing cultural and legal clout to demand them. After all, the terms of these deals aren’t even public.

It may be better to understand these deals as the large media companies and rightsholders jumping to the front of the line in order to soak up as much available licensing money as possible. Their incentive is to maximize the percentage of the licensing pool that they receive - not to set a standard on behalf of everyone else, or to grow the pie for others. In fact, every dollar of value that someone outside of the deal can claim is a dollar the large media companies cannot include in their own deal with the AI companies.

The result is that the large media companies leverage “creators should be paid” rhetoric to roll up all of the available licensing dollars, while making it marginally harder for anyone else to be paid for being part of the training data.

Which seems bad! As a bonus, these deals may undermine the fair use defense that allows the models to be created in the first place.

Blocking Competition

The copyright lawsuits over data used to train models all turn on whether or not the training is covered by fair use. If the act of training models on data is fair use, the trainers do not need permission from the data rightsholders (I think this is both the better reading of the law and the better policy outcome). If the act of training is not fair use, the trainers will need permission from every rightsholder of every bit of data they use to train their models.

Determining fair use involves applying a four-factor test; one of those factors is the effect of the use on the potential market for the data. I’m confident that the AI company lawyers are crafting these agreements with an eye towards avoiding establishing a market for AI training data (the available public information suggests the deals are framed in terms of making it easier to access the data through APIs or other bulk data transfers, not as licenses to the data itself). Nonetheless, the existence of these deals probably does marginally increase the likelihood that courts would decide that there is a functioning market for licensing training data.

If that were the case, and courts found that the majority of the other fair use factors pushed against a finding of fair use, that would mean that only companies with enough money to license training data at scale could train new AI models. I think this would probably be a bad policy outcome because it could effectively block new market entrants in AI. And working out the licensing process would be somewhere between complicated and impossible.

All of which makes these deals bad for pretty much everyone. They are bad for any creators who are not being directly paid by them, bad for anyone who would welcome new competition in AI, and bad for anyone who generally thinks that non-consumptive uses of information on the internet should be protected by fair use.

*I currently believe that compensating everyone who created data used to train models is a bad idea, but I understand why it is an attractive option to many people.

Hero Image: A nun frightened by a ghost playing a guitar; page 65 from the “Images of Spain” Album (F)

Is There A Coherent Theory of Attributing AI Training Data?

It feels like any time I have a conversation about attributing data used to train AI models, the completely understandable impulse to want attribution starts to break down when confronted with some practical implementation questions.

This is especially true in the context of training data that comes from open communities. These communities rely on some sort of open license that requires attribution (say, CC BY or MIT), and have been built on a set of norms that place a high value on attribution. Regardless of whether or not complying with the license is legally required in these scenarios, many members of the community view attribution as a key element of its social contract.

Since attribution is already solving a number of problems in these communities, it is not hard to imagine a situation where attribution could solve some additional social and political problems related to the growth of these models. These problems tend to be most acute around LLMs and generative AI, but are also relevant to a broader set of AI/ML models.

This post is my attempt to describe some of the practical implementation issues that come to mind when talking about attribution and AI training datasets. It is not intended to be a list of things that advocates for attribution “must fix” before attribution makes sense, or a list of reasons why attribution is impossible. It also is not intended to advocate for or against the value or legal necessity of attribution for AI training data.

Instead, it is a list of things that I would like to figure out before being convinced that attribution is something worth pursuing. At the end, it also flags one lesson from existing open source software licensing that makes me somewhat more skeptical of this approach.

The Simple Model

Let’s start with the simple model of how large, foundational models are trained.

If you wanted to train one of these models, you might start with something like Common Crawl, a repository of 2.7 billion web pages, or LAION-400-Million, a collection of 400 million images paired with their captions. You could throw these datasets into the pot with millions of dollars of computational power, plus some time, and get your very own AI model (again, this is the simple model of how all of this works).

You know all of the data and datasets you used to train the model, so when you release the model you include a readme.txt file that has 3 billion or so entries listing all of the sources you used. Problem solved?
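
To make the simple model concrete, here is a rough sketch of what generating that giant attribution file might look like. The file names and column names are hypothetical, not the actual schema of Common Crawl or LAION.

```python
import csv

# Hypothetical input: a dataset index CSV with one row per training item and
# columns like url, creator, license. These field names are illustrative only.
def write_attribution_manifest(index_path: str, out_path: str) -> int:
    count = 0
    with open(index_path, newline="", encoding="utf-8") as index_file, \
         open(out_path, "w", encoding="utf-8") as readme:
        for row in csv.DictReader(index_file):
            creator = row.get("creator") or "unknown"
            license_id = row.get("license") or "unspecified"
            readme.write(f"{row['url']}\t{creator}\t{license_id}\n")
            count += 1
    return count

# Usage with hypothetical file names:
# n = write_attribution_manifest("training_index.csv", "README-ATTRIBUTION.txt")
# print(f"Listed {n} training sources.")
```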

What Works About This Approach

You have given attribution! Each bit of information you used to train the model is right there in the list, alongside the 3 billion other things that went into the pot.

What Doesn’t Work About This Approach

Maybe this solves the problem? Or maybe not?

Everything maps to everything

One problem with this approach is that an undifferentiated list of all of the training data might not be the kind of attribution people are looking for. On some level, if you are training a model from scratch, every data point contributes equally to the creation of that model and to every given output of the model.

However, it is also easy to come to an intuitive belief that some data might be more important than others when it comes to a specific output (there are also more quantitative ways one might come to this conclusion). Also, what if it is possible to have a model unlearn a specific bit of data and then continue to perform the same way it performed before unlearning that bit of data? If your model starts generating poetry, the training data that is just a list of calculus problems and solutions might feel less important than pages of poems.

Does that mean that your attribution should prioritize some data points over others when it comes to specific outputs? How could you begin to rank this? Is there a threshold below which something shouldn’t be listed at all? Is there a point where some training data should get so much credit that it gets some sort of special recognition? Would you make this evaluation on a per-model basis, or on a per-output basis?

Is this meaningful attribution?

Being listed as one of 3 billion data points used to train a model is attribution in a literal sense, but is it attribution in a meaningful sense? Should that distinction matter?

The Creative Commons licenses require that a licensee give “appropriate credit,” which is defined mostly in terms of the information it contains, not the form the attribution takes. The CC Wiki contains further best practices for attribution, while being upfront that “Because each use case is different, you can decide what form of attribution is most suitable for your specific situation.”

Regardless of whether or not the 3 billion line readme.txt file is legally compliant with a CC license, there may be a significant number of creators who feel that it does not meaningfully address their wishes. Of course, it will also always be impossible to address the wishes of all 3 billion creators anyway. To the extent that attribution is a social/political solution instead of a legal solution, not addressing the wishes of some critical mass of those creators will significantly reduce its utility. Open Future’s Alignment Assembly on AI and the Commons is interesting to consider in this context. Regardless, if listing people in a 3 billion line file does not meet the expectations of a critical mass of people, is it worth imposing as an expectation?

There is a similar, yet also somewhat distinct, version of this question when it comes to open source software. Under most open source software licenses, attribution means including specific text in a file bundled with the code. However, as discussed more at the end of this post, even that can be of questionable utility at scale.

Is this just a roadmap for lawsuits?

Is disclosing everything you used to train your model just a roadmap for lawsuits? This question is a bit harder to answer. IF training models requires the permission of the rightsholder AND the 3 billion entry readme.txt does not meet a permissive license’s attribution requirements, then training on data that requires attribution without providing that attribution is infringement. But that’s a big IF.

If it is fair use to train AI models on unlicensed data, listing the training data won’t be a roadmap for lawsuits because creators don’t have a copyright claim to bring against trainers.

Also, current experience suggests that lawsuits aren’t waiting for this particular roadmap anyway.

That means that the roadmap question may be best answered by examining the underlying fair use question, not by weighing the value of attribution. The fair use fate of AI training is probably much more relevant to that question than attribution compliance is. So I’m going to call this concern both worth flagging and out of scope for this post.

The More Complicated Models

The simple model is just that - a simplified way of thinking about model training. Some of the things that happen in real-world training further complicate things (for a great illustration of many of these relationships, check out Christo Buschek & Jer Thorp’s Models All The Way Down).

Models Trained by Models

As these models are trained on large datasets, it probably should not come as a surprise that trainers are using other AI models to structure, parse, and prepare datasets before they are used to train the model. These structuring, parsing, and preparing models are contributing to the final output model, at least in some ways.

Should the attribution for the final model include all of the data used to train those models as well? Should it differentiate between data used to train the “helper” models and the primary model? How recursively should that obligation extend? What should it mean if the users of those helper models do not have access to the data used to train them?
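
To make the recursion question concrete, here is a minimal sketch of what collecting attribution across a chain of helper models could look like. The lineage structure is entirely hypothetical; real training pipelines rarely document their dependencies this cleanly.

```python
# Walk a hypothetical lineage graph (model -> the datasets and helper models
# used to build it) and collect every dataset reachable from the final model.
def collect_training_sources(model: str, lineage: dict[str, dict]) -> set[str]:
    node = lineage.get(model, {})
    sources = set(node.get("datasets", []))
    for helper in node.get("built_with_models", []):
        sources |= collect_training_sources(helper, lineage)
    return sources

# Invented example:
# lineage = {
#     "final-model": {"datasets": ["web-crawl"], "built_with_models": ["filter-model"]},
#     "filter-model": {"datasets": ["labeled-quality-examples"], "built_with_models": []},
# }
# collect_training_sources("final-model", lineage)
# -> {"web-crawl", "labeled-quality-examples"}
```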

Models Tuned by Models

Large, foundational models are now being tuned to do all sorts of specific tasks. This tuning can include building on an open model like Llama, or importing your own dataset to build a custom RAG.

In cases like this, should attribution include all of the data from the foundational model, plus all of the customized training data? If someone is doing their own tuning without access to a complete list of the training data, is releasing the list of the data they used enough? Should tuning a model built on attributed data create an obligation to release your own tuning data?
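
As one illustration of how these questions might be answered in practice, here is a hedged sketch of the most modest possible approach: a fine-tuner simply appends their own data sources to whatever attribution list shipped with the base model. The file names and format are invented for the example.

```python
from pathlib import Path

# Concatenate the base model's attribution list (if one exists) with a list of
# the tuner's own tuning or RAG sources. Everything here is hypothetical.
def merge_attribution(base_manifest: str | None, tuning_sources: list[str],
                      out_path: str) -> None:
    lines: list[str] = []
    if base_manifest and Path(base_manifest).exists():
        lines.append("# Inherited from the base model")
        lines.extend(Path(base_manifest).read_text(encoding="utf-8").splitlines())
    else:
        lines.append("# Base model attribution unavailable")
    lines.append("# Added during tuning")
    lines.extend(tuning_sources)
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")

# merge_attribution("base_model_ATTRIBUTION.txt",
#                   ["rag_docs/handbook.pdf", "rag_docs/faq.md"],
#                   "tuned_model_ATTRIBUTION.txt")
```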

Models Optimized by People Who Built Other Models

To me, this is one of the most conceptually interesting situations, and one that highlights the various ways that “learn” is used in these conversations.

The first two scenarios in this section describe some sort of direct lineage between models. However, there are also less direct, more human-mediated linkages. If a team builds model A, they will bring whatever they learned in that process to building model B. That’s true even if model A isn’t “used” to build model B in some sort of literal sense. Should the release notes for model B still attribute the data used to build model A, since that is where the team’s knowledge came from?

This is not a hypothetical scenario. For example, OpenAI describes a “research path” linking GPT, GPT-2, GPT-3, and GPT-4, explaining that they take what they learn in developing each model and apply it to the next one:

A year ago, we trained GPT-3.5 as a first “test run” of the system. We found and fixed some bugs and improved our theoretical foundations. As a result, our GPT-4 training run was (for us at least!) unprecedentedly stable, becoming our first large model whose training performance we were able to accurately predict ahead of time.

If your data was used to create GPT-3.5, which taught OpenAI something about training models that allowed them to avoid using your data to train GPT-4, have you still made a creditable contribution to GPT-4? How far back should we follow this chain of logic?

Lessons Learned from Open Source Software

AI is not the first time that open communities have had to wrestle with the limits of attribution requirements, or with long, nested attribution documents. While there are examples of these across open communities, the most relevant examples probably come from open source software.

Software is made of other software. Most pieces of software contain a number of other pieces of open source software with licenses that require attribution. In practice, this takes the form of super long text files tucked into software releases.

As Kate Downing points out, the value of the attribution requirement in modern software seems to be pretty low. Software may come with pages and pages of attributions that comply with the license requirement (Downing links to a not-particularly-unique example where the attribution document is 15,385 pages long), but it is unclear if the existence of that document is much more than a compliance chore for a responsible maintainer.
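
For a sense of how those documents get generated, here is a rough sketch of the kind of script that produces them: walk the bundled dependencies, grab each license file, and concatenate everything into one enormous notices file. The directory layout is hypothetical, and real build tooling is more sophisticated, but the output is the same wall of text.

```python
from pathlib import Path

# Walk a vendored-dependencies directory and concatenate each package's
# license text into a single notices file. Layout and names are hypothetical.
def bundle_third_party_notices(vendor_dir: str, out_path: str) -> None:
    sections = []
    for license_file in sorted(Path(vendor_dir).rglob("LICENSE*")):
        package = license_file.parent.name
        text = license_file.read_text(encoding="utf-8", errors="replace")
        sections.append(f"===== {package} =====\n{text}")
    Path(out_path).write_text("\n\n".join(sections), encoding="utf-8")

# bundle_third_party_notices("vendor/", "THIRD-PARTY-NOTICES.txt")
```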

To give another illustration of how long and recursive these documents can be: while drafting this post I also had to update the BIOS on my printer. The tool I was using required me to agree to the license, and then listed what would have been many pages of attribution. This happened to be the last attribution listed (and therefore the only one I actually saw):

screenshot of license attribution text crediting 'Ty Coon, President of Vice' dated 1 April 1989

How important is the attribution requirement to Ty Coon, President of Vice and creator of some component of my printer driver update utility? Hard to say. update 7/15/24: as pointed out by Luis Villa on Mastodon, Ty Coon is the name used in the sample attribution statement at the bottom of earlier versions of the GPL. This occasionally creates confusion beyond this blog. The one snippet of the credit I saw when using the utility was not even the (fake) name of one of its many contributors - it was just the end of the appendix to the license they used, pushing the actual names outside of my terminal window.

Given this history, would it be productive to import this practice into training AI models? I’m skeptical.

update 7/17/24: Back in 2016, Luis Villa also wrote a series of blog posts about a very similar attribution problem that argued against trying to use copyleft for databases. If you made it this far, they are very much worth checking out, because most of those problems reappear in the context of AI.

Hero Image: Pidgeon Hole. A Convent Garden Contrivance to Coop up the Gods

Clearing Rights for a 'Non-Infringing' Collection of AI Training Media is Hard

In response to a number of copyright lawsuits about AI training datasets, we are starting to see efforts to build ‘non-infringing’ collections of media for training AI. While I continue to believe that most AI training is covered by fair use in the US, and is therefore inherently ‘non-infringing’, I think these efforts to build ‘safe’ or ‘clean’ (or whatever other word one might use) datasets are quite interesting. One reason they are interesting is that they can help illustrate why trying to build such a dataset at scale is such a challenge.

That’s why I was excited to read about Source.Plus (via a post from Open Future). Source.Plus is a tool from Spawning that purports to aggregate over 37 million “public domain and CC0 images integrated from dozens of libraries and museums.” That’s far fewer than are used to train current generative models, but still a lot of images that could be used for all sorts of useful things.

However, it didn’t take too much poking around on the site to find an illustration of why accurately aggregating nominally openly licensed images at scale can be such a challenge.

The site has plenty of OpenGLAM images that are clearly old enough to be in the public domain. It also has a number of newer images (like photographs) that are said to be licensed under CC0. Curious, I clicked on the first photograph I found on the Source.Plus home page:

photograph of a library reading room full of patrons shot from above

According to the image page on Source.Plus, the image was from Wikimedia Commons and licensed under a CC0 public domain dedication. It listed the creator as Pixabay and the uploader (to Wikimedia) as Philipslearning.

Clicking through to the Wikimedia page reveals that the original source for the image was Pixabay, and that it was uploaded on March 9, 2023 by Philipslearning (an account that appears to no longer exist, for whatever that is worth). The file metadata says that the image itself was taken on May 18, 2016.

Clicking through to the Pixabay page for the image reveals that the image is available under the Pixabay Content License. That license is fairly permissive, but does state:

  • You cannot sell or distribute Content (either in digital or physical form) on a Standalone basis. Standalone means where no creative effort has been applied to the Content and it remains in substantially the same form as it exists on our website.
  • If Content contains any recognisable trademarks, logos or brands, you cannot use that Content for commercial purposes in relation to goods and services. In particular, you cannot print that Content on merchandise or other physical products for sale.
  • You cannot use Content in any immoral or illegal way, especially Content which features recognisable people.
  • You cannot use Content in a misleading or deceptive way.
  • You cannot use any of the Content as part of a trade-mark, design-mark, trade-name, business name or service mark.

Which is to say, not CC0.

However, further investigation via the Pixabay Wikipedia page suggests that images uploaded to Pixabay before January 9, 2019 were actually released under CC0. Section 4 of the Pixabay terms confirms that. The additional information on the image’s Pixabay page confirms that it was uploaded on May 17, 2016 (which matches the metadata added by the unknown Philipslearning on the image’s Wikimedia page).

All of which means that this image is, in all likelihood, available under a CC0 public domain dedication. Which is great! Everything was right!

At the same time, the accuracy of that status feels a bit fragile. That fragility is manageable in the context of Wikipedia, or if you are looking for a handful of openly licensed images. Is it likely to hold up at training set scale, across tens of millions of images? Maybe? What does it mean to be ‘good enough’ in this case? If trainers do require permission from rightsholders to train, and one relied on Source.Plus/Wikimedia for the CC0 status of a work, and that status turned out to be incorrect, should the fact that they thought they were using a CC0 image be relevant to their liability?
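
For a sense of what automating that provenance check would involve, here is a minimal sketch of the date-based rule described above. The cutoff comes from the Pixabay terms; the function name and inputs are hypothetical, and a real pipeline would also have to trust that the upload date and the source chain were recorded correctly in the first place.

```python
from datetime import date

# Per the Pixabay terms discussed above: images uploaded before this date were
# released under CC0; later uploads fall under the Pixabay Content License.
PIXABAY_CC0_CUTOFF = date(2019, 1, 9)

def effective_pixabay_license(upload_date: date) -> str:
    if upload_date < PIXABAY_CC0_CUTOFF:
        return "CC0-1.0"
    return "Pixabay Content License"

# For the image discussed above, uploaded May 17, 2016:
# effective_pixabay_license(date(2016, 5, 17))  # -> "CC0-1.0"
```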

Hero Image: Interieur van de Bodleian Library te Oxford

Make Government-Funded Hardware Open Source by Default

Earlier this year the Federation of American Scientists (FAS), Center for Open Science, and the Wilson Center held an open science policy sprint to source and develop actionable policy ideas aimed at improving scientific transparency, equity, and innovation. Some heroic editing from the FAS team (especially Jordan Dworkin and Grace Wickerson) helped transform “uh, if the government pays for hardware it should be open source” into the actual proposal below. You can see the original version in situ here.

While scientific publications and data are increasingly made publicly accessible, designs and documentation for scientific hardware — another key output of federal funding and driver of innovation — remain largely closed from view. This status quo can lead to redundancy, slowed innovation, and increased costs. Existing standards and certifications for open source hardware provide a framework for bringing the openness of scientific tools in line with that of other research outputs. Doing so would encourage the collective development of research hardware, reduce wasteful parallel creation of basic tools, and simplify the process of reproducing research. The resulting open hardware would be available to the public, researchers, and federal agencies, accelerating the pace of innovation and ensuring that each community receives the full benefit of federally funded research.

Federal grantmakers should establish a default expectation that hardware developed as part of federally supported research be released as open hardware. To retain current incentives for translation and commercialization, grantmakers should design exceptions to this policy for researchers who intend to patent their hardware.

Details

Federal funding plays an important role in setting norms around open access to research. The White House Office of Science and Technology Policy (OSTP)’s recent Memorandum Ensuring Free, Immediate, and Equitable Access to Federally Funded Research makes it clear that open access is a cornerstone of a scientific culture that values collaboration and data sharing. OSTP’s recent report on open access publishing further declares that “[b]road and expeditious sharing of federally funded research is fundamental for accelerating discovery on critical science and policy questions.”

These efforts have been instrumental in providing the public with access to scientific papers and data — two of the foundational outputs of federally funded research. Yet hardware, another key input and output of science and innovation, remains largely hidden from view. To continue the move towards an accessible, collaborative, and efficient scientific enterprise, public access policies should be expanded to include hardware. Specifically, making federally funded hardware open source by default would have a number of specific and immediate benefits:

Reduce Wasteful Reinvention. Researchers are often forced to develop testing and operational hardware that supports their research. In many cases, unbeknownst to those researchers, this hardware has already been developed as part of other projects by other researchers in other labs. However, since that original hardware was not openly documented and licensed, subsequent researchers are not able to learn from and build upon this previous work. The lack of open documentation and licensing is also a barrier to more intentional, collaborative development of standardized testing equipment for research.

Increase Access to Information. As the OSTP memo makes clear, open access to federally funded research allows all Americans to benefit from our collective investment. This broad and expeditious sharing strengthens our ability to be a critical leader and partner on issues of open science around the world. Immediate sharing of research results and data is key to ensuring that benefit. Explicit guidance on sharing the hardware developed as part of that research is the next logical step towards those goals.

Alternative Paths to Recognition. Evaluating a researcher’s impact often includes an assessment of the number of patents they can claim. This is in large part because patents are easy to quantify. However, this focus on patents creates a perverse incentive for researchers to erect barriers to follow-on study even if they have no intention of using patents to commercialize their research. Encouraging researchers to open source the hardware developed as part of their research creates an alternative path to evaluate their impact, especially as those pieces of open source hardware are adopted and improved by others. Uptake of researchers’ open hardware could be included in assessments on par with any patented work. This path recognizes the contribution to a collective research enterprise.

Verifiability. Open access to data and research are important steps towards allowing third parties to verify research conclusions. However, these tools can be limited if the hardware used to generate the data and produce the research is not itself open. Open sourcing hardware simplifies the process of repeating studies under comparable conditions, allowing for third-party validation of important conclusions.

Recommendations

Federal grantmaking agencies should establish a default presumption that recipients of research funds make hardware developed with those funds available on open terms. This policy would apply to hardware built as part of the research process, as well as hardware that is part of the final output. Grantees should be able to opt out of this requirement with regards to hardware that is expected to be patented; such an exception would provide an alternative path for researchers to share their work without undermining existing patent-based development pathways.

To establish this policy, OSTP should conduct a study and produce a report on the current state of federally funded scientific hardware and opportunities for open source hardware policy.

  • As part of the study, OSTP should coordinate and convene stakeholders to discuss and align on policy implementation details — including relevant researchers, funding agencies, U.S. Patent and Trademark Office officials, and leaders from university tech transfer offices.

  • The report should provide a detailed and widely applicable definition of open source hardware, drawing on definitions established in the community — in particular, the definition maintained by the Open Source Hardware Association, which has been in use for over a decade and is based on the widely recognized definition of open source software maintained by the Open Source Initiative.

  • It should also lay out a broadly acceptable policy approach for encouraging open source by default, and provide guidance to agencies on implementation. The policy framework should include recommendations for:

    • Minimally burdensome components of the grant application and progress report with which to capture relevant information regarding hardware and to ensure planning and compliance for making outputs open source
    • A clear and well-defined opportunity for researchers to opt out of this mandate when they intend to patent their hardware

The Office of Management and Budget (OMB) should issue a memorandum establishing a policy on open source hardware in federal research funding. The memorandum should include:

  • The rationale for encouraging open source hardware by default in federally funded scientific research, drawing on the motivation of public access policies for publications and data

  • A finalized definition of open source hardware to be used by agencies in policy implementation

  • The incorporation of OMB’s Open Source Scientific Hardware Policy, in alignment with the OSTP report and recommendations

Conclusion

The U.S. government and taxpayers are already paying to develop hardware created as part of research grants. In fact, because there is not currently an obligation to make that hardware openly available, the federal government and taxpayers are likely paying to develop identical hardware over and over again.

Grantees have already proven that existing open publication and open data obligations promote research and innovation without unduly restricting important research activities. Expanding these obligations to include the hardware developed under these grants is the natural next step.

Hero image: Crop of Andrew Carnegie, Smithsonian Open Access Collection

Licenses are Not Proxies for Openness in AI Models

Earlier this year, the National Telecommunications and Information Administration (NTIA) requested comment on a number of questions related to what it is calling “open foundational models.” This represents the US Government starting to think about what “open” means in the context of AI and machine learning.

The definition of open in the context of AI and machine learning is more complicated than it is in software, and I assume that many people are going to submit many interesting comments as part of the docket.

I also submitted a short comment. It focused on a comparatively narrow issue: whether or not it makes sense to use licenses as an easy way to test for openness in the context of AI models. I argued that it does not, at least not right now.

There are many situations where licenses are used as proxies for “open”. A funder might require that all software be released under an OSI-approved open source software license, or that a journal article be released under a Creative Commons license. In these cases, using the license is essentially an easy way to confirm that the thing being released really is open.

At a basic level, these systems work because of two things: 1) the thing being licensed is relatively discrete, and 2) the licenses used are mature and widely adopted within the community.
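
To illustrate why that proxy works so well for software, here is a minimal sketch of what a funder’s compliance check can reduce to. The allowlist is a small, illustrative subset of OSI-approved SPDX identifiers, not an authoritative list.

```python
# When the licenses are mature and widely adopted, checking "open" reduces to
# comparing a declared license identifier against a community-maintained list.
# This subset is illustrative only.
OSI_APPROVED_SUBSET = {"MIT", "Apache-2.0", "GPL-3.0-only", "BSD-3-Clause", "MPL-2.0"}

def is_open_source(declared_license: str) -> bool:
    return declared_license.strip() in OSI_APPROVED_SUBSET

# There is no equivalent one-line check for an AI model, where "open" might turn
# on weights, code, and training data, each released (or not) on different terms.
```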

Open source hardware acts as a helpful contrast to these other examples. Unlike a software repo or journal article, what constitutes “hardware” can be complex - it might include the hardware itself, digital design files, documentation/instructions, and software. All of these may be packaged differently in different places.

Each of these elements may also have a different relationship with intellectual property protections, especially copyright. We have some mature open hardware licenses, but they are relatively recent and even they embody questions and gray areas related to what they do and do not control when it comes to any specific piece of hardware.

My comment suggests that open ML models are much more like open hardware than open software. The community does not really have a consensus definition of what “open” even means in the context of AI models (freely available weights? code? training data?), let alone how (and if) those elements might need to be licensed.

In light of this, it would be unwise to build a definition of open foundational models that would allow “just use this license” to be an easy way to comply. There might be a day when consensus definitions are established and licenses are mature. Until then, any definition of open should require a more complex analysis than simply looking at a license.

Header image: Physical Examination from the Smithsonian’s National Portrait Gallery