Clearing Rights for a 'Non-Infringing' Collection of AI Training Media is Hard

In response to a number of copyright lawsuits about AI training datasets, we are starting to see efforts to build ‘non-infringing’ collections of media for training AI. While I continue to believe that most AI training is covered by fair use in the US and is therefore inherently ‘non-infringing’, I think these efforts to build ‘safe’ or ‘clean’ (or whatever other word one might use) data sets are quite interesting. One reason they are interesting is that they help illustrate why trying to build such a data set at scale is such a challenge.

That’s why I was excited to read about Source.Plus (via a post from Open Future). Source.Plus is a tool from Spawning that purports to aggregate over 37 million “public domain and CC0 images integrated from dozens of libraries and museums.” That’s far fewer images than are used to train current generative models, but still a lot of images that could be used for all sorts of useful things.

However, it didn’t take too much poking around on the site to find an illustration of why accurately aggregating nominally openly licensed images at scale can be such a challenge.

The site has plenty of OpenGLAM images that are clearly old enough to be in the public domain. It also has a number of newer images (like photographs) that are said to be licensed under CC0. Curious, I clicked on the first photograph I found on the Source.Plus home page:

photograph of a library reading room full of patrons shot from above

According to the image page on Source.Plus, the image was from Wikimedia Commons and licensed under a CC0 public domain dedication. It listed the creator as Pixabay and the uploader (to Wikimedia) as Philipslearning.

Clicking through to the Wikimedia Commons page reveals that the original source for the image was Pixabay, and that it was uploaded on March 9, 2023 by Philipslearning (an account that appears to no longer exist, for whatever that is worth). The file metadata says that the image itself was taken on May 18, 2016.

Clicking through to the Pixabay page for the image reveals that the image is available under the Pixabay Content License. That license is fairly permissive, but does state:

  • You cannot sell or distribute Content (either in digital or physical form) on a Standalone basis. Standalone means where no creative effort has been applied to the Content and it remains in substantially the same form as it exists on our website.
  • If Content contains any recognisable trademarks, logos or brands, you cannot use that Content for commercial purposes in relation to goods and services. In particular, you cannot print that Content on merchandise or other physical products for sale.
  • You cannot use Content in any immoral or illegal way, especially Content which features recognisable people.
  • You cannot use Content in a misleading or deceptive way.
  • You cannot use any of the Content as part of a trade-mark, design-mark, trade-name, business name or service mark.

Which is to say, not CC0.

However, further investigation into the Pixabay Wikipedia page suggests that images uploaded to Pixabay before January 9, 2019 are actually released under CC0. Section 4 of the Pixabay terms confirms that. The additional information on the image’s Pixabay page confirms that it was uploaded on May 17, 2016 (which matches the metadata added by the unknown Philipslearning on the image’s Wikimedia page).

All of which means that this image is, in all likelihood, available under a CC0 public domain dedication. Which is great! Everything was right!

At the same time, the accuracy of that status feels a bit fragile. That fragility is manageable in the context of Wikipedia, or if you are looking for a handful of openly licensed images. Is it likely to hold up at training-set scale, across tens of millions of images? Maybe? What does it mean to be ‘good enough’ in this case? If trainers are required to get permission from rightsholders to train, and they relied on Source.Plus/Wikimedia for the CC0 status of a work, and that status turned out to be incorrect, should the fact that they thought they were using a CC0 image be relevant to their liability?
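To make the scale problem a bit more concrete, here is a minimal sketch of what automating just the first link in that verification chain might look like: asking the Wikimedia Commons API what license a file claims to carry. This is not how Source.Plus actually works (I have no insight into their pipeline), and the file title in the example is a hypothetical stand-in, but the endpoint and metadata fields are the ones Commons exposes.

```python
import requests

# Sketch of a license-provenance check against the Wikimedia Commons API.
# The endpoint and extmetadata fields are real; the file title used below
# is hypothetical, and this is not Source.Plus's actual pipeline.
COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def claimed_license(file_title: str) -> dict:
    """Return the license metadata that Commons reports for one file."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "iiprop": "extmetadata",
        "titles": file_title,
    }
    resp = requests.get(COMMONS_API, params=params, timeout=30)
    resp.raise_for_status()
    page = next(iter(resp.json()["query"]["pages"].values()))
    meta = page["imageinfo"][0]["extmetadata"]
    return {
        "license": meta.get("LicenseShortName", {}).get("value"),
        "usage_terms": meta.get("UsageTerms", {}).get("value"),
        "date_original": meta.get("DateTimeOriginal", {}).get("value"),
        "credit": meta.get("Credit", {}).get("value"),
    }

# Hypothetical file title standing in for the reading room photograph.
info = claimed_license("File:Library_reading_room_from_above.jpg")
if info["license"] != "CC0":
    print("Needs manual review:", info)
else:
    # Even a "CC0" answer only records what the uploader claimed; the
    # Pixabay upload-date question discussed above is invisible here.
    print("Claimed CC0:", info)
```

Even if you ran something like this across all 37 million images, it would only surface the claims recorded on Commons. Catching the Pixabay-style edge cases still means chasing each image back to its original source and its original terms.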

Hero Image: Interieur van de Bodleian Library te Oxford

Make Government-Funded Hardware Open Source by Default

Earlier this year the Federation of American Scientists (FAS), Center for Open Science, and the Wilson Center held an open science policy sprint to source and develop actionable policy ideas aimed at improving scientific transparency, equity, and innovation. Some heroic editing from the FAS team (especially Jordan Dworkin and Grace Wickerson) helped transform “uh, if the government pays for hardware it should be open source” into the actual proposal below. You can see the original version in situ here.

While scientific publications and data are increasingly made publicly accessible, designs and documentation for scientific hardware — another key output of federal funding and driver of innovation — remain largely closed from view. This status quo can lead to redundancy, slowed innovation, and increased costs. Existing standards and certifications for open source hardware provide a framework for bringing the openness of scientific tools in line with that of other research outputs. Doing so would encourage the collective development of research hardware, reduce wasteful parallel creation of basic tools, and simplify the process of reproducing research. The resulting open hardware would be available to the public, researchers, and federal agencies, accelerating the pace of innovation and ensuring that each community receives the full benefit of federally funded research.

Federal grantmakers should establish a default expectation that hardware developed as part of federally supported research be released as open hardware. To retain current incentives for translation and commercialization, grantmakers should design exceptions to this policy for researchers who intend to patent their hardware.

Details

Federal funding plays an important role in setting norms around open access to research. The White House Office of Science and Technology Policy (OSTP)’s recent Memorandum Ensuring Free, Immediate, and Equitable Access to Federally Funded Research makes it clear that open access is a cornerstone of a scientific culture that values collaboration and data sharing. OSTP’s recent report on open access publishing further declares that “[b]road and expeditious sharing of federally funded research is fundamental for accelerating discovery on critical science and policy questions.”

These efforts have been instrumental in providing the public with access to scientific papers and data — two of the foundational outputs of federally funded research. Yet hardware, another key input and output of science and innovation, remains largely hidden from view. To continue the move towards an accessible, collaborative, and efficient scientific enterprise, public access policies should be expanded to include hardware. Specifically, making federally funded hardware open source by default would have a number of specific and immediate benefits:

Reduce Wasteful Reinvention. Researchers are often forced to develop testing and operational hardware that supports their research. In many cases, unbeknownst to those researchers, this hardware has already been developed as part of other projects by other researchers in other labs. However, since that original hardware was not openly documented and licensed, subsequent researchers are not able to learn from and build upon this previous work. The lack of open documentation and licensing is also a barrier to more intentional, collaborative development of standardized testing equipment for research.

Increase Access to Information. As the OSTP memo makes clear, open access to federally funded research allows all Americans to benefit from our collective investment. This broad and expeditious sharing strengthens our ability to be a critical leader and partner on issues of open science around the world. Immediate sharing of research results and data is key to ensuring that benefit. Explicit guidance on sharing the hardware developed as part of that research is the next logical step towards those goals.

Alternative Paths to Recognition. Evaluating a researcher’s impact often includes an assessment of the number of patents they can claim. This is in large part because patents are easy to quantify. However, this focus on patents creates a perverse incentive for researchers to erect barriers to follow-on study even if they have no intention of using patents to commercialize their research. Encouraging researchers to open source the hardware developed as part of their research creates an alternative path to evaluate their impact, especially as those pieces of open source hardware are adopted and improved by others. Uptake of researchers’ open hardware could be included in assessments on par with any patented work. This path recognizes the contribution to a collective research enterprise.

Verifiability. Open access to data and research are important steps towards allowing third parties to verify research conclusions. However, these tools can be limited if the hardware used to generate the data and produce the research is not itself open. Open sourcing hardware simplifies the process of repeating studies under comparable conditions, allowing for third-party validation of important conclusions.

Recommendations

Federal grantmaking agencies should establish a default presumption that recipients of research funds make hardware developed with those funds available on open terms. This policy would apply to hardware built as part of the research process, as well as hardware that is part of the final output. Grantees should be able to opt out of this requirement with regards to hardware that is expected to be patented; such an exception would provide an alternative path for researchers to share their work without undermining existing patent-based development pathways.

To establish this policy, OSTP should conduct a study and produce a report on the current state of federally funded scientific hardware and opportunities for open source hardware policy.

  • As part of the study, OSTP should coordinate and convene stakeholders to discuss and align on policy implementation details — including relevant researchers, funding agencies, U.S. Patent and Trademark Office officials, and leaders from university tech transfer offices.

  • The report should provide a detailed and widely applicable definition of open source hardware, drawing on definitions established in the community — in particular, the definition maintained by the Open Source Hardware Association, which has been in use for over a decade and is based on the widely recognized definition of open source software maintained by the Open Source Initiative.

  • It should also lay out a broadly acceptable policy approach for encouraging open source by default, and provide guidance to agencies on implementation. The policy framework should include recommendations for:

    • Minimally burdensome components of the grant application and progress report with which to capture relevant information regarding hardware and to ensure planning and compliance for making outputs open source
    • A clear and well-defined opportunity for researchers to opt out of this mandate when they intend to patent their hardware

The Office of Management and Budget (OMB) should issue a memorandum establishing a policy on open source hardware in federal research funding. The memorandum should include:

  • The rationale for encouraging open source hardware by default in federally funded scientific research, drawing on the motivation of public access policies for publications and data

  • A finalized definition of open source hardware to be used by agencies in policy implementation

  • The incorporation of OMB’s Open Source Scientific Hardware Policy, in alignment with the OSTP report and recommendations

Conclusion

The U.S. government and taxpayers are already paying to develop hardware created as part of research grants. In fact, because there is not currently an obligation to make that hardware openly available, the federal government and taxpayers are likely paying to develop identical hardware over and over again.

Grantees have already proven that existing open publication and open data obligations promote research and innovation without unduly restricting important research activities. Expanding these obligations to include the hardware developed under these grants is the natural next step.

Hero image: Crop of Andrew Carnegie, Smithsonian Open Access Collection

Licenses are Not Proxies for Openness in AI Models

Earlier this year, the National Telecommunications and Information Administration (NTIA) requested comment on a number of questions related to what it is calling “open foundational models.” This represents the US Government starting to think about what “open” means in the context of AI and machine learning.

The definition of open in the context of AI and machine learning is more complicated than it is in software, and I assume that many people are going to submit many interesting comments as part of the docket.

I also submitted a short comment. It focused on a comparatively narrow issue: whether or not it makes sense to use licenses as an easy way to test for openness in the context of AI models. I argued that it does not, at least not right now.

There are many situations where licenses are used as proxies for “open”. A funder might require all software to be released under an OSI-approved open source software license, or that a journal article be released under a Creative Commons license. In these cases, using the license is essentially an easy way to confirm that the thing being released really is open.

At a basic level, these systems work because of two things: 1) the thing being licensed is relatively discrete, and 2) the licenses used are mature and widely adopted within the community.

Open source hardware acts as a helpful contrast to these other examples. Unlike a software repo or journal article, what constitutes “hardware” can be complex - it might include the hardware itself, digital design files, documentation/instructions, and software. All of these may be packaged differently in different places.

Each of these elements may also have a different relationship with intellectual property protections, especially copyright. We have some mature open hardware licenses, but they are relatively recent and even they embody questions and gray areas related to what they do and do not control when it comes to any specific piece of hardware.

My comment suggests that open ML models are much more like open hardware than open software. The community does not really have a consensus definition of what “open” even means in the context of AI models (freely available weights? code? training data?), let alone how (and if) those elements might need to be licensed.

In light of this, it would be unwise to build a definition of open foundational models that would allow “just use this license” to be an easy way to comply. There might be a day when consensus definitions are established and licenses are mature. Until then, any definition of open should require a more complex analysis than simply looking at a license.

Header image: Physical Examination from the Smithsonian’s National Portrait Gallery

Carlin AI Lawsuit Against 'Impression with Computer'

The brewing dispute over a (purportedly - more on that below) AI-generated George Carlin standup special is starting to feel like another step in the long tradition of rightsholders claiming that normal activity needs their permission when done with computers.

Computers operate by making copies, and copies are controlled by copyright law. As a result, more or less since the dawn of popular computing, rightsholders have attempted to use those copies to extend their control by eliminating the rights of others.

While you can read a physical book without caring what the publisher thinks, publishers insist that reading an ebook with computer needs a license because ereader software loads the file by making copies. Similarly, although record labels can’t control the sale of used records or CDs, they managed to sue the concept of selling used music with computer out of existence.

In this case, although impressions have probably existed since there was more than one human, the Carlin estate appears to be claiming that “impression with computer” needs special permission from them.

The Carlin Video

The subject of this dispute is a video of a computer-generated George Carlin avatar doing an hour of new comedy in the style of Carlin. As framed by Dudesy (the comedy team behind the video), in order to create the new content Carlin was “resurrected by an AI to create more material.” (This “resurrection” was necessary because Carlin died in 2008.)

While that framing may turn out to be inaccurate (although arguably artistically important to the purpose of the new work), the release of the video kicked off a week of “AI is coming for us” coverage of various flavors, followed by a lawsuit from the Carlin estate.

Is This New?

While the AI packaging clearly drove discussion around the video, if you step back for a minute it really is just an impression of Carlin. This impression uses computers, but I’m not convinced that changes (or should change) the fundamental reality of the activity. Generally speaking, people don’t get veto rights over their impersonators.

Furthermore, if an intriguing article by Kyle Orland in Ars Technica is correct, the video may not even be “basically just an impression.” It may simply be “an impression.”

Orland’s take has subsequently been confirmed by the Dudesy team to the New York Times (“‘It’s a fictional podcast character created by two human beings, Will Sasso and Chad Kultgen,’ Del wrote in an email. ‘The YouTube video ‘I’m Glad I’m Dead’ was completely written by Chad Kultgen.’”), although the same article reports that the Carlin estate continues to be skeptical of the video’s origin (which makes sense because, somewhat ridiculously, the viability of their entire claim may turn on the distinction).

Orland digs into the Dudesy team and their podcast to provide facially compelling evidence that the content of the special is just a regular old impersonation of Carlin. He even pulls what he describes as an “if I did it” quote from the Dudesy podcast episode that accompanied the video, describing how the jokes “could” have been created without sophisticated AI:

Clearly, Dudesy made this, but anyone could have made it with technology that is readily available to every person on planet Earth right now.

If you wanted to make something like this, this is what you would do: You would start by going and watching all of George Carlin’s specials, listening to all of his albums, watching all of his interviews, any piece of material that George Carlin has ever made. You would ingest that. You would take meticulous notes, probably putting them in a Google spreadsheet so that you can keep track of all the subjects he liked to talk about, what his attitudes about those subjects were, the relevance of them in all of his stand-up specials.

You would then take all of his stand-up specials and do an average word count to see just how long they are. You would then take all that information and write a brand new special hitting that average word count. You would then take that script and upload it into any number of AI voice generators.

You would then get yourself a subscription to Midjourney or ChatGPT to make all the images in that video, and then you would string them together into a long timeline, output that video, put it on YouTube. I’m telling you, anyone could have made this. I could have made this.

When framed this way, the whole thing starts to feel like a fairly vanilla impression. Which is how the video presents itself, opening with a disclaimer that “what you’re about to hear is not George Carlin” and going on to compare itself to an impersonation “like Will Ferrell impersonating George W. Bush.”

Does the fact that the video includes a representation that is visually similar to Carlin change that analysis? It doesn’t in the physical world. As Brandon Butler quipped on Mastodon, “Nobody tell the folks freaking out over the George Carlin special about the Hal Holbrook Twain show.”

But the Carlin avatar in the video isn’t just someone dressed up like George Carlin! It is an animated version of him!

There’s nothing new about animated impressions either. This random wiki lists 317 celebrity caricatures from Looney Tunes and Merrie Melodies cartoons. The 1941 short Hollywood Steps Out alone contains dozens.

All of which is to say, doing an impression of someone is not new and does not require that person’s permission. Should doing the impression with a computer upend that?

The Carlin Estate Lawsuit

The Carlin estate lawsuit includes a lot of rhetoric against the video (“Defendants must be held accountable for adding new, fake content to the canon of work associated with Carlin without his permission (or that of his estate).”) and claims of harm from Carlin’s daughter Kelly (“My dad spent a lifetime perfecting his craft from his very human life, brain, and imagination. No machine will ever replicate his genius.”) that could be applied just as easily to any Carlin impression.

The suit also includes claims of violations of California’s Right of Publicity statutes, as well as copyright infringement.

While I think the discussions around copyright, AI training, and AI output are super interesting, I’m not going to dig into them in this post. For the purposes of this post, the important thing is that the lawsuit contains the copyright claim at all.

Impression with Computer

The copyright angle only exists because a computer was (might have been?) involved in creating the new routines. If the new routine was created using Dudesy’s “if I did it” method: “watching all of George Carlin’s specials, listening to all of his albums, watching all of his interviews, any piece of material that George Carlin has ever made,” the Carlin estate would not have any copyright claim to bring, because thinking about things you have read and watched is not an activity that rightsholders traditionally get to control.

But because this impression does (may?) use computers, it becomes another example of a rightsholder trying to turn “they used a computer” into “I get to control this activity.” If you are someone (like me) who has traditionally been wary of these arguments, this seems like an important time to maintain that skepticism. Even in discussions related to AI.

Header image: Samuel L. Clemens (Mark Twain) from the Smithsonian’s National Portrait Gallery

How Explaining Copyright Broke the Spotify Copyright System

This post originally appeared on the Engelberg Center blog.

This is a story of how Spotify’s sophisticated copyright filter prevented us from explaining copyright law.

It is strikingly similar to the story of how a different sophisticated copyright filter (YouTube’s) prevented us from explaining copyright law just a few years ago.

In fact, both incidents relate to recordings of the exact same event - a discussion between expert musicologists about how to analyze songs involved in copyright infringement litigation. Together, these incidents illustrate how automated copyright filters can limit the distribution of non-infringing expression. They also highlight how little effort platforms devote to helping people unjustly caught in these filters.

The Original Event

This story starts with a panel discussion at the Engelberg Center’s Proving IP Symposium in 2019. That panel featured presentations and discussions by Judith Finell and Sandy Wilbur. Ms. Finell and Ms. Wilbur were the musicologist experts for the opposing parties in the high-profile Blurred Lines copyright infringement case. In that case the estate of Marvin Gaye accused Robin Thicke and Pharrell Williams of infringing on Gaye’s song “Got to Give It Up” when they wrote the hit song “Blurred Lines.”

The primary purpose of the panel was to have these two musical experts explain to the largely legal audience how they analyze and explain songs in copyright litigation. The panel opened with each expert giving a presentation about how they approach song analysis. These presentations included short clips of songs, both in their popular recorded version and versions stripped down to focus on specific musical elements.

The YouTube Takedown

After the event, we posted a video of the panel on YouTube and the audio of the panel in our Engelberg Center Live! podcast feed. The podcast is distributed on a number of platforms, including Spotify. Shortly after we posted the video, Universal Music Group (UMG) used YouTube’s ContentID system to take it down. This kicked off a review process that ultimately required personal intervention from YouTube’s legal team to resolve. You can read about what happened here.

The Spotify Takedown

A few months ago, years after we posted the audio to our podcast feed, UMG appears to have used a similar system to remove our episode from Spotify. On September 15, we received an email alerting us that our podcast had been flagged because it included third-party content (recall that this content consists of clips of the songs the experts were discussing in their infringement analysis).

screenshot from the Spotify alert page with the headline "We found some third-party content in your podcast"

Using the Spotify review tool, we indicated that our use of the song was protected by fair use and did not need permission from the rightsholder.

screenshot from the Spotify alert page with the headline "We found some third-party content in your podcast" and information about challenging the accusation of infringement

We received a confirmation that our review had been submitted and hoped that would be the end of it.

screenshot from the Spotify alert page with the headline "Thank you for submitting this episode"

The Escalation

That was not the end of it. On October 12th, we received an email from Spotify informing us that they were removing our episode because it used unlicensed music and we had not responded to their inquiry.

screenshot from the Spotify alert email informing us that the episode has been removed from the service

The first part was true - we had not obtained a license to use the music. This is because our use is protected by fair use and we are not legally required to do so. The second part was not true - we had immediately responded to Spotify’s original inquiry. We immediately responded to this new message, noting that we had responded to their initial message, and asking if they needed anything additional from us.

Spotify Tries to Step Away

Four days later, Spotify responded by indicating that this was now our problem:

The content will remain taken down from the service until the provider reaches a resolution with the claimant. Both parties should inform us once they reach a resolution. We will make the content live upon the receipt of instructions from both parties and any necessary updates. If they cannot reach a resolution, we reserve the right to act at our discretion. The email address we have for the claimant is [redacted].

This is probably where most users would have given up (if they had not dropped off well before). However, since we are the center at NYU Law that focuses on things like online copyright disputes, we decided to push forward. In order to do that, we needed more information. Specifically, we needed the original notice submitted by UMG.

Why the Nature of the Notice is Relevant

We needed the original notice from UMG because our next step turned on the actual form it took.

Many people are familiar with the broad outlines of the notice and takedown regime that governs online platforms. Takedown actions initiated by rightsholders are sometimes called “DMCA notices” because a law called the Digital Millennium Copyright Act (or DMCA for short) created the process. While most of the rules are oriented towards helping rightsholders take things off the internet, there is a small provision - Section 512(f) - that can impose damages on a rightsholder who misrepresents that the targeted material is infringing (this provision was famously litigated in the “Dancing Baby” case).

In other words, the DMCA includes a provision that can be used to punish rightsholders who send baseless takedown requests.

We feel that the use of the song clips in our podcast is an exceptionally clear example of the type of use protected under fair use. As a result, if UMG ignored the likelihood that our use was protected by fair use when it filed an official DMCA notice against our podcast, we could be in a position to bring a 512(f) claim against them.

However, not all takedown notices are official DMCA notices. Many large platforms have established parallel, private systems that allow rightsholders to remove content without going through the formal DMCA process. These systems rarely punish rightsholders for overclaiming their rights. If UMG did not use an official DMCA notice to take down our content, we could not bring a 512(f) claim against them.

As a result, our options for pushing back on UMG’s claims were very different depending on the specific form of the takedown request. If UMG used an official DMCA notice, we might be able to use a different part of the DMCA to bring a claim against them. If UMG used an informal process created by Spotify, we might not have any options at all. That is why we asked Spotify to send us the original notice.

Spotify Ignores Our Request for Information

On October 12th, Spotify told us that in order to have our podcast episode reinstated we would need to work things out with UMG directly. That same day, we asked for UMG’s actual takedown notice so we could do just that.

We did not hear anything back. So we asked again on October 23rd.

And on October 26th.

And on October 31st.

On November 7th — 26 days after our episode was removed from the service — we asked again. This time, we sent our email to the same infringement-claim-response@ email address we had been attempting to correspond with the entire time, and added legal@. On November 9th, we finally received a response.

Spotify Asks Questions

Spotify’s email stated that our episode was “not yet subject to a legal claim,” and that if we wanted to reinstate our episode we needed to reply with:

  • An explanation of why we had the right to post the content, and
  • A written statement that we had a good faith belief that the episode was removed or disabled as a result of mistake or misidentification

This second element is noteworthy because it matches the language in Section 512(f) mentioned above.

We responded with a detailed explanation of the nature of the episode and the use of the clips, asserting that the material in question is protected by fair use and was removed or disabled as a result of a mistake (describing the removal as a “mistake” is fairly generous to UMG, but we decided to use the options Spotify presented to us).

Our response ended with another request for more information about the nature of the takedown notice itself. That request specifically asked if the notice was a formal notice under the DMCA, and explained that we were asking because we were considering our options under 512(f).

Clarity from Spotify

Spotify quickly replied that the episode would be eligible for reinstatement. In response to our question about the notice, they repeated that “no legal claim has been made by any third-party against your podcast.” “No legal claim” felt a bit vague, so we responded once again with a request for clarification about the nature of the complaint. The next day we finally received a straightforward answer to our question: “The rightsholder did not file a formal DMCA complaint.”

Takeaway

What did we learn from this process?

First, that Spotify has set up an extra-legal system that allows rightsholders to remove podcast episodes. This system does a very bad job of evaluating possible fair uses of songs, which probably means it removes episodes that make legitimate use of third party content. We are not aware of any penalties for rightsholders who target fair uses for removal, and the system does not provide us with a way to pursue penalties ourselves.

Second, like our experience with YouTube, it highlights how challenging it can be for regular users to dispute allegations of infringement by large rightsholders. Spotify lost our original response to the takedown request, and then ignored multiple emails over multiple weeks attempting to resolve the situation. During this time, our episode was not available on their platform. The Engelberg Center had an extraordinarily high level of interest in pursuing this issue, and legal confidence in our position that would have cost an average podcaster tens of thousands of dollars to develop. That cannot be what is required to challenge the removal of a podcast episode.

Third, it highlights the weakness of what may be an automated content matching system. These systems can only determine if an episode includes a clip from a song in their database. They cannot determine if the use requires permission from a rightsholder. If a platform is going to deploy these types of systems at scale, they should have an obligation to support a non-automated process of challenging their assessment when they incorrectly identify a use as infringing.

We do appreciate that the episode has finally been restored. You can listen to it yourself here, along with audio from all of the Engelberg Center’s events on our Engelberg Center Live! feed, wherever you get your podcasts (including, at least as of this writing, on Spotify). That feed also includes a special season on the unionization of Kickstarter, and on the Knowing Machines project’s exploration of the datasets used to train AI models.