New Open GLAM Toolkit & Open GLAM Survey from the GLAM-E Lab

This post originally appeared on the Engelberg Center blog

Today the GLAM-E Lab, a collaborative project between the Engelberg Center and the University of Exeter (UK), is releasing a number of tools and resources for the open GLAM (Galleries, Libraries, Archives, and Museums) community.

First, the GLAM-E Lab has launched an Open GLAM Toolkit! This suite of tools, developed directly with GLAM organizations, can be used by any cultural organization to develop its own access program and release collections for public reuse. The toolkit even includes model internal and external open access policies that can serve as templates for new workflows and website policies.

Second, today the GLAM-E Lab has also launched a website-based version of the Open GLAM Survey. The Survey’s new format makes it much easier to find, explore, and analyze open GLAM organizations around the world than was previously possible via the Google Spreadsheet format.

Third, both of these are only possible because of our collaborators’ engagement. The GLAM-E Lab model is to work directly with GLAM organizations to remove legal barriers to creating open access programs, and convert that work into the standard toolkits that other organizations can use. We set a goal to work with 24 different GLAM organizations by the end of 2024, and we’ve even exceeded that goal!

Finally, all of this work led to the GLAM-E Lab winning Wikimedia UK’s Partnership of the Year Award for 2024!

You can watch our announcement video on YouTube and find more details below on these announcements. Of course, if you or someone else would be interested in working with us in 2025, please let us know!

Open GLAM Toolkit

The Open GLAM Toolkit is built on everything that we have learned from working with GLAM-E Lab collaborators. Used together, the toolkit resources will help cultural organizations identify, prepare, and publish their digital collections for open access using public domain or other machine-readable statements.

Open GLAM Survey 2.0

Version 2.0 of the Open GLAM Survey brings the Survey to a new, more user-friendly interface. You can sort organizations by type, license, and the platforms they use. The new interface also makes it easier for us to expand the survey and keep its data up to date.

We’ve Collaborated with More than 24 Organizations!

The GLAM-E Lab model is simple: work directly with individual organizations to remove legal barriers to open access programs, and turn what we learn during that work into standard tools and documents that organizations of any size can use.

Of course, all of this depends on having organizations that are open to tackling collections management issues with us in the first place. That’s why we are so excited to wrap up 2024 having worked with over 24 organizations on rights-related issues and open access questions. You can find the list of collaborators on the GLAM-E site.

hero image: Gereedschappen voor het vervaardigen van een mezzotint from the Rijksmuseum collection.

What Does an Open Source Hardware Company Owe The Community When it Walks Away?

This week Prusa Research, once one of the most prominent commercial members of the open source hardware community, announced its latest 3D printer. The printer is decidedly not open source.

That’s fine? My support of, and interest in, open source hardware is not religious. I think open source hardware can be an incredibly effective tool to achieve a number of goals. But no tool is fit for all purposes. If circumstances change, and open source hardware no longer makes sense, people and companies should be allowed to change their strategies as long as they are clear that is what they are doing. Hackaday does a good job of covering the Prusa-specific developments, and Phil has covered other examples (I hesitate to call it a ‘larger trend’ because I don’t think that’s quite right) on Adafruit.

Still, I do believe a company that builds itself on open hardware owes the community an honest reckoning as it walks out the door. Call it one last blast of openness for old time’s sake.

Specifically, I think the company should explain why openness does not work for them anymore. And not just by waving their hands while chanting vaguely about unfair copying or cloning. They should seriously engage with the issue, explaining how their approach was designed, what challenges it faced, and why open strategies were not up to the task of overcoming those challenges.

This discussion and disclosure is not a punishment for walking away from open, or an opportunity for the community to get a few last licks in. Instead, it is about giving the community more information because that information might be useful to it. Open source hardware is about learning from each other, and how to run an open hardware business is just as important a lesson as how to create an open hardware PCB.

What Could This Look Like?

Last year Průša (the person) raised concerns about the state of open source hardware, framing his post as kicking off a “discussion.” Members of the community took that invitation seriously. I responded with a series of clarifying questions and comments. So did my OSHWA co-board member Thea Flowers, and Phil at Adafruit. Průša is under no obligation to respond to any one of these (me yelling “debate me!” on the internet does not create an obligation for him to actually respond).

However, kicking off a self-styled discussion, having a bunch of people respond, and then doing . . . nothing does not feel like the most good faith approach to exploring these questions. None of the questions in the response posts were particularly aggressive or merely rhetorical - they were mostly calls for more clarity and specificity in order to inform a more thoughtful discussion.

Without that clarity, we are stuck in a vague space that does not really help anyone understand things better. As the Hackaday article astutely points out:

The company line is that releasing the source for their printers allows competitors to churn out cheap clones of their hardware — but where are they?

Let’s be honest, Bambu didn’t need to copy any of Prusa’s hardware to take their lunch money. You can only protect your edge in the market if you’re ahead of the game to begin with, and if anything, Prusa is currently playing catch-up to the rest of the industry that has moved on to faster designs. The only thing Prusa produces that their competitors are actually able to take advantage of is their slicer, but that’s another story entirely. (And of course, it is still open source, and widely forked.)

If moving from open to closed prevents cheap clones, how does that actually work? That would be useful information to the entire open source hardware community! If it does not prevent cheap clones, why use that as a pretext? Also, useful information to the community!

Feature image: Political Discussion in a Lumber Shanty from the Smithsonian Open Access collection

Keep 3D Printers Unlocked (the win! 2023)

Last summer I submitted a request that the Copyright Office renew an existing rule that allows users to break DRM that prevents them from using materials of their choice in 3D printers. As of October 28th, that rule has been renewed for another three years.

This is good news! Copyright law should not allow 3D printing manufacturers to force users to only use approved materials. Wins in copyright policy world are rare, so let’s celebrate one when it comes.

This request was part of a larger every-three-year process involving dozens of requests to allow people to do things that regular copyright law does not prohibit, but that are blocked by a special provision of copyright law prohibiting the circumvention of digital locks (even for otherwise legal purposes!).

This time around, that larger process was a bit of a mixed bag. The good news is that many of the existing exemptions were renewed. The less good news is that some of the new exemptions were approved on highly restrictive terms, making them much less useful.

What happens now? 3D print with whatever material you want, free of fear of a copyright lawsuit (at least over using unapproved material in your printer - what you print can still get you into trouble). Three years from now, we’ll do this dance yet again.

Licensing Deals Between AI Companies and Large Publishers are Probably Bad

Licensing deals between AI companies and large publishers may be bad for pretty much everyone, especially everyone who does not directly receive a check from them.

Although the initial copyright lawsuits from large content companies like Getty Images and music labels are still very much ongoing (with new ones being filed regularly), recently we’ve also seen a series of licensing deals between large content owners and AI companies.

Setting aside the wisdom of the deal for any individual content company, I worry that these licensing deals represent a bad outcome for just about everyone else. Most of the companies entering into these agreements represent a relatively large amount of cultural power (that can be leveraged into public pressure), and a relatively small corpus of works (relative to the number of works required to train a model), backed up with enough legal power to qualify as a plausible threat to an AI company. That puts them in a position to demand compensation that is out of proportion to their actual contribution to any given model.

The deals that flow from this dynamic allow a small number of companies to claim a disproportionate amount of compensation for their relatively modest contributions to a training dataset. In doing so, the licenses establish a precedent that may undermine the fair use defense for unlicensed training of models, making it harder for smaller competitors to enter the AI market.

This might be a positive development if these deals also increased the likelihood that everyone who created data used to train models would receive significant compensation.* However, these deals likely marginally decrease the likelihood of that outcome by allowing the media companies signing these deals to soak up most of the available licensing dollars before the vast majority of people and companies who created data in the training datasets are involved. The most likely outcome could be one similar to Spotify, where large record labels and a handful of high-profile artists receive significant compensation, while everyone else receives fractions of pennies (or no pennies).

Licensing Dollar Roll Up

It is easy for anyone who wants to be paid a licensing fee by AI model trainers to see these deals as a positive development. They may set a precedent that data must be licensed, and a market rate for data that applies to everyone else.

However, at this stage there does not appear to be any reason to see these deals as setting a standard for anything other than large (or large-ish) media companies and rightsholders. These deals do not set benchmarks for independent artists, or for anyone without the existing cultural and legal clout to demand them. After all, the terms of these deals aren’t even public.

It may be better to understand these deals as the large media companies and rightsholders jumping to the front of the line in order to soak up as much available licensing money as possible. Their incentive is to maximize the percentage of the licensing pool that they receive - not to set a standard on behalf of everyone else, or to grow the pie for others. In fact, every dollar of value that someone outside of the deal can claim is a dollar the large media companies cannot include in their own deal with the AI companies.

The result is that the large media companies leverage “creators should be paid” rhetoric to roll up all of the available licensing dollars, while making it marginally harder for anyone else to be paid for being part of the training data.

Which seems bad! As a bonus, these deals may undermine the fair use defense that allows the models to be created in the first place.

Blocking Competition

The copyright lawsuits over data used to train models all turn on whether or not the training is covered by fair use. If the act of training models on data is fair use, the trainers do not need permission from the data rightsholders (I think this is both the better reading of the law and the better policy outcome). If the act of training is not fair use, the trainers will need permission from every rightsholder of every bit of data they use to train their models.

Determining fair use involves applying a four-factor test, one factor of which is the effect of the use on the potential market for the data. I’m confident that the AI company lawyers are crafting these agreements with an eye towards avoiding establishing a market for AI training data (available public information on the deals suggests that they are framed in terms of making it easier to access the data through APIs or other bulk data transfers, not a license to the data itself). Nonetheless, the existence of these deals probably does marginally increase the likelihood that courts would decide that there is a functioning market for licensing training data.

If that were the case, and courts found that the majority of the other fair use factors pushed against a finding of fair use, that would mean that only companies with enough money to license training data at scale could train new AI models. I think this would probably be a bad policy outcome because it could effectively block new market entrants in AI. And working out the licensing process would be somewhere between complicated and impossible.

All of which makes these deals bad for pretty much everyone. They are bad for any creators who are not being directly paid by them, bad for anyone who would welcome new competition in AI, and bad for anyone who generally thinks that non-consumptive uses of information on the internet should be protected by fair use.

*I currently believe that compensating everyone who created data used to train models is a bad idea, but I understand why it is an attractive option to many people.

Hero Image: A nun frightened by a ghost playing a guitar; page 65 from the “Images of Spain” Album (F)

Is There A Coherent Theory of Attributing AI Training Data?

It feels like any time I have a conversation about attributing data used to train AI models, the completely understandable impulse to want attribution starts to break when confronted with some practical implementation questions.

This is especially true in the context of training data that comes from open communities. These communities rely on some sort of open license that requires attribution (say, CC BY or MIT), and have been built on a set of norms that place a high value on attribution. Regardless of whether or not complying with the license is legally required in these scenarios, many members of the community view attribution as a key element of its social contract.

Since attribution is already solving a number of problems in these communities, it is not hard to imagine a situation where attribution could solve some additional social and political problems related to the growth of these models. These problems tend to be most acute around LLMs and generative AI, but are also relevant to a broader set of AI/ML models.

This post is my attempt to describe some of the practical implementation issues that come to mind when talking about attribution and AI training datasets. It is not intended to be a list of things that advocates for attribution “must fix” before attribution makes sense, or a list of reasons why attribution is impossible. It also is not intended to advocate for or against the value or legal necessity of attribution for AI training data.

Instead, it is a list of things that I would like to figure out before being convinced that attribution is something worth pursuing. At the end, it also flags one lesson from existing open source software licensing that makes me somewhat more skeptical of this approach.

The Simple Model

Let’s start with the simple model of how large, foundational models are trained.

If you wanted to train one of these models, you might start with something like Common Crawl, a repository of 2.7 billion web pages, or LAION-400-Million, a collection of 400 million images paired with their captions. You could throw these datasets into the pot with millions of dollars of computational power, plus some time, and get your very own AI model (again, this is the simple model of how all of this works).

You know all of the data and datasets you used to train the model, so when you release the model you include a readme.txt file that has 3 billion or so entries listing all of the sources you used. Problem solved?
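To make that readme.txt idea concrete, here is a minimal sketch of how such a manifest might be generated. Everything about it is an assumption for illustration: the training_records.csv input, its url, creator, and license columns, and the tab-separated output are all hypothetical, since real datasets like Common Crawl or LAION expose different metadata.

```python
# Minimal sketch (assumptions throughout): read a hypothetical CSV of training
# records and write one attribution line per record into readme.txt.
import csv

def write_attribution_manifest(records_path: str, out_path: str) -> int:
    """Write one tab-separated attribution line per training record; return the count."""
    count = 0
    with open(records_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as out:
        for row in csv.DictReader(src):
            out.write(f"{row['url']}\t{row.get('creator', 'unknown')}\t{row.get('license', 'unspecified')}\n")
            count += 1
    return count

if __name__ == "__main__":
    total = write_attribution_manifest("training_records.csv", "readme.txt")
    print(f"Listed {total} training sources.")
```

Run at web scale, that loop is all it takes to produce the 3-billion-line file the rest of this post worries about.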

What Works About This Approach

You have given attribution! Each bit of information you used to train the model is right there in the list, alongside the 3 billion other things that went into the pot.

What Doesn’t Work About This Approach

Maybe this solves the problem? Or maybe not?

Everything maps to everything

One problem with this approach is that an undifferentiated list of all of the training data might not be the kind of attribution people are looking for. On some level, if you are training a model from scratch, every data point contributes equally to the creation of that model and to every output of the model.

However, it is also easy to come to an intuitive belief that some data might be more important than others when it comes to a specific output (there are also more quantitative ways one might come to this conclusion). Also, what if it is possible to have a model unlearn a specific bit of data and then continue to perform the same way it performed before unlearning that bit of data? If your model starts generating poetry, the training data that is just a list of calculus problems and solutions might feel less important than pages of poems.

Does that mean that your attribution should prioritize some data points over others when it comes to specific outputs? How could you begin to rank this? Is there a threshold below which something shouldn’t be listed at all? Is there a point where some training data should get so much credit that it gets some sort of special recognition? Would you make this evaluation on a per-model basis, or on a per-output basis?
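One way to make the intuition above concrete is to score each training example against a specific output and only credit those above some cutoff. The sketch below is illustrative only: it assumes you already have embedding vectors for the output and for the training records, uses cosine similarity as a crude stand-in for a real attribution technique such as influence functions, and picks an arbitrary threshold.

```python
# Illustrative sketch: rank training examples by cosine similarity to one
# output embedding. Real per-output attribution (e.g. influence functions)
# is far more involved; the embeddings and threshold here are assumptions.
import numpy as np

def rank_attributions(output_vec, training_vecs, labels, threshold=0.5):
    """Return (label, score) pairs at or above the threshold, highest first."""
    out = output_vec / np.linalg.norm(output_vec)
    train = training_vecs / np.linalg.norm(training_vecs, axis=1, keepdims=True)
    scores = train @ out  # cosine similarity of each training vector to the output
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return [(label, float(score)) for label, score in ranked if score >= threshold]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(5, 16))  # five fake training embeddings
    labels = [f"doc-{i}" for i in range(5)]
    print(rank_attributions(vecs[0], vecs, labels))  # doc-0 scores 1.0 against itself
```

Every design choice in that sketch - the scoring method, the threshold, whether to compute it per output or per model - is exactly the kind of question the paragraph above leaves open.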

Is this meaningful attribution?

Being listed as one of 3 billion data points used to train a model is attribution in a literal sense, but is it attribution in a meaningful sense? Should that distinction matter?

Creative Commons licenses require that a reuser give “appropriate credit,” which is defined mostly in terms of the information the credit contains, not the form it takes. The CC Wiki contains further best practices for attribution, while being upfront that “Because each use case is different, you can decide what form of attribution is most suitable for your specific situation.”
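As a point of comparison, a per-work credit that tried to carry the kind of information CC describes might look less like one bare line and more like a small structured record. This schema is entirely hypothetical; nothing about it is prescribed by the licenses.

```python
# Hypothetical per-work credit record carrying the kind of information CC
# licenses describe as "appropriate credit". The schema is invented for
# illustration and is not prescribed by any license.
from dataclasses import dataclass, asdict
import json

@dataclass
class CreditRecord:
    title: str
    creator: str
    source_url: str
    license: str
    license_url: str
    modifications: str  # e.g. changes made when preparing the work for training

record = CreditRecord(
    title="Example Photograph",
    creator="Jane Doe",
    source_url="https://example.org/photo",
    license="CC BY 4.0",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    modifications="resized and captioned for training",
)
print(json.dumps(asdict(record), indent=2))
```

Richer records carry more of the information CC cares about, but they do not by themselves make the credit feel meaningful at the scale of billions of works.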

Regardless of whether or not the 3 billion line readme.txt file is legally compliant with a CC license, there may be a significant number of creators who feel that it does not meaningfully address their wishes. Of course, it will also always be impossible to address the wishes of all 3 billion creators anyway. To the extent that attribution is a social/political solution instead of a legal solution, not addressing the wishes of some critical mass of those creators will significantly reduce its utility. Open Future’s Alignment Assembly on AI and the Commons is interesting to consider in this context. Regardless, if listing people in a 3 billion line file does not meet the expectations of a critical mass of people, is it worth imposing as an expectation?

There is a similar, yet also somewhat distinct, version of this question when it comes to open source software. Under most open source software licenses, attribution means including specific text in a file bundled with the code. However, as discussed more at the end of this post, even that can be of questionable utility at scale.

Is this just a roadmap for lawsuits?

Is disclosing everything you used to train your model just a roadmap for lawsuits? This question is a bit harder to answer. IF training models requires the permission of the rightsholder AND the 3-billion-entry readme.txt does not meet a permissive license’s attribution requirements, then training on data that requires attribution without providing that attribution is infringement. But that’s a big IF.

If it is fair use to train AI models on unlicensed data, listing the training data won’t be a roadmap for lawsuits because creators don’t have a copyright claim to bring against trainers.

Also, current experience suggests that lawsuits aren’t waiting for this particular roadmap anyway.

That means that the roadmap question may be best answered by examining the underlying fair use question, not by weighing the value of attribution. And the fair use fate of AI training probably matters much more to that question than attribution compliance does. So I’m going to call this concern both worth flagging and out of scope for this post.

The More Complicated Models

The simple model is just that - a simplified way of thinking about model training. Some of the things that happen in real-world training further complicate things (for a great illustration of many of these relationships, check out Christo Buschek & Jer Thorp’s Models All The Way Down).

Models Trained by Models

As these models are trained on large datasets, it probably should not come as a surprise that trainers are using other AI models to structure, parse, and prepare datasets before they are used to train the model. These structuring, parsing, and preparing models are contributing to the final output model, at least in some ways.

Should the attribution for the final model include all of the data used to train those models as well? Should it differentiate between data used to train the “helper” models and the primary models? How recursively should that obligation extend? What should it mean if the users of the trainer models do not have access to the data used to train those models?
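As a thought experiment, here is one sketch of what “recursive” attribution might mean in practice: walk a lineage graph of models and collect the declared data sources of every helper model along the way. The ModelCard structure and every field on it are invented for illustration; real model metadata varies widely and is often incomplete.

```python
# Hypothetical sketch: collect declared data sources across a model lineage,
# including "helper" models used to prepare the training data. The ModelCard
# structure is invented for illustration.
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    name: str
    data_sources: list
    upstream_models: list = field(default_factory=list)

def collect_sources(model, seen=None):
    """Map each model name in the lineage to its declared data sources."""
    seen = set() if seen is None else seen
    if model.name in seen:
        return {}
    seen.add(model.name)
    result = {model.name: model.data_sources}
    for upstream in model.upstream_models:
        result.update(collect_sources(upstream, seen))
    return result

if __name__ == "__main__":
    caption_filter = ModelCard("caption-filter", ["alt-text-corpus"])
    final_model = ModelCard("image-generator", ["image/caption pairs"],
                            upstream_models=[caption_filter])
    for name, sources in collect_sources(final_model).items():
        print(name, "->", sources)
```

The open questions above map directly onto the sketch: how many levels of upstream_models to walk, whether to label helper-model data differently, and what to record when a model's data_sources is simply unknown.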

Models Tuned by Models

Large, foundational models are now being tuned to do all sorts of specific tasks. This tuning can include building on an open model like Llama, or importing your own dataset to build a custom retrieval-augmented generation (RAG) setup.

In cases like this, should attribution include all of the data from the foundational model, plus all of the customized training data? If someone is doing their own tuning without access to a complete list of the training data, is releasing the list of the data they used enough? Should tuning a model built on attributed data create an obligation to release your own tuning data?
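One way to picture the fine-tuning case is a layered manifest: a pointer to the base model's attribution (which the tuner may not even have) plus whatever data the tuner adds. This is purely a sketch; the field names, model names, and the "unavailable" placeholder are all assumptions.

```python
# Sketch of a layered attribution manifest for a tuned model: a reference to
# the base model's (possibly unavailable) attribution plus the tuner's own
# data. All names and values here are illustrative assumptions.
import json

manifest = {
    "model": "my-fine-tuned-model",
    "base_model": {
        "name": "example-open-foundation-model",
        "training_data_attribution": "unavailable to downstream tuners",
    },
    "tuning_data": [
        {"source": "internal-support-tickets", "license": "proprietary"},
        {"source": "https://example.org/docs", "license": "CC BY 4.0"},
    ],
    "rag_corpus": [
        {"source": "company-knowledge-base", "license": "proprietary"},
    ],
}

print(json.dumps(manifest, indent=2))
```

Each of the questions above corresponds to a decision about which of those layers the tuner is obligated to fill in, and how completely.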

Models Optimized by People Who Built Other Models

To me, this is one of the most conceptually interesting situations, and one that highlights the various ways that “learn” is used in these conversations.

The first two scenarios in this section describe some sort of direct lineage between models. However, there are also less direct, more human-mediated linkages. If a team builds model A, they will bring whatever they learned in that process to building model B. That’s true even if model A isn’t “used” to build model B in some sort of literal sense. Should they still attribute the data they used to gain the human knowledge from building model A in the release notes for model B?

This is not a hypothetical scenario. For example, OpenAI describes a “research path” linking GPT, GPT-2, GPT-3, and GPT-4, explaining that they take what they learn in developing each model and apply it to the next one:

A year ago, we trained GPT-3.5 as a first “test run” of the system. We found and fixed some bugs and improved our theoretical foundations. As a result, our GPT-4 training run was (for us at least!) unprecedentedly stable, becoming our first large model whose training performance we were able to accurately predict ahead of time.

If your data was used to create GPT-3.5, which taught OpenAI something about training models that allow them not to have to use your data to train GPT-4, have you still made a creditable contribution to GPT-4? How far back should we follow this chain of logic?

Lessons Learned from Open Source Software

AI is not the first time that open communities have had to wrestle with the limits of attribution requirements, or with long, nested attribution documents. While there are examples of these across open communities, the most relevant examples probably come from open source software.

Software is made of other software. Most pieces of software contain a number of other pieces of open source software with licenses that require attribution. In practice, this takes the form of super long text files tucked into software releases.
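This is roughly how those long files come to exist. A minimal sketch, assuming a vendored third_party/ directory where each dependency ships its own LICENSE file (real tooling such as SBOM generators and license scanners does considerably more):

```python
# Minimal sketch: concatenate every LICENSE file under a hypothetical
# third_party/ directory into a single notices file. Real license tooling
# does considerably more than this.
from pathlib import Path

def build_notices(vendor_dir="third_party", out_path="THIRD_PARTY_NOTICES.txt"):
    """Bundle every LICENSE* file under vendor_dir; return how many were found."""
    license_files = sorted(Path(vendor_dir).glob("*/LICENSE*"))
    with open(out_path, "w", encoding="utf-8") as out:
        for license_file in license_files:
            out.write(f"==== {license_file.parent.name} ====\n")
            out.write(license_file.read_text(encoding="utf-8", errors="replace"))
            out.write("\n\n")
    return len(license_files)

if __name__ == "__main__":
    print(f"Bundled {build_notices()} license texts.")
```

Multiply that loop across a deep dependency tree and you get documents like the one mentioned next.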

As Kate Downing points out, the value of the attribution requirement in modern software seems to be pretty low. Software may come with pages and pages of attributions that comply with the license requirement (Downing links to a not-particularly-unique example where the attribution document is 15,385 pages long), but it is unclear if the existence of that document is much more than a compliance chore for a responsible maintainer.

To give another illustration of how long and recursive these documents can be, while drafting this post I also had to update the BIOS on my printer. The tool I was using required me to agree to the license, and then listed what would have been many pages of attribution. This happened to be the last attribution listed (and therefore the only one I actually saw):

screenshot of license attribution text crediting 'Ty Coon, President of Vice' dated 1 April 1989

How important is the attribution requirement to Ty Coon, President of Vice and creator of some component of my printer driver update utility? Hard to say. Update 7/15/24: as pointed out by Luis Villa on Mastodon, Ty Coon appears in the sample attribution statement at the bottom of earlier versions of the GPL. This occasionally creates confusion outside of this blog. The one snippet of credit I saw when using the utility was not even the (fake) name of one of the tool’s many contributors - it was just the end of the appendix to the license they used, pushing the actual names outside of my terminal window.

Given this history, would it be productive to import this practice into training AI models? I’m skeptical.

Update 7/17/24: Back in 2016, Luis Villa also wrote a series of blog posts about a very similar attribution problem, arguing against trying to use copyleft for databases. If you made it this far, they are very much worth checking out, because most of those problems reappear in the context of AI.

Hero Image: Pidgeon Hole. A Convent Garden Contrivance to Coop up the Gods