Maybe LLMs Won’t Raise 230 Questions?

Recently I read two blog posts about the intersection of Section 230 and generative AI, specifically LLMs. While they are both interesting, I think they skip over a potentially limiting constraint on the importance of these questions: the blast radius of any specific piece of AI-generated content on a website. It seems plausible that this blast radius - the damaging reach of a given piece of content - may be fairly limited, which would reduce the likelihood that 230 protections end up being super relevant.*

I agree with Professor Matt Perault in Lawfare that Section 230 does not currently cover content generated by generative AI managed/hosted/whatever by a given website. I also agree with John Bergmayer over at the Public Knowledge blog that this state of affairs is a good one, at least for now.

Where I may disagree with them is how often this kind of thing is likely to come up in a context that feels 230-familiar. (I say “may” because both of them are focused on a different part of this analysis, so I don’t know how they feel about this). AI is already raising legal issues, and websites will host third party material created by AI. But today’s deployment of AI may not raise new 230 issues.

One standard 230 fact pattern is Person A posts content on Website B. Person C objects to that content (because it defames them, or causes them some other harm), and sues Website B for hosting it. In most cases, Section 230 allows Website B to step aside, telling Person C to sue Person A if they don’t like the content.

A key element of this pattern is usually that Person A’s potentially harmful content is available for many people to see. That makes the potential blast radius for harmful information quite large.

However, current generative AI usage patterns tend to be a bit different. Services like Microsoft Bing’s Sydney or DuckDuckGo’s DuckAssist are designed to create custom content for an audience of one. That content can be hugely problematic. But in most cases the output isn’t available more broadly, which could severely limit the blast radius for the harmful information. A reduced blast radius makes it less likely that the harmed party will know about the harm, and that the harm will be significant enough to justify a lawsuit.

Of course, there are two other obvious scenarios where this type of AI could create 230 issues. One is where Person A uses a generative AI service to create content, and then brings that content to Website B. In that case, the fact that generative AI was used to create the content should not be particularly relevant to Website B’s ability to get out of the suit. It would not make very much sense to have a 230 carveout for harmful content that happened to be created by generative AI.

Which brings us to the other scenario. If Person A uses Website D to generate the problematic content, Website D might be pulled into any related litigation. That seems like a fact pattern outside of 230, and pretty much what Bergmayer is contemplating in his piece. In that case, it does seem to be at least facially reasonable to allow a court to explore Website D’s liability for the content. That could even be true if the Website D content just stays on Website D. While it would be harder to discover and document, Website D creating millions of bespoke pieces of content that slander Person C does feel like something that Website D could be sued for.

*I am super aware that projecting the future impact of technology based on current use patterns can be a recipe for disaster. Sorry future Michael for any problems or embarrassment this post causes you!

Feature image: Little Billy Bryan Chasing Butterflies from the Smithsonian National Portrait Gallery. I’m not going to pretend that it contains some larger commentary about this post. I was poking around looking for an image, happened to see this, and obviously needed to use it.

Pioneers of Open Access Report

Earlier this month the Engelberg Center and Creative Commons released a report on GLAM institutions that were early to adopt open access policies. This post originally ran on the Creative Commons site. The paper hosted here is the same paper as hosted there. The only difference is that I simplified the file name from the original “Final-Pioneers-of-Open-Three-Case-Studies.pdf-correctedByPAVE.pdf” because, well.

Ever wondered how it must have been for some of the first cultural heritage institutions to embark on their open access journey? Michael Weinberg, Executive Director of the Engelberg Center on Innovation Law & Policy at NYU Law, talked to three major institutions that helped shape the early open GLAM / open culture movement to find out. Here’s what he found.

The list of Galleries, Libraries, Archives, and Museums (GLAMs) with open access programs gets longer every day. However, those programs don’t just happen. They are the result of work from teams inside and outside of the institution.

Like the commons they create, the open access programs build on one another. Each open access program launched today uses lessons learned from programs that came before.

“Pioneers of Open Culture” contains three case studies of open GLAM early adopters. It examines some of the institutions that created open access programs in the early days of the movement.

The National Gallery of Art (United States), Statens Museum for Kunst, and New York Public Library are different institutions. They have different funding models, different relationships to government, and different styles of public engagement. In the years since they started, their open access programs have taken different directions. However, all three pioneered their own versions of successful open access programs.

None of these institutions would claim to have built their programs alone. They were part of communities, discussions, and practices that evolved along with them. At the same time, these institutions navigated their environment with many fewer models than are available today. That forced them to learn lessons that today’s institutions can take for granted. These case studies help shed light on that process.

Pioneers of Open Culture is not a comprehensive analysis of each institution’s open access program. It also does not explore all of the institutions that contributed to the early days of the open culture movement. Instead, it is an exploration of how some of the people who created and operated these programs understood their work. The goal is to provide a window into the process. This window might help those who want to follow similar paths.

While each case study has conclusions specific to the institution, a few points of commonality do begin to emerge:

Digital Infrastructure Matters. Successful open access programs are built on digital foundations that directly incorporate rights and rights awareness. Digital system redesigns were opportunities to build the possibility of open into an institution’s DNA. Well-designed digital backends also made it easier to experiment with smaller projects that were not true one-offs, but rather closely integrated into the institution’s technology infrastructure.

Experimentation is Important. Collections are diverse, as are the users who are interested in them. Open access programs succeed when there is space to try new things and create multiple points of entry into an institution’s collections. This is true for members of the public who want to explore the collection. It is also true of internal stakeholders who want to understand how open access can help them achieve their own goals. That space takes the form of financial support from within and without the institution, and of an institutional environment that welcomes experimentation.

Make the Easy Things Easy. Open access programs can be challenging to construct and sustain. Technology must be built. Collections must be designed. Rights statuses must be documented. That makes it important to use tools that make things easier whenever they exist. Those tools include legal tools such as the CC0 public domain dedication, and technical tools such as open source software. The reliability of these tools allows teams to focus on the hard parts of creating open access collections.

“Pioneers of Open Culture” brings color and context to the history of open access. Hopefully, understanding that history can help accelerate open access programs yet to be created and encourage people to embark on better sharing of cultural heritage worldwide.

I’m Not Sure That (If?) GitHub Copilot is a Problem

Last week a new GitHub Copilot investigation website created by Matthew Butterick brought the conversation about GitHub’s Copilot project back to the front of mind for many people, myself included. Copilot, a tool trained on public code that is designed to auto-suggest code to programmers, has been greeted with excitement, curiosity, skepticism, and concern since it was announced.

The GitHub Copilot investigation site’s arguments build on previous work by Butterick, as well as thoughtful analysis by Bradley M. Kuhn at the Software Freedom Conservancy. I find the arguments contained in these pieces convincing in some places and less convincing in others, so I’m writing this post in the hopes that it helps me begin to sort it all out.

At this point, Copilot strikes me as a tool that replaces googling for Stack Overflow answers. That seems like something that could be useful. It also seems plausible that training such a tool on public software repositories (including open source repositories) could be allowed under US copyright law. That may change if or when Copilot evolves, which makes this discussion a fruitful one to be having right now.

Both Butterick and Kuhn combine legal and social/cultural arguments in their pieces. This blog post starts with the social/cultural arguments because they are more interesting right now, and may impact the legal analysis as facts evolve in the future. Butterick and Kuhn make related arguments, so I’ll do my best to be clear which specific version of a point I’m engaging with at any given time. As will probably become clear, I generally find Kuhn’s approach and framing more insightful (which isn’t to say that Butterick’s lacks insight!).

What is Copilot, Really?

A large part of this discussion seems to turn on the best way to think about and analogize what Copilot is doing (the actual Copilot page does a pretty good job of illustrating how one might use it).

Butterick seems to think that the correct way to think about Copilot is as a search engine that points users to a specific part of a specific (often open source) software package. In his words, it is “a convenient alternative interface to a large corpus of open-source code”. He worries that this “selfish interface to open-source software” is built around “just give me what I want!” (emphasis his).

The selfish approach may deliver users to what they think they want, but in doing so hides the community that exists around the software and removes critical information that the code is licensed under an open source license that comes with obligations. If I understand the argument correctly, over time this act of hiding the community will drain open source software of its vitality. That makes Copilot a threat to open source software as a sustainable concept.

But…

The concern about hiding open source software’s community resonates with me. At the same time, Butterick’s starting point strikes me as off, at least in terms of how I search for answers to coding questions.

This is probably a good place to pause and note that I am a Very Bad coder who, nonetheless, does create some code that tends to be openly licensed and is just about always built on other open source code. However, I have nowhere near the skills required to make a meaningful technical contribution to someone else’s code.

Today, my “convenient alternative interface” for finding answers when I need to solve coding problems is Google. When I run into a coding problem, I either describe what I am trying to do or just paste the error message I’m getting into Google. If I’m lucky, Google will then point me to Stack Overflow, or a blog post, or documentation pages, or something similar. I don’t think that I have ever answered a coding question by ending up in a specific portion of open source code in a public repo. If I did, it seems unlikely that the code - even if it had great comments - would get me where I was going on its own, because I would not have the context required to quickly understand that it answered my question.

This distinction between “take me to part of open source code” (Butterick’s view) and “help me do this one thing” (my view) is important because when I look at the Copilot website, it feels like Copilot is currently marketed as a potentially useful Stack Overflow commenter, not someone with an encyclopedic knowledge of where that problem was solved in other open source code. Butterick experimented with Copilot in June and described the output as “This is the code I would expect from a talented 12-year-old who learned about JavaScript yesterday and prime numbers today.” That’s right at my level!

If you ask Copilot a question like “how can I parse this list and return a different kind of list?,” in most cases (but, as Butterick points out, not all!) it seems to respond with an answer synthesized from many different public code repositories instead of just pointing to a single “best answer” repo. That makes Copilot more of a Stack Overflow explorer than a public code explorer, albeit one that is itself trained by exploring public code. That feels like it reduces the type of harm that Butterick describes.
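
For the sake of concreteness, here is an invented example of the kind of exchange I have in mind. Both the prompt and the suggestion are hypothetical, not actual Copilot output, but a synthesized answer to a generic question tends to look like the sort of boilerplate that appears in countless public repos and is attributable to none of them in particular:

```python
# Hypothetical prompt: "parse this list of 'name:score' strings and
# return a list of (name, int) tuples". The suggestion below is
# invented for illustration, not actual Copilot output.
def parse_scores(raw):
    # Split each entry on ":" and convert the score to an int.
    return [(name, int(score)) for name, score in
            (entry.split(":") for entry in raw)]

print(parse_scores(["ada:95", "grace:98"]))  # [('ada', 95), ('grace', 98)]
```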

Use at Your Own Risk

Butterick and Kuhn also raise concerns about the fact that Copilot does not make any guarantees about the quality of code it suggests. Although this is a reasonable concern to have, it does not strike me as particularly unique to Copilot. Expecting Copilot to provide license-cleared and working code every time is benchmarking it against an unrealistic status quo.

While useful, the code snippets I find on Stack Overflow/blog posts/whatever are rarely properly licensed and are always “use at your own risk” (to the extent that they even work). Butterick and Kuhn’s concerns in this area feel equally applicable to most of my Stack Overflow/blog post answers. Copilot’s documentation is fairly explicit about the value of the code it suggests (“We recommend you take the same precautions when using code generated by GitHub Copilot that you would when using any code you didn’t write yourself.”), for whatever that is worth.

Will Copilot Create One Less Reason to Interact Directly with Open Source Code?

In Butterick’s view, another downside of this “just give me what I want” service is that it reduces the number of situations where someone might knowingly interact with open source code directly. How often do most users interact directly with open source code? As noted above, I interact with a lot of other people’s open source software as an extremely grateful user and importer of libraries, but not as a contributor. So Copilot would shift my direct deep interaction with open source code from zero to zero.

Am I an outlier? Nadia Asparouhova (née Eghbal)’s excellent book Working in Public provides insight into open source software grounded in user behavior on GitHub. In it, she tracks how most users of open source software are not part of the software’s active developer community:

“This distribution - where one or a few developers do most of the work, followed by a long tail of casual contributors, and many more passive users - is now the norm, not the exception, in open source.”

She also suggests that there may be too much community around some open source software projects, which is interesting to consider in light of Butterick’s concern about community depletion:

“The problem facing maintainers today is not how to get more contributors but how to manage a high volume of frequent, low-touch interactions. These developers aren’t building communities; they’re directing air traffic.”

That suggests that I am not necessarily an outlier. But maybe users like me don’t really matter in the grand scheme of open source software development. If Butterick is correct about Copilot’s impact on more active open source software developers, that could be a big problem.

Furthermore, even if users like me are representative today, and Copilot is not currently good enough to pull people away from interacting with open source code, might it be in the future?

“Maybe?” feels like the only reasonable answer to that question. As Kuhn points out, “AI is usually slow-moving, and produces incremental change far more often than it produces radical change.” Kuhn rightly argues that slow-moving change is not a reason to ignore a possible future threat. At the same time, it does present the possibility that a much better Copilot might itself be operating in an environment that has been subject to other radical changes. These changes might enhance or reduce that future Copilot’s negative impacts.

Where does that leave us? The kind of casual interaction with open source code that Butterick is concerned about may happen less than one might expect. At the same time, today’s Copilot does not feel like a replacement for someone who wants to take a deeper dive into a specific piece of open source software. A different version of Copilot might, but it is hard to anticipate what else would be different in a world where that version existed. Today’s version of Copilot does not feel like it quite manifests the threat described by Butterick.

Copilot is Trained on Open Source, Not Trained on Open Source

For some reason, I went into this research thinking that Copilot had explicitly been trained on open source software. That’s not quite right. Copilot was trained on public GitHub repositories. Those include many repositories of open source software. They also include many repositories of code that is just public, with no license, or a non-open license, or something else. So Copilot was trained on open source software in the sense that its training data includes a great deal of open source software. It was not trained on open source software in the sense that its training data only consists of open source software, or that its developers specifically sought out open source software as training data.

This distinction also happens to highlight an evolving trend in the open source world, where creators conflate public code with openly licensed code. As Asparouhova notes:

“But the GitHub generation of open source developers doesn’t see it that way, because they prioritize convenience over freedom (unlike free software advocates) or openness (unlike early open source advocates). Members of this generation aren’t aware of, nor do they really care about, the distinction between free and open source software. Neither are they fired up about evangelizing the idea of open source itself. They just publish their code on GitHub because, as with any other form of online content today, sharing is the default.”

As a lawyer who works with open source, I think the distinction between “openly/freely licensed” and “public” matters a lot. However, it may not be particularly important to people using publicly available software (regardless of the license) to get deeper into coding. While this may be a problem that is exacerbated by Copilot, I don’t know that Copilot fundamentally alters the underlying dynamics that feed it.

As noted at the top, and attested to by the body of this post so far, this post starts with the cultural and social critiques of Copilot because that is a richer area for exploration at this stage in the game. Nonetheless, the critiques are - quite reasonably - grounded in legal concerns.

Fair Use

The legal concerns are mostly about copyright and fair use. Normally, in order to make copies of software, you need permission from the creator. Open source software licenses grant those permissions in return for complying with specific obligations, like crediting the original creator.

However, if the copy being made of the software is protected by fair use, the copier does not need permission from the creator and can ignore any obligations in a license. In this case, GitHub is not complying with any open source licensing requirements because it believes that its copies are protected by fair use. Since it does not need permission, it does not need to comply with license requirements (although sometimes there are good reasons to comply with the social intent of licenses even if they are not legally binding…). It has said as much, although it (and its parent company Microsoft) has declined to elaborate further.

I read Butterick as implying that GitHub and Microsoft’s silence on the details of their fair use claim means that the claim itself is weak: “Why couldn’t Microsoft produce any legal authority for its position? Because [Kuhn and the Software Freedom Conservancy] is correct: there isn’t any.”

I don’t think that characterization is fair. Even if they believe that their claim is strong, GitHub cannot assume that it is so strong as to avoid litigation over the issue (see, e.g., the existence of the GitHub Copilot investigation website itself). They have every reason to avoid pre-litigating the fair use issue via blog post and press release, keeping their powder dry until real litigation.

Kuhn has a more nuanced (and correct, as far as I’m concerned) take on how to interpret the questions: “In fact, these areas are so substantially novel that almost every issue has no definitive answers”. While it is totally reasonable to push back on any claims that the law around this question is settled in GitHub’s favor (Kuhn, again: “We should simply ignore GitHub’s risible claim that the ‘fair use question’ on machine learning is settled.”), that is very different than suggesting that it is settled against GitHub.

How will this all shake out? It’s hard to say. Google scanned all the books in order to create search and analytics tools, claiming that their copies were protected by fair use. They were sued by The Authors Guild in the Second Circuit. Google won that case. Is scanning books to create search and analytics tools the same as scanning code to create AI-powered autocomplete? In some ways yes? In other ways no?

Google also won a case before the Supreme Court where they relied on fair use to copy API calls. But TVEyes lost a case where they attempted to rely on fair use in recording all television broadcasts in order to make it easy to find and provide clips. And the Supreme Court is currently considering a case involving Warhol paintings of Prince that could change fair use in unexpected ways. As Kuhn noted, we’re in a place of novel questions with no definitive answers.

What About the ToS?

As Franklin Graves pointed out, it’s also possible that GitHub’s Terms of Service allow it to use anything in any repo to build Copilot without worrying about additional copyright permissions. If that’s the case, they won’t even need to get to the fair use part of the argument. Of course, there are probably good reasons that GitHub is not working hard to publicize the fact that their ToS might give them lots of room when it comes to making use of user uploads to the site.

Where Does That Leave Things?

To start with, I think it is responsible for advocates to get out ahead of things like this. As Kuhn points out:

“As such, we should not overestimate the likelihood that these new systems will both accelerate proprietary software development, while we simultaneously fail to prevent copylefted software from enabling that activity. The former may not come to pass, so we should not unduly fret about the latter, lest we misdirect resources. In short, AI is usually slow-moving, and produces incremental change far more often than it produces radical change. The problem is thus not imminent nor the damage irreversible. However, we must respond deliberately with all due celerity — and begin that work immediately.”

At the same time, I’m not convinced that Copilot is a problem. Is it possible that a future version of Copilot would starve open source software of its community, or allow people to effectively rebuild open source code outside of the scope of the original license? It is, but it seems like that version of Copilot would be meaningfully different from the current version in ways that feel hard to anticipate. Today’s Copilot feels more like a fast lane to possibly-useful Stack Overflow answers than an index that can provide unattributed snippets of all open source software.

As it is, the acute threat Copilot presents to open source software today feels relatively modest. And the benefits could be real. There are uses of today’s Copilot that could make it easier for more people to get into coding - even open source coding. Sometimes the answer of a talented 12-year-old is exactly what you need to get over the hump.

Of course, GitHub can be right about fair use AND Copilot can be useful AND it would still be quite reasonable to conclude that you want to pull your code from GitHub. That’s true even if, as Butterick points out, GitHub being right about fair use means that code anywhere on the internet could be included in future versions of Copilot.

I’m glad that the Software Freedom Conservancy is getting out ahead of this and taking the time to be thoughtful about what it means. I’m also curious to see if Butterick ends up challenging things in a way that directly tests the fair use questions.

Finally, this entire discussion may also end up being a good example of why copyright is not the best tool to use against concerns about ML dataset building. Looking to copyright for solutions has the potential to stretch copyright law in strange directions, cause unexpected side effects, and misaddress the thing you really care about. That is something that I am always wary of, and a prior that informs my analysis here. Of course, Amanda Levandowski makes precisely the opposite argument in her article Resisting Face Surveillance with Copyright Law.

Image: Ancient Rome from the Met’s open access collection.

Lincoln Hand Shifter Knob

gif of Lincoln hand shifter as installed

In the interest of celebrating the weirdness of open data, I want to share a quick project that exists because of open data: Abraham Lincoln’s left hand as the shifter knob of a 1995 Mazda truck.

The whole thing was pretty straightforward. In fact, the hardest part was probably finding the right shifter knob adapter for the truck. All that was required was:

  • Download the Lincoln hand scans from the Smithsonian open access site.

  • Use Tinkercad to put a hole in the back of the hand (a scripted alternative is sketched below, after this list).

  • 3D print the hand and use epoxy to attach the adapter.

hand and adapter together

spinning gif of combined version

  • Install it.

image of Lincoln hand shifter as installed
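
As an aside, the Tinkercad step could also be scripted. Below is a minimal sketch using the trimesh Python library; the file names and dimensions are made up, and trimesh’s boolean operations require a backend (such as Blender or the manifold3d package) to be installed:

```python
import trimesh

# Load the Smithsonian scan (file name is hypothetical).
hand = trimesh.load("lincoln_left_hand.stl")

# A cylinder sized for a hypothetical adapter: 12 mm diameter, 25 mm deep.
socket = trimesh.creation.cylinder(radius=6.0, height=25.0)

# Position the cylinder at the back of the hand. This translation is a
# placeholder; the right values depend on the scan's orientation.
socket.apply_translation([0.0, 0.0, hand.bounds[0][2] + 12.5])

# Subtract the cylinder from the hand to leave a socket for the adapter.
result = trimesh.boolean.difference([hand, socket])
result.export("lincoln_hand_with_socket.stl")
```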

Printables, Honda, Platforms, and Nastygrams

Last week the 3D printing platform Printables removed an unknown number of models from their platform. This action was apparently in response to a letter Printables received from Honda claiming that user models infringed on various rights. Based on the discussion of the action in the Printables forum, it appears that at least some of Honda’s claims may have been related to the use of Honda’s trademarks in either model geometry or model descriptions.

Many people have criticized Honda’s decision to send this letter in the first place. For example, while I have some quibbles with the legal details in this Hackaday post, I think its criticism - that Honda failed to meet its community where it is - is directionally correct. Others, including me, also directed some criticism at Printables itself for what appears to be, from the outside (always an unreliable evaluation viewpoint), a fairly uncritical acquiescence to Honda’s claims. (In my defense, describing the letter as “a huge legal document” imposing a “very tight deadline” in explaining why the takedown happened does not exactly suggest a carefully considered review.)

In any event, I’ve written about these types of (potentially) overly broad takedown claims before, and about the structural incentives that can punish platforms for viewing them critically.

Instead of just complaining about everyone’s behavior, in this post I want to be productive. The post will try and walk through how I would think about processing and responding to this kind of letter. Since, like everything on this site, this post is not legal advice, I’m going to sidestep the legal details and focus on more operationally-oriented steps (if you are curious, the posts linked to above provide some legal context). Those legal details will matter when trying to actually implement anything like this approach on a specific platform (especially across jurisdictions). However, they are not necessary to follow the general flow.

Step 1: Take a breath, read, and sort

It is important to remember that no one just happens to send a long, scary-looking letter on fancy letterhead that includes a short deadline for response. These letters - sometimes referred to by lawyers as ‘nastygrams’ - are designed to intimidate and encourage compliance.

That does not mean that you should ignore them! But it does mean that you should keep that in mind when you are reading them. That’s why the first thing to do when receiving a nastygram is to take a deep breath and remember that the letter is, at least partially, designed to intimidate you.

The second thing to do is to actually read the letter and map out what it says. Specifically, what rights is the sender actually claiming, and how are they connecting those rights to specific models (either individually or as a class)? Lawyers can sometimes try and bluff their way through these details, so read the letter critically. The details will matter later on.

Once you have read the letter, try and sort the claims and models into specific buckets. Does the letter claim that some models infringe copyrights while others infringe trademarks? Are the objections to the models themselves, or to the language describing the models? Something else entirely?
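
For platforms that track this sort of thing in software, here is one hypothetical way to record each claim so the buckets are explicit. This is a sketch only; every identifier and field name is invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class TakedownClaim:
    model_id: str       # which model the letter objects to ("unknown" if unspecified)
    right_claimed: str  # e.g. "copyright", "trademark", or "unspecified"
    target: str         # e.g. "model geometry" or "model description"
    specific: bool      # does the letter tie a named right to this model?

# Hypothetical claims pulled out of a letter during the mapping exercise.
claims = [
    TakedownClaim("logo-keychain", "trademark", "model description", True),
    TakedownClaim("unknown", "unspecified", "model geometry", False),
]

# Specific, well-formed claims are candidates for quick action (the next
# step); vague gestures at unspecified rights go in the pile that deserves
# clarifying questions back to the sender (Step 3).
actionable = [c for c in claims if c.specific and c.right_claimed != "unspecified"]
needs_clarification = [c for c in claims if c not in actionable]
```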

Step 2: Act on the easy stuff

If the letter includes all of the elements of a true DMCA takedown, claims that specific models infringe on copyright, and lists the models, it should be easy to deal with those models with an existing DMCA process. No reason to wait. If the letter includes trademark claims, try and make some triage decisions. Not all uses of a trademark infringe on the mark! If you are lucky and have thought about this stuff in advance (see below), act on any models that are easy calls. Do so knowing it can be ok to take more time considering the models that feel closer to the line.

Step 3: Reach out and ask for clarification

Once you have your head around what the letter is really asking of you and have made some easy decisions, it may be time to reach out to the party that sent it. Reaching out can show the sender that you exist and are a good faith actor. Tell them what you have done, and ask for clarifications to help you evaluate whatever is left.

There are a few reasons to reach out even if you have not immediately and fully complied with the letter’s request. With regard to the party that sent you the letter, it is likely that they send these kinds of letters to all sorts of sketchy, bad-faith actors and never receive any sort of response. Responding signals to them that there is a real person at the platform who is paying attention, taking their concerns seriously, and acting in good faith. Depending on how you structure your questions, it can also be a way to signal that you won’t be intimidated by broad gestures at unspecified ‘rights’ that are not tied to specific claims.

The second audience for your response is a court. If things go totally sideways, your dispute may end up in front of a court. At that point, the judge or jury will need to decide if you are the horrible pirate den that you are accused of being, or a responsible, responsive community of creative people trying to balance many competing rights. Building a record of constructive engagement can be helpful in making the case that you fall into the second category.

In formulating your response, it can also be helpful to have done some thinking in advance about what you might want to push back on and why. Are you a platform that is content to let large rightsholders define the rules, even if large rightsholders want to create rules that give them much more power than they are legally entitled to? Or are you trying to create a space where people can engage with the world in a way that recognizes that rights exist and have limitations? This can be a harder decision than it might appear. Not every platform sees itself as working with intentionality to create space for its users. That’s why it is helpful to consider it outside of a crisis context. Understanding your own framework will help you calibrate your response.

Step 4: Be as transparent as possible

Whatever you end up doing, take steps to explain to targeted users and the community exactly what is going on. There will be limits to your transparency - to protect users, the platform, and even the party that sent you the letter in the first place. However, to the extent possible, explain what rights are alleged to be infringed upon, how you evaluate those claims, and what steps all parties can take to avoid problems in the future.

None of this will eliminate conflicts between external rightsholders, users, and a platform. However, if done right, it introduces a degree of accountability into the process for everyone involved. If nothing else, that helps to make sure that the balance struck by the rules governing a platform is reasonably related to the balance struck by the law.

Header Image: The Board of Censors Moves Out from the Met’s open access collection.