April 05, 2023 - Michael Weinberg

A(nother) Reply to Josef Průša

Last week Josef Průša, the eponymous proprietor of the important open source hardware company Prusa Research, shared a post about the state of open source in 3D printing.

This is exciting. One of the things I like so much about the open source hardware community is how broadly it embraces openness. That includes having open, public conversations about what it means to do open source hardware, and to run an open source hardware company.

The Prusa post raises some interesting questions. Some of them are new. Others may not be as new (although that does not mean that they are resolved). Thea Flowers (not incidentally, like me, also an OSHWA board member) has already jumped into the conversation the post calls for with a reply. It’s a thoughtful one, and if you care about these questions I would strongly recommend reading it.

A big part of the Prusa post talks about legal issues and licenses. This is something I have THOUGHTS and OPINIONS about, which means I have a few questions. Those questions tend to be about getting more specific information about the types of problems the Prusa post identifies, and about the types of control it assumes it can impose via license. Right now, I feel like I don’t fully understand the problem, or how an intellectual property-based license might address it.

Before getting into the legal stuff, I want to touch on one or two other things.

Goals are Important

One of the things I really appreciated about the Prusa post is this statement:

But community development isn’t the main reason why we offer our products as open source. Our main goal has always been to make our printers easy to maintain and modify, so people and companies can play and experiment with software and hardware.

(emphasis in original)

The thing I really appreciate about this statement is that it articulates a goal for using an open source approach. Open source hardware is a strategy. Like any strategy, it is appropriate in some situations but not others. Also, like any strategy, it is not magic. (More about this in the 2020 Open Source Hardware Weather Report.)

Articulating a goal for using open source hardware gives you a way to evaluate your strategy. You can decide if it is working, if it is the best way to achieve your goal, and if you need to invest more or less in it.

Is open source helping Prusa Research achieve its goals in a way that justifies the cost? I’m not sure. I am glad that they have something concrete to use to evaluate it.

Assertions I am Less Sure About

The Prusa post also makes some assertions about the nature and state of open source hardware. I’m not convinced that either of them is true, although I am open to being convinced. I would love to see a deeper discussion about both of them, and more data about the second point to the extent that it exists.

Assertion #1: Open Source Relies on Everyone Playing by the Rules

The Prusa post says:

The open-source movement relies on the fact that everyone involved plays by the same rules.

(emphasis in original)

I’m not sure that has ever been the case, or that it even should be the case. Open source absolutely relies on some critical mass of people playing by the same rules (Phillip Torrone famously started the process of trying to codify some common rules for open source hardware back in 2012). What makes up a “critical mass” will vary depending on the community and the nature of the work they are doing.

That being said, there are always going to be people outside of the community. Many of those people will benefit from the work of the community without contributing to it, or caring about it, or even being aware of its existence. Some of those people may even “exploit” the community (although, as Thea points out, that line can be a hard one to draw when you start looking at it closely). That’s ok, as long as there are enough people in the community to achieve its goals. To a first approximation, all open source communities operate in this environment.

Put another way, open source relies on enough people playing by the same rules to keep the community engaged. Someone outside of the community benefitting from the community’s work in a way the community views as unreasonable can absolutely cause people to leave. And if enough people leave there is a problem. But the mere existence of these people is not fatal to any open source community. Viewing open source community challenges as a universal compliance problem could cause you to come to all sorts of inaccurate conclusions.

Assertion #2: The Situation is Changing

The Prusa post also says:

But in recent years, I feel that the situation is changing. More and more companies are breaking and bending the rules, and the community is not nearly as resistant to their actions as it once was.

(emphasis in original)

This is a strong assertion. If it is true it would be very interesting. However, I don’t know that we have very much evidence of a major change (yet).

Open source hardware companies have struggled with actors outside of the community for years (“Cloning ain’t cool” is one of the unofficial rules from 2012). We’ve seen claims that companies are breaking the rules - and that rule breaking is pushing open source hardware companies away from being open - since at least the time of Makerbot’s decision to back away from open source.

Are things different now? Maybe? On its face, the concerns that the Prusa post raises seem fairly similar to those raised by Bre Pettis around Makerbot clones in 2012. They have been echoed by other companies in the decade since. Like Prusa, I suspect many other open source hardware company leaders have lamented that “After a minor internet storm, the situation calms down, and the code remains closed (or only part of it is opened), and after a few weeks, everyone forgets.” after their hardware is cloned by a competitor.

I would actually love to hear more from Prusa about how he understands the nature of this dynamic to have changed over the years. Absent that, while the situation may feel different within Prusa Research, I’m not sure we have enough evidence to say that it has changed for open source hardware in general. And if things have changed within Prusa Research, I’d love to know more about that too!

The Legal Parts (Some Questions About Why This is Necessary)

I think it is useful to think about new open source licenses, about the goals you want them to achieve, and not get too bogged down in whether or not those new licenses or goals are compatible with “open source” as a platonic ideal (that’s part of the reason I was excited to participate in creating the ml5.js ethical open source license).

However, licenses are not self-executing or self-enforcing. Nor are they the only way to enforce behavioral norms. They rely on people deciding to enforce them, and actual rights to license. Therefore, when thinking about creating a new license, it can be helpful to understand the current state of affairs and why that state is not working. To that end, I have a bunch of questions:

What is Prusa Research currently doing to enforce its existing rights, and how are those efforts falling short?

Trademarks are usually the most powerful rights an open source hardware company has. What trademarks does Prusa have? Does it take steps to patrol their use (say, on large online marketplaces)? If it does, and that is not working, why not? If it doesn’t, why not?

(by the way, a panel on how successful open source hardware companies enforce their rights would be a super interesting Open Hardware Summit panel that I will absolutely be proposing next year)

What About the CERN Licenses?

Open source hardware companies often have copyrights as well. The Prusa post states (correctly) that the GPL license is not really optimized for open source hardware. However, while the GPL is imperfect when it comes to hardware, it does exist. Have there been violations of that license? If so, did Prusa pursue them? Why or why not? If not, why would a more restrictive license change their approach?

Perhaps more importantly, has Prusa looked at the new(ish) CERN licenses? How might its enforcement experience change under that regime? How do those licenses fail to address the problems in the market? Are those failures curable with different license language, or are they inherent to the types of intellectual property rights that attach to a 3D printer?

Are There Specific Examples of Bad Behavior That Could be Controlled by a License?

Some parts of 3D printers cannot be protected by any intellectual property rights (see this whitepaper for more). Without a right to license, the terms of the license do not matter. Therefore, it would be helpful for Prusa to provide specific (or specific-ish) examples of what it considers bad behavior that would violate a more restrictive hardware license.

The Prusa post talks about not releasing electronics plans for the new MK4 printer. I don’t know how easy those boards would be to reverse engineer. I do know that it is unlikely that they are protected by any sort of intellectual property right that could be licensed in a way that would control other users. Thinking about new licenses would be easier with a clear understanding of the type of behavior that the license is intended to control.

The Legal Parts (Some Questions about the License Working Points)

The end of the Prusa post includes a list of working points that could form the core of a new license. In thinking about how to incorporate them into a license, I have some clarifying questions:

How would you think about defining “clearly stating” authorship on the product or software?

If you’re using some code or blueprints to bring software or hardware to market, the original code’s authorship must be clearly stated on the product or in the software. Additionally, deleting copyright information from headers and history from repositories is prohibited.

Naming authors can be straightforward when there are only one or two. Slic3r has 103 contributors. PrusaSlicer is built on Slic3r and has 166 (I don’t know how many of those contributors overlap). Does each of these people need to be listed? What happens if they disagree about what it means to be “clearly” listed? Is it ok to provide a hyperlink to a page that lists all of them, or do their names (or handles, or both) need to be contained on the product itself? If it needs to be contained on the product, do we need to talk about minimum font sizes? If there is a fight about the clarity of notice, how is that fight resolved?

How would you think about defining a clone?

The production of nearly exact 1:1 clones for commercial purposes is not allowed.

In trying to understand if a copy is a clone, should we look at all of the elements of the first piece of hardware, or just the contributions that the first hardware manufacturer made? Does it matter if any of those contributions are eligible for any sort of intellectual property protection? Are those contributions listed somewhere so that someone knows how to avoid being accused of being a clone? Is “nearly” calculated as a raw number (“we made 10 contributions and you copied nine”) or as some sort of weighted percentage (“we made 10 contributions, and you copied the three most important ones”)?

What is a license for manufacturing spare parts actually licensing?

License for manufacturing spare parts is valid for service, modification, or educational purposes.

Upgrades and additional modifications based on original parts are allowed and welcome.

Parts that can be considered consumables (e.g., thermistors, heater blocks, fans, printing plates, etc.) can be manufactured and sold commercially after the verification by the licensor based on the presentation of samples.

Some points seem to assume that a manufacturer of a piece of hardware gets to control all third party parts for that hardware. This is not usually the case. In what situations would anyone actually need a license to manufacture spare parts? What about consumables (ask the manufacturers of 2D printers who keep trying - and failing - to force people to buy replacement toner from them how that is going)? What rights would someone be violating if they manufactured spare replacement parts/consumables without the license?

What does it mean to cease activity?

If the licensor ceases its activity, the non-commercial clause is terminated.

Sometimes companies totally disappear. Other times, they are acquired, or they merge, or their assets are put up for auction. The final point states that the non-commercial clause is terminated if the licensor ceases its activities. Why is ceasing activity a trigger for that, and should any of these kind-of-stopping-but-not-quite-stopping activities trigger it too?

What Does This License License?

There is one meta-question threaded through much of this post - what rights, exactly, would this new license be licensing? This question is always important (at least to lawyers like me) but it becomes more important as the license becomes more restrictive. I might not be sure if I am legally bound by the Creative Commons Attribution license on a piece of hardware, but it’s pretty easy to give attribution just in case. If the licensor is trying to limit me to non-commercial uses of that hardware, I will care a lot more about what might trigger that restriction (and how to sidestep it).

None of my questions - including this final one - are intended to be gotcha questions, or to nit pick this proposal to pieces. As I said at the top, I think this is an interesting conversation, and it is one I want to take seriously. For me, taking it seriously means thinking about what kinds of information would be helpful when considering it. That’s what I’ve tried to do here.

I hope this does end up being a real discussion in the open source hardware community. I’m also open to being convinced that we need to do something new. However, in order to be convinced, I need more information than I have right now.

Feature image: Political Discussion in a Lumber Shanty from the Smithsonian Open Access collection

March 10, 2023 - Michael Weinberg

Maybe LLMs Won’t Raise 230 Questions?

Recently I read two blog posts about the intersection of Section 230 and generative AI, specifically LLMs. While they are both interesting, I think they skip over a potentially limiting constraint on the importance of these questions: the blast radius of any specific piece of AI generated content on a website. Specifically, it seems plausible that the blast radius - or damaging reach - of the piece of content may be fairly limited, which would reduce the likelihood that 230 protections end up being super relevant.*

I agree with Professor Matt Perault in Lawfare that Section 230 does not currently cover content generated by generative AI managed/hosted/whatever by a given website. I also agree with John Bergmayer over at the Public Knowledge blog that this state of affairs is a good one, at least for now.

Where I may disagree with them is how often this kind of thing is likely to come up in a context that feels 230-familiar. (I say “may” because both of them are focused on a different part of this analysis, so I don’t know how they feel about this). AI is already raising legal issues, and websites will host third party material created by AI. But today’s deployment of AI may not raise new 230 issues.

One standard 230 fact pattern is Person A posts content on Website B. Person C objects to that content (because it defames them, or causes them some other harm), and sues Website B for hosting it. In most cases, Section 230 allows Website B to step aside, telling Person C to sue Person A if they don’t like the content.

A key element of this pattern is usually that Person A’s potentially harmful content is available for many people to see. That makes the potential blast radius for harmful information quite large.

However, current generative AI usage patterns tend to be a bit different. Services like Microsoft Bing’s Sydney, or DuckDuckGo’s DuckAssist are designed to create custom content for an audience of one. That content can be hugely problematic. But in most cases the output isn’t available more broadly. That could severely limit the blast radius for the harmful information. A reduced blast radius makes it less likely the harmed party will know about the harm, and that the harm will be significant enough to justify a lawsuit.

Of course, there are two other obvious scenarios where this type of AI could create 230 issues. One is where Person A uses a generative AI service to create content, and then brings that content to Website B. In that case, the fact that generative AI was used to create the content should not be particularly relevant to Website B’s ability to get out of the suit. It would not make very much sense to have a 230 carveout for harmful content that happened to be created by generative AI.

Which brings us to the other scenario. If Person A uses Website D to generate the problematic content, Website D might be pulled into any related litigation. That seems like a fact pattern outside of 230, and pretty much what Bergmayer is contemplating in his piece. In that case, it does seem to be at least facially reasonable to allow a court to explore Website D’s liability for the content. That could even be true if the Website D content just stays on Website D. While it would be harder to discover and document, Website D creating millions of bespoke pieces of content that slander Person C does feel like something that Website D could be sued for.

*I am super aware that projecting the future impact of technology based on current use patterns can be a recipe for disaster. Sorry future Michael for any problems or embarrassment this post causes you!

Feature image: Little Billy Bryan Chasing Butterflies from the Smithsonian National Portrait Gallery. I’m not going to pretend that it contains some larger commentary about this post. I was poking around looking for an image, happened to see this, and obviously needed to use it.

January 27, 2023 - Michael Weinberg

Pioneers of Open Access Report

Earlier this month the Engelberg Center and Creative Commons released a report on GLAM institutions who were early to adopt open access policies. This post originally ran on the Creative Commons site. The paper hosted here is the same paper as hosted there. The only difference is that I simplified the file name from the original “Final-Pioneers-of-Open-Three-Case-Studies.pdf-correctedByPAVE.pdf” because, well.

Ever wondered how it must have been for some of the first cultural heritage institutions to embark on their open access journey? Michael Weinberg, Executive Director of the Engelberg Center on Innovation Law & Policy at NYU Law, talked to three major institutions that helped shape the early open GLAM / open culture movement to find out. Here’s what he found.

The list of Galleries, Libraries, Archives, and Museums (GLAMs) with open access programs gets longer every day. However, those programs don’t just happen. They are the result of work from teams inside and outside of the institution.

Like the commons they create, the open access programs build on one another. Each open access program launched today uses lessons learned from programs that came before.

“Pioneers of Open Culture” contains three case studies of open GLAM early adopters. It examines some of the institutions that created open access programs in the early days of the movement.

The National Gallery of Art (United States), Statens Museum for Kunst, and New York Public Library are different institutions. They have different funding models, different relationships to government, and different styles of public engagement. In the years since they started, their open access programs have taken different directions. However, all three pioneered their own versions of successful open access programs.

None of these institutions would claim to have built their programs alone. They were part of communities, discussions, and practices that evolved along with them. At the same time, these institutions navigated their environment with many fewer models than are available today. That forced them to learn lessons that today’s institutions can take for granted. These case studies help shed light on that process.

Pioneers of Open Culture is not a comprehensive analysis of each institution’s open access program. It also does not explore all of the institutions that contributed to the early days of the open culture movement. Instead, it is an exploration of how some of the people who created and operated these programs understood their work. The goal is to provide a window into the process. This window might help those who want to follow similar paths.

While each case study has conclusions specific to the institution, a few points of commonality do begin to emerge:

Digital Infrastructure Matters. Successful open access programs are built on digital foundations that directly incorporate rights and rights awareness. Digital systems redesigns were opportunities to build the possibility of open into an institution’s DNA. Well designed digital backends also made it easier to experiment with smaller projects that were not true one-offs, but rather closely integrated into the institution’s technology infrastructure.

Experimentation is Important. Collections are diverse, as are the users who are interested in them. Open access programs succeed when there is space to try new things, and create multiple points of entry into an institution’s collections. This is true for members of the public who want to explore the collection. It is also true of internal stakeholders who want to understand how open access can help them achieve their own goals. Space takes the form of financial support from within and without the institution. It also takes the form of an institutional environment that is welcoming to experimentation.

Make the Easy Things Easy. Open access programs can be challenging to construct and sustain. Technology must be built. Collections must be designed. Rights statuses must be documented. That makes it important to use tools that make things easier whenever they exist. Those tools include legal tools such as the CC0 public domain dedication, and technical tools such as open source software. The reliability of these tools allows teams to focus on the hard parts of creating open access collections.

“Pioneers of Open Culture” brings color and context to the history of open access. Hopefully, understanding that history can help accelerate open access programs yet to be created and encourage people to embark on better sharing of cultural heritage worldwide.

October 24, 2022 - Michael Weinberg

I’m Not Sure That (If?) GitHub Copilot is a Problem

Last week a new GitHub Copilot investigation website created by Matthew Butterick brought the conversation about GitHub’s Copilot project back to the front of mind for many people, myself included. Copilot, a tool trained on public code that is designed to auto-suggest code to programmers, has been greeted by excitement, curiosity, skepticism, and concern since it was announced.

The GitHub Copilot investigation site’s arguments build on previous work by Butterick, as well as thoughtful analysis by Bradley M. Kuhn at the Software Freedom Conservancy. I find the arguments contained in these pieces convincing in some places and not as convincing in others, so I’m writing this post in the hopes that it helps me begin to sort it all out.

At this point, Copilot strikes me as a tool that replaces googling for stack overflow answers. That seems like something that could be useful. It also seems plausible that training such a tool on open public software repositories (including open source repositories) could be allowed under US copyright law. That may change if or when Copilot evolves, which makes this discussion a fruitful one to be having right now.

Both Butterick and Kuhn combine legal and social/cultural arguments in their pieces. This blog post starts with the social/cultural arguments because they are more interesting right now, and may impact the legal analysis as facts evolve in the future. Butterick and Kuhn make related arguments, so I’ll do my best to be clear which specific version of a point I’m engaging with at any given time. As will probably become clear, I generally find Kuhn’s approach and framing more insightful (which isn’t to say that Butterick’s lacks insight!).

What is Copilot, Really?

A large part of this discussion seems to turn on the best way to think about and analogize what Copilot is doing (the actual Copilot page does a pretty good job of illustrating how one might use it).

Butterick seems to think that the correct way to think about Copilot is as a search engine that points users to a specific part of a specific (often open source) software package. In his words, it is “a convenient alternative interface to a large corpus of open-source code”. He worries that this “selfish interface to open-source software” is built around “just give me what I want!” (emphasis his).

The selfish approach may deliver users to what they think they want, but in doing so hides the community that exists around the software and removes critical information that the code is licensed under an open source license that comes with obligations. If I understand the argument correctly, over time this act of hiding the community will drain open source software of its vitality. That makes Copilot a threat to open source software as a sustainable concept.

But…

The concern about hiding open source software’s community resonates with me. At the same time, Butterick’s starting point strikes me as off, at least in terms of how I search for answers to coding questions.

This is probably a good place to pause and note that I am a Very Bad coder who, nonetheless, does create some code that tends to be openly licensed and is just about always built on other open source code. However, I have nowhere near the skills required to make a meaningful technical contribution to someone else’s code.

Today, my “convenient alternative interface” to finding answers when I need to solve coding problems is google. When I run into a coding problem, I either describe what I am trying to do or just paste the error message I’m getting into google. If I’m lucky, google will then point me to stack overflow, or a blog post, or documentation pages, or something similar. I don’t think that I have ever answered a coding question by ending up in a specific portion of open source code in a public repo. If I did, it seems unlikely that code - even if it had great comments - would get me where I was going on its own because I would not have the context required to quickly understand that it answered my question.

This distinction between “take me to part of open source code” (Butterick’s view) and “help me do this one thing” (my view) is important because when I look at the Copilot website, it feels like Copilot is currently marketed as a potentially useful stack overflow commenter, not someone with an encyclopedic knowledge of where that problem was solved in other open source code. Butterick experimented with Copilot in June and described the output as “This is the code I would expect from a talented 12-year-old who learned about JavaScript yesterday and prime numbers today.” That’s right at my level!

If you ask Copilot a question like “how can I parse this list and return a different kind of list?,” in most cases (but, as Butterick points out, not all!) it seems to respond with an answer synthesized from many different public code repositories instead of just pointing to a single “best answer” repo. That makes Copilot more of a stack overflow explorer than a public code explorer, albeit one that is itself trained by exploring public code. That feels like it reduces the type of harm that Butterick describes.

Use at Your Own Risk

Butterick and Kuhn also raise concerns about the fact that Copilot does not make any guarantees about the quality of code it suggests. Although this is a reasonable concern to have, it does not strike me as particularly unique to Copilot. Expecting Copilot to provide license-cleared and working code every time is benchmarking it against an unrealistic status quo.

While useful, the code snippets I find in stack overflow/blog post/whatever are rarely properly licensed and are always “use at your own risk” (to the extent that they even work). Butterick and Kuhn’s concerns in this area feel equally applicable to most of my stack overflow/blog post answers. Copilot’s documentation is fairly explicit about the value of the code it suggests (“We recommend you take the same precautions when using code generated by GitHub Copilot that you would when using any code you didn’t write yourself.”), for whatever that is worth.

Will Copilot Create One Less Reason to Interact Directly with Open Source Code?

In Butterick’s view, another downside of this “just give me what I want” service is that it reduces the number of situations where someone might knowingly interact with open source code directly. How often do most users interact directly with open source code? As noted above, I interact with a lot of other people’s open source software as an extremely grateful user and importer of libraries, but not as a contributor. So Copilot would shift my direct deep interaction with open source code from zero to zero.

Am I an outlier? Nadia Asparouhova (née Eghbal)’s excellent book Working in Public provides insight into open source software grounded in user behavior on GitHub. In it, she tracks how most users of open source software are not part of the software’s active developer community:

“This distribution - where one or a few developers do most of the work, followed by a long tail of casual contributors, and many more passive users - is now the norm, not the exception, in open source.”

She also suggests that there may be too much community around some open source software projects, which is interesting to consider in light of Butterick’s concern about community depletion:

“The problem facing maintainers today is not how to get more contributors but how to manage a high volume of frequent, low-touch interactions. These developers aren’t building communities; they’re directing air traffic.”

That suggests that I am not necessarily an outlier. But maybe users like me don’t really matter in the grand scheme of open source software development. If Butterick is correct about Copilot’s impact on more active open source software developers, that could be a big problem.

Furthermore, even if users like me are representative today, and Copilot is not currently good enough to pull people away from interacting with open source code, might it be in the future?

“Maybe?” feels like the only reasonable answer to that question. As Kuhn points out, “AI is usually slow-moving, and produces incremental change far more often than it produces radical change.” Kuhn rightly argues that slow-moving change is not a reason to ignore a possible future threat. At the same time, it does present the possibility that a much better Copilot might itself be operating in an environment that has been subject to other radical changes. These changes might enhance or reduce that future Copilot’s negative impacts.

Where does that leave us? The kind of casual interaction with open source code that Butterick is concerned about may happen less than one might expect. At the same time, today’s Copilot does not feel like a replacement for someone who wants to take a deeper dive into a specific piece of open source software. A different version of Copilot might, but it is hard to imagine the other things that might be different in the event that version existed. Today’s version of Copilot does not feel like it quite manifests the threat described by Butterick.

Copilot is Trained on Open Source, Not Trained on Open Source

For some reason, I went into this research thinking that Copilot had explicitly been trained on open source software. That’s not quite right. Copilot was trained on public Github repositories. Those include many repositories of open source software. They also include many repositories of code that is just public, with no license, or a non-open license, or something else. So Copilot was trained on open source software in the sense that its training data includes a great deal of open source software. It was not trained on open source software in the sense that its training data only consists of open source software, or that its developers specifically sought out open source software as training data.

This distinction also happens to highlight an evolving trend in the open source world, where creators conflate public code with openly licensed code. As Asparouhova notes:

“But the GitHub generation of open source developers doesn’t see it that way, because they prioritize convenience over freedom (unlike free software advocates) or openness (unlike early open source advocates). Members of this generation aren’t aware of, nor do they really care about, the distinction between free and open source software. Neither are they fired up about evangelizing the idea of open source itself. They just publish their code on GitHub because, as with any other form of online content today, sharing is the default.”

As a lawyer who works with open source, I think the distinction between “openly/freely licensed” and “public” matters a lot. However, it may not be particularly important to people using publicly available software (regardless of the license) to get deeper into coding. While this may be a problem that is exacerbated by Copilot, I don’t know that Copilot fundamentally alters the underlying dynamics that feed it.

Is This Legal?

As noted at the top, and attested to by the body of this post so far, this post starts with the cultural and social critiques of Copilot because that is a richer area for exploration at this stage in the game. Nonetheless, the critiques are - quite reasonably - grounded in legal concerns.

Fair Use

The legal concerns are mostly about copyright and fair use. Normally, in order to make copies of software, you need permission from the creator. Open source software licenses grant those permissions in return for complying with specific obligations, like crediting the original creator.

However, if the copy being made of the software is protected by fair use, the copier does not need permission from the creator and can ignore any obligations in a license. In this case, Github is not complying with any open source licensing requirements because it believes that its copies are protected by fair use. Since it does not need permission, it does not need to comply with license requirements (although sometimes there are good reasons to comply with the social intent of licenses even if they are not legally binding…). It has said as much, although it (and its parent company Microsoft) has declined to elaborate further.

I read Butterick as implying that Github and Microsoft’s silence on the details of its fair use claim means that the claim itself is weak: “Why couldn’t Microsoft produce any legal authority for its position? Because [Kuhn and the Software Freedom Conservancy] is correct: there isn’t any.”

I don’t think that characterization is fair. Even if they believe that their claim is strong, Github cannot assume that it is so strong as to avoid litigation over the issue (see, e.g. the existence of the Github Copilot investigation website itself). They have every reason to avoid pre-litigating the fair use issue via blog post and press release, keeping their powder dry until real litigation.

Kuhn has a more nuanced (and correct, as far as I’m concerned) take on how to interpret the questions: “In fact, these areas are so substantially novel that almost every issue has no definitive answers”. While it is totally reasonable to push back on any claims that the law around this question is settled in Github’s favor (Kuhn, again, “We should simply ignore GitHub’s risible claim that the “fair use question” on machine learning is settled.”), that is very different than suggesting that it is settled against Github.

How will this all shake out? It’s hard to say. Google scanned all the books in order to create search and analytics tools, claiming that their copies were protected by fair use. They were sued by The Authors Guild in the Second Circuit. Google won that case. Is scanning books to create search and analytics tools the same as scanning code to create AI-powered autocomplete? In some ways yes? In other ways no?

Google also won a case before the Supreme Court where they relied on fair use to copy API calls. But TVEyes lost a case where they attempted to rely on fair use in recording all television broadcasts in order to make it easy to find and provide clips. And the Supreme Court is currently considering a case involving Warhol paintings of Prince that could change fair use in unexpected ways. As Kuhn noted, we’re in a place of novel questions with no definitive answers.

What About the ToS?

As Franklin Graves pointed out, it’s also possible that Github’s Terms of Service allow it to use anything in any repo to build Copilot without worrying about additional copyright permissions. If that’s the case, they won’t even need to get to the fair use part of the argument. Of course, there are probably good reasons that Github is not working hard to publicize the fact that their ToS might give them lots of room when it comes to making use of user uploads to the site.

Where Does That Leave Things?

To start with, I think it is responsible for advocates to get out ahead of things like this. As Kuhn points out:

“As such, we should not overestimate the likelihood that these new systems will both accelerate proprietary software development, while we simultaneously fail to prevent copylefted software from enabling that activity. The former may not come to pass, so we should not unduly fret about the latter, lest we misdirect resources. In short, AI is usually slow-moving, and produces incremental change far more often than it produces radical change. The problem is thus not imminent nor the damage irreversible. However, we must respond deliberately with all due celerity — and begin that work immediately.”

At the same time, I’m not convinced that Copilot is a problem. Is it possible that a future version of Copilot would starve open source software of its community, or allow people to effectively rebuild open source code outside of the scope of the original license? It is, but it seems like that version of Copilot would be meaningfully different from the current version in ways that feel hard to anticipate. Today’s Copilot feels more like a fast lane to possibly-useful Stack Overflow answers than an index that can provide unattributed snippets of all open source software.

As it is, the acute threat Copilot presents to open source software today feels relatively modest. And the benefits could be real. There are uses of today’s Copilot that could make it easier for more people to get into coding - even open source coding. Sometimes the answer of a talented 12-year-old is exactly what you need to get over the hump.

Of course, Github can be right about fair use AND Copilot can be useful AND it would still be quite reasonable to conclude that you want to pull your code from Github. That’s true even if, as Butterick points out, Github being right about fair use means that code anywhere on the internet could be included in future versions of Copilot.

I’m glad that the Software Freedom Conservancy is getting out ahead of this and taking the time to be thoughtful about what it means. I’m also curious to see if Butterick ends up challenging things in a way that directly tests the fair use questions.

Finally, this entire discussion may also end up being a good example of why copyright is not the best tool to use against concerns about ML dataset building. Looking to copyright for solutions has the potential to stretch copyright law in strange directions, cause unexpected side effects, and misaddress the thing you really care about. That is something that I am always wary of, and a prior that informs my analysis here. Of course, Amanda Levendowski makes precisely the opposite argument in her article Resisting Face Surveillance with Copyright Law.

Image: Ancient Rome from the Met’s open access collection.

June 11, 2022Michael Weinberg

Lincoln Hand Shifter Knob

gif of Lincoln hand shifter as installed

In the interest of celebrating the weirdness of open data, I want to share a quick project that exists because of open data: Abraham Lincoln’s left hand as the shifter knob of a 1995 Mazda truck.

The whole thing was pretty straightforward. In fact, the hardest part was probably finding the right shifter knob adapter for the truck. All that was required was:

Download the Lincoln hand scans from the Smithsonian open access site.
Use Tinkercad to put a hole in the back of the hand.
3D print the hand and use epoxy to attach the adapter.

spinning gif of combined version

Install it.

image of Lincoln hand shifter as installed
