Are AI Bots Knocking Cultural Heritage Offline?

Last week the GLAM-E Lab released a new report: Are AI Bots Knocking Cultural Heritage Offline? The short answer seems to be “yes”. The longer answer fills the report.

The report exists because we started seeing one-off accounts from online cultural heritage collections that swarms of bots were knocking the collections offline. The bots were overwhelming the sites as they scraped them for data to include in the datasets used to train AI models. The goal of the report is to start to understand whether these stories were outliers, or just the early rumblings of something bigger.

After talking to dozens of cultural institutions around the world, it is pretty clear that those early accounts were indeed the early rumblings of something bigger. The online collections are starting to strain, and things might get worse before they get better.

While this is bad, I do think there is some room for optimism. In the medium to long term, it is in everyone’s interest to keep these collections online. The entities scraping the collections want them to remain available so that they can keep scraping them, and the entities that support the collections want them to remain available because it is part of their mission. The current practice of unleashing swarms of bots in short bursts (thus creating an overwhelming amount of traffic) could easily be spread out over a longer, more sustainable period. As the set of players creating these datasets stabilizes, it is not hard to imagine incentives aligning around some sort of crawl-delay standard.

I also think this is an interesting problem because it is (at least conceptually) severable from more complex debates around the relationship between these collections and generative AI more broadly. Those debates – which the report describes broadly as “policy” debates – center on the nature of open collections, what it means for the commons to be integrated into models, and how (or if) copyright is relevant to that conversation. In contrast, this problem is more technical in nature: how do we keep collections online and available in a sustainable way?

The report has a lot more detail on what collections are experiencing right now. I hope it acts as a useful snapshot of a moment in time that can be used as a reference point in the future.

Does an AI Dataset of Openly Licensed Works Matter?

A team just announced the release of the Common Pile, a large dataset for training large language models (LLMs). Unlike other datasets, Common Pile is built exclusively on “openly licensed text.” On one hand, this is an interesting effort to build a new type of training dataset that illustrates how even the “easy” parts of this process are actually hard. On the other hand, I worry that some people read “openly licensed training dataset” as the equivalent of (or very close to) “LLM free of copyright issues.”

There is an active fight over whether or not training an LLM on data infringes on the copyright in that data. If the activity is protected by fair use (in the US) the license on the work does not matter because the trainer does not need permission from the rightsholder. If the activity is not protected by fair use, the license on the work matters a lot. The only way to train without infringing would be to do so within the scope of a license (open, closed, or otherwise).

In response to this, there are a number of efforts to build openly licensed datasets. The theory is that, if it turns out that training models requires permission from the data rightsholders, open data comes with that permission built in.

There are a few problems, or at least complications, with this approach. I am not suggesting that the Common Pile team in particular has failed to wrestle with these. Instead, this is a generalized list of problems that have come up across conversations around open training datasets.

  1. Openly licensed does not mean free from restrictions. Very few open licenses are public domain dedications. At a minimum, most of them require some sort of attribution to the original creator.
  2. It is not clear what attribution means in the case of LLMs or their outputs. At least conceptually, including attribution information in a dataset is fairly straightforward - it’s just metadata. That is less straightforward when it comes to the models themselves. What does it mean to give attribution to the 2 trillion tokens used to train a model? Attribution is even harder when it comes to outputs. What’s the best way for an output to provide attribution for the 2 trillion tokens involved in training the model that produced it? And these are just the easy cases. These questions get harder the more you think about them.
  3. Do we even want to import attribution practices into training datasets? The core concept of attribution has proven to be incredibly useful across a range of open communities. However, it may not be the right solution to every problem. I keep coming back to a post by Kate Downing that points out that attribution requirements for open source software do not necessarily scale very well. If attribution already struggles at the scale of open source software, do we want to bring it into the exponentially larger LLM context? Maybe? But also maybe not? It strikes me as a question that deserves more thought before we just assume it makes sense.

If there is not a way to comply with the requirements of an open license, is spending time building an openly licensed training dataset worthwhile?

The answer may very well be “yes”. However, I worry that there is a tendency to jump directly from “this is an openly licensed dataset” to “we can use it without worrying about copyright”. At a minimum, there is value in articulating how an openly licensed dataset would interact with a goal of creating a model free from copyright-based constraints. Of course, I’m always going to say that people need to spend more time thinking about the copyright aspects of their work…

Coming back to the Common Pile paper specifically, as I mentioned at the top, it does a great job of showing how even easy things are hard. Faced with the challenge of building an openly licensed dataset, it is intuitive to start with platforms that host a lot of openly licensed content. Section 2 of the paper is all about the traps waiting for you on those platforms. What about people who post things under an open license when they are not the rightsholders? Which licenses count as open? What about collection-level licenses that do not reflect the licensing of the included works?

This stuff is hard enough to do by hand that I wrote a whole blog post about clearing a single image from a training dataset. What should it mean when someone makes the inevitable mistakes at scale? If the dataset has 2 billion tokens, how many not-actually-openly-licensed tokens should spoil the set? 10? 10 million? What if the 10 are owned by Disney?

Hero Image: A portion of “Kriegsanleihe” from the Smithsonian Open Access Collection

Pi-Powered Berlin BVG Alerts

Moving from NYC to Berlin gave me an excuse to update my old Pi-Powered MTA Subway Alerts project for the BVG. Now, as then, the goal of the project is to answer the question “if I leave my house now, how long will I have to wait for my subway train?”. In this case, though, instead of just answering that question for the subway, it also answers it for trams.

The full repo is available here.

The project uses a Raspberry Pi Zero to pull the BVG real-time arrival data and display it with NeoPixels. The NeoPixels give you an indication of how far away trains are from the station. Importantly, the alerts are not based on the absolute time until the train arrives at the station (“A train will arrive at the station in 5 minutes”). Instead, the alerts are aware of how long it takes to walk to the station from my apartment and are therefore based on the wait time once you get there (“If you leave now and walk to the station, there will be a train arriving at the station 5 minutes after you get there.”).

[Image: indicator lights, with an LED strip for each line and labels explaining what some of the lights represent]

For example, the strip coming down from the top represents the southbound M10 tram. The light at the bottom of that strip (the light closest to the center) will be on when the tram is coming “now,” with “now” being defined as “if you leave the apartment now and start walking towards the station, the tram will be at the platform when you get there.” Similarly, “2 minutes” means “you can wait 2 minutes to leave” or “you will wait at the station for 2 minutes if you leave now.”

Everything is basically the same as the MTA version of the project, except that I am using a strip of LEDs instead of individually soldering them. This is much easier!

Install NeoPixels on the Pi

There are some tricks to getting the NeoPixels working on the Pi. Here are the steps, which are spread across a few Adafruit explainers:

  1. Install the Blinka library to be able to use CircuitPython: https://learn.adafruit.com/circuitpython-on-raspberrypi-linux/installing-circuitpython-on-raspberry-pi (You need to activate the virtual environment every time: source env/bin/activate)

  2. Install the NeoPixel library: https://learn.adafruit.com/neopixels-on-raspberry-pi/python-usage

  3. Do the things required to run the NeoPixel library with sudo: https://learn.adafruit.com/python-virtual-environment-usage-on-raspberry-pi/usage-with-sudo (sudo -E env PATH=$PATH python3 neo_test.py)

  4. Make it run at startup: https://learn.adafruit.com/python-virtual-environment-usage-on-raspberry-pi/automatically-running-at-boot
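
Once those steps are done, it is worth a quick sanity check before wiring anything into the project. Here is a minimal sketch along the lines of the neo_test.py mentioned above; it assumes the strip’s data line is on GPIO 18 and a 30-pixel strip, so adjust both for your wiring (and run it with sudo, per step 3):

import time

import board
import neopixel

# Assumes the data line is on GPIO 18 and a 30-pixel strip; adjust for your wiring.
number_of_pixels = 30
pixels = neopixel.NeoPixel(board.D18, number_of_pixels, brightness=0.2, auto_write=True)

# Light the first three pixels red, green, and blue for a few seconds...
pixels[0] = (255, 0, 0)
pixels[1] = (0, 255, 0)
pixels[2] = (0, 0, 255)
time.sleep(5)

# ...then turn everything back off.
pixels.fill((0, 0, 0))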

The Code

The code starts with a bunch of settings, including the station you are pulling data about, how long it takes to get from wherever you are to the platforms, and lists for the lines you are tracking.
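
As a rough illustration, the top of the script covers things like this (the values and some names here are placeholders of my own, not the exact contents of the repo; tram_walk_time and the u5_* lists are used later, but station_id, ubahn_walk_time, and m10_southbound are assumed names):

# Which station to pull departures for (placeholder id - use your own stop's id)
station_id = '900000000000'

# Walk times from the apartment to the platforms, in minutes
tram_walk_time = 6
ubahn_walk_time = 8   # assumed name

# One list per tracked line and direction; grabber() fills these each cycle
u5_eastbound = []
u5_westbound = []
m10_southbound = []   # assumed name for the tram list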

grabber()

After that, the code is basically two functions. grabber() gets all of the departure times related to the lines and puts them in the lists. Before putting each departure time in the list, it adjusts it based on the travel time. For example, if your tram_walk_time is 6 and a tram is scheduled to leave in 10 minutes, grabber() will add it to the list as 4 because, by the time you walk to the station, the train will be leaving in 4 minutes.
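
The fetch itself is conceptually simple. Something along these lines would work, although I am assuming the community-run BVG REST API here; the endpoint, parameters, and response shape may not match what the repo actually uses:

import requests

# Hypothetical request - swap in whatever endpoint or client the repo actually uses.
url = f'https://v6.bvg.transport.rest/stops/{station_id}/departures'
response = requests.get(url, params={'duration': 30})  # departures in the next 30 minutes
departures = response.json().get('departures', [])

# Each departure then falls through the per-line if statements shown below.
for i in departures:
    if i['line']['name'] == 'U5':
        ...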

If you are customizing this, you will need to update all of the if statements that parse the train info so they are pulling data for the correct lines and directions. For example, here is the if statement for the U5:

if i['line']['name'] == 'U5':
     if i['direction'] == 'Hönow':
          u5_eastbound.append(get_modified_departure_time(i['when'], i['line']['productName']))
     elif i['direction'] in ('S+U Hauptbahnhof', 'Hauptbahnhof'):
          u5_westbound.append(get_modified_departure_time(i['when'], i['line']['productName']))
     else:
          error_direction = i['direction']
          print(f'unexpected U5 direction: {error_direction}')

First it finds all of the entries for the U5:

if i['line']['name'] == 'U5':

Then it looks for trains in the direction of Hönow:

if i['direction'] == 'Hönow':

Then it runs the get_modified_departure_time() function to get the modified departure time (the actual departure time, adjusted by how long it takes to get to the station) and appends it to the u5_eastbound list created at the top of the script.
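
A rough sketch of what that adjustment might look like, assuming the feed’s 'when' field is an ISO 8601 timestamp and that the product name lets you tell trams apart from U-Bahn trains (the exact strings and settings names here are assumptions, not the repo’s actual implementation):

from datetime import datetime, timezone

def get_modified_departure_time(when, product_name):
    # Minutes until the vehicle actually leaves the station
    departure = datetime.fromisoformat(when)
    minutes_away = (departure - datetime.now(timezone.utc)).total_seconds() / 60

    # Pick the walk time based on whether this is a tram or a train
    # ('tram' is an assumed value - check what your feed actually returns)
    if 'tram' in product_name.lower():
        walk_time = tram_walk_time
    else:
        walk_time = ubahn_walk_time

    # Minutes you would wait on the platform after walking over
    return round(minutes_away - walk_time)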

The same thing happens for trains headed towards Hauptbahnhof, with an error message printed if the direction is something unexpected.

Once grabber() is done, each of the line lists is full of modified times for upcoming trains.

lighter()

The second function, lighter(), uses the lists to light up the NeoPixels. The logic for which lights turn on based on the train time is in this block of code:

for i in arrival_list:
     if 0 <= i <= 1:
          #light the corresponding light
          pixels[light_1] = light_color
          #remove the light from the list so it does not go black
          if light_1 in light_list: light_list.remove(light_1)
     elif 2 <= i <= 3:
          pixels[light_2] = light_color
          if light_2 in light_list: light_list.remove(light_2)
     elif 4 <= i <= 7:
          pixels[light_3] = light_color
          if light_3 in light_list: light_list.remove(light_3)
     elif 8 <= i <= 12:
          pixels[light_4] = light_color
          if light_4 in light_list: light_list.remove(light_4)
     elif 13 <= i <= 20:
          pixels[light_5] = light_color
          if light_5 in light_list: light_list.remove(light_5)
     else:
          pass 

You can set the bands to be whatever you want by changing the values on the elif lines. For example:

elif 2 <= i <= 3:
     pixels[light_2] = light_color
     if light_2 in light_list: light_list.remove(light_2)

means that if the train arrival time is between 2 and 3 minutes away (elif 2 <= i <= 3:), the second light for the line will turn on in the appropriate color (pixels[light_2] = light_color). You could change elif 2 <= i <= 3: to elif 1 <= i <= 10: if you wanted a wider band or whatever.

The Loop

Now that the functions are set, the script just loops.

First it checks the current time:

now_time = datetime.now()

I only need the lights running during the day, so it then checks to see if the current time is between 8am and 10pm:

if 8 <= now_time.hour < 22:

If it is, it runs grabber() and then lighter() for each of the lines. In addition to giving lighter() the line argument, it identifies which actual pixel in the strip is the “first,” “second,” etc. pixel for that line. The strip is continuous, so the 15th pixel in an absolute sense might be the first pixel in the U5 westbound set of lights.

lighter(u5_westbound, 'u5', 15, 16, 17, 18, 19)

If it is not during the day, the pixels all turn off:

else:	
     #turn them off 
     for i in range(number_of_pixels):
          pixels[i] = (0,0,0)
     print('lights out')

Finally, the script waits for 10 seconds before doing it all again:

time.sleep(10)

That seems long enough to keep the data reasonably accurate without hammering the BVG servers.
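
Put together, the main loop looks roughly like this (a condensed sketch building on the pieces above; the real script calls lighter() once per line and direction, each with its own pixel positions):

while True:
    now_time = datetime.now()

    if 8 <= now_time.hour < 22:
        grabber()
        # one call per line/direction, with that line's pixel positions
        lighter(u5_westbound, 'u5', 15, 16, 17, 18, 19)
        # ...and so on for the other lines...
    else:
        # outside daytime hours, turn everything off
        for i in range(number_of_pixels):
            pixels[i] = (0, 0, 0)
        print('lights out')

    time.sleep(10)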

New Open GLAM Toolkit & Open GLAM Survey from the GLAM-E Lab

This post originally appeared on the Engelberg Center blog

Today the GLAM-E Lab, a collaborative project between the Engelberg Center and the University of Exeter (UK), is releasing a number of tools and resources for the open GLAM (Galleries, Libraries, Archives, and Museums) community.

First, the GLAM-E Lab has launched an Open GLAM Toolkit! This suite of tools, developed directly with GLAM organizations, can be used by any cultural organization to develop its own access program and release collections for public reuse. The toolkit even includes model templates for internal and external open access policies, for setting up new workflows and website policies.

Second, today the GLAM-E Lab has also launched a website-based version of the Open GLAM Survey. The Survey’s new format makes it much easier to find, explore, and analyze open GLAM organizations around the world than was previously possible via the Google Spreadsheet format.

Third, both of these are only possible because of our collaborators’ engagement. The GLAM-E Lab model is to work directly with GLAM organizations to remove legal barriers to creating open access programs, and convert that work into the standard toolkits that other organizations can use. We set a goal to work with 24 different GLAM organizations by the end of 2024, and we’ve even exceeded that goal!

Finally, all of this work led to the GLAM-E Lab winning Wikimedia UK’s Partnership of the Year Award for 2024!

You can watch our announcement video on YouTube and find more details below on these announcements. Of course, if you or someone else would be interested in working with us in 2025, please let us know!

Open GLAM Toolkit

The Open GLAM Toolkit is built on everything that we have learned from working with GLAM-E Lab collaborators. When used together, the toolkit resources will help cultural organizations identify, prepare, and publish their digital collections for open access using public domain or other machine-readable statements. It includes:

Open GLAM Survey 2.0

Version 2.0 of the Open GLAM Survey brings the survey to a new, more user-friendly interface. You can sort organizations by type, license, and platform used. The new interface also makes it easier for us to expand the survey and keep its data up to date.

We’ve Collaborated with More than 24 Organizations!

The GLAM-E Lab model is simple: work directly with individual organizations to remove legal barriers to open access programs, and turn what we learn during that work into standard tools and documents that organizations of any size can use.

Of course, all of this depends on having organizations that are open to tackling collections management issues with us in the first place. That’s why we are so excited to wrap up 2024 having worked with over 24 organizations on rights-related issues and open access questions. You can find the list of collaborators on the GLAM-E site.

Hero image: Gereedschappen voor het vervaardigen van een mezzotint from the Rijksmuseum collection.

What Does an Open Source Hardware Company Owe The Community When it Walks Away?

This week Prusa Research, once one of the most prominent commercial members of the open source hardware community, announced its latest 3D printer. The printer is decidedly not open source.

That’s fine? My support of, and interest in, open source hardware is not religious. I think open source hardware can be an incredibly effective tool to achieve a number of goals. But no tool is fit for all purposes. If circumstances change, and open source hardware no longer makes sense, people and companies should be allowed to change their strategies as long as they are clear that is what they are doing. Hackaday does a good job of covering the Prusa-specific developments, and Phil has covered other examples (I hesitate to call it a ‘larger trend’ because I don’t think that’s quite right) on Adafruit.

Still, I do believe a company that builds itself on open hardware owes the community an honest reckoning as it walks out the door. Call it one last blast of openness for old time’s sake.

Specifically, I think the company should explain why openness does not work for them anymore. And not just by waving their hands while chanting vaguely about unfair copying or cloning. They should seriously engage with the issue, explaining how their approach was designed, what challenges it faced, and why open strategies were not up to the task of overcoming those challenges.

This discussion and disclosure is not a punishment for walking away from open, or an opportunity for the community to get a few last licks in. Instead, it is about giving the community more information because that information might be useful to it. Open source hardware is about learning from each other, and how to run an open hardware business is just as important a lesson as how to create an open hardware PCB.

What Could This Look Like?

Last year Průša (the person) raised concerns about the state of open source hardware, framing his post as kicking off a “discussion.” Members of the community took that invitation seriously. I responded with a series of clarifying questions and comments. So did my OSHWA co-board member Thea Flowers, and Phil at Adafruit. Průša is under no obligation to respond to any one of these (me yelling “debate me!” on the internet does not create an obligation on the person to actually respond).

However, kicking off a self-styled discussion, having a bunch of people respond, and then doing . . . nothing does not feel like the most good faith approach to exploring these questions. None of the questions in the response posts were particularly aggressive or merely rhetorical - they were mostly calls for more clarity and specificity in order to inform a more thoughtful discussion.

Without that clarity, we are stuck in a vague space that does not really help anyone understand things better. As the Hackaday article astutely points out:

The company line is that releasing the source for their printers allows competitors to churn out cheap clones of their hardware — but where are they?

Let’s be honest, Bambu didn’t need to copy any of Prusa’s hardware to take their lunch money. You can only protect your edge in the market if you’re ahead of the game to begin with, and if anything, Prusa is currently playing catch-up to the rest of the industry that has moved on to faster designs. The only thing Prusa produces that their competitors are actually able to take advantage of is their slicer, but that’s another story entirely. (And of course, it is still open source, and widely forked.)

If moving from open to closed prevents cheap clones, how does that actually work? That would be useful information to the entire open source hardware community! If it does not prevent cheap clones, why use that as a pretext? Also, useful information to the community!

Feature image: Political Discussion in a Lumber Shanty from the Smithsonian Open Access collection