Last week the GLAM-E Lab released a new report, “Are AI Bots Knocking Cultural Heritage Offline?” The short answer seems to be “yes”. The longer answer fills a report.
The report exists because we started seeing one-off accounts from online cultural heritage collections that swarms of bots were knocking the collections offline. The bots were overwhelming the sites as they scraped them for data to include in the datasets used to train AI models. The goal of the report is to start to understand whether these stories were outliers, or just the early rumblings of something bigger.
After talking to dozens of cultural institutions around the world, it is pretty clear that the early descriptions were the early rumblings of something bigger. The online collections are starting to strain, and things might get worse before they get better.
While this is bad, I do think there is some room for optimism. In the medium- to long-term, it is in everyone’s interest to keep these collections online. The entities scraping the collections want them to remain available so that they can keep scraping them, and the entities that support the collections want them to remain available because it is part of their mission. The current practice is to send large numbers of bots in short periods of time, which creates an overwhelming amount of traffic. That same traffic could easily be spread out over a longer, sustainable period of time. As the set of players creating these datasets stabilizes, it is not hard to imagine incentives aligning to adopt some sort of crawl-delay standard.
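To make the crawl-delay idea a bit more concrete, here is a minimal Python sketch of what honoring a delay could look like from the scraper’s side. It is an illustration under assumptions, not something taken from the report: the robots.txt contents, the “ExampleBot” name, the example.org URLs, and the helper functions are all hypothetical. It relies on `Crawl-delay`, a non-standard robots.txt directive that some crawlers honor, which Python’s standard `urllib.robotparser` module can read.

```python
import urllib.robotparser

# Hypothetical robots.txt that a collection might publish.
# Crawl-delay is a non-standard directive; not every crawler honors it.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /admin/
"""


def load_policy(robots_txt):
    """Parse robots.txt text into a RobotFileParser (no network access)."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_schedule(parser, user_agent, urls, default_delay=1.0):
    """Return (url, seconds_to_wait_before_fetching) pairs for allowed URLs.

    The first allowed URL has no wait; each later one waits for the
    site's requested crawl delay (or default_delay if none is published).
    """
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        delay = default_delay  # assumption: fall back to a modest pause

    schedule = []
    for url in urls:
        if parser.can_fetch(user_agent, url):
            wait = 0 if not schedule else delay
            schedule.append((url, wait))
    return schedule
```

A crawler would then sleep for each wait before fetching the matching URL, for example by calling `time.sleep(wait)` in a loop over `polite_schedule(load_policy(ROBOTS_TXT), "ExampleBot", urls)`. The point is only that the same set of pages gets requested, just spread over a longer window.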
I also think this is an interesting problem because it is (at least conceptually) severable from more complex debates around the relationship between these collections and generative AI more broadly. Those debates – which the report describes broadly as “policy” debates – center on the nature of open collections, what it means for the commons to be integrated into models, and how (or if) copyright is relevant to that conversation. In contrast, this problem is more technical in nature: how do we keep collections online and available in a sustainable way?
The report has a lot more detail on what collections are experiencing right now. I hope it acts as a useful snapshot of a moment in time that can be used as a reference point in the future.