I mentioned yesterday that Twitter became unstable over the weekend because of evil forces that Elon Musk and crew had to combat (according to Elon Musk, that is). Mainly that’s just BS from a company that has stiffed service providers and cut staff to a level at which it can’t keep the site online anymore. But there likely are more efforts to “scrape” Twitter and other platforms going on right now. (Scraping here means bots that scan through a site and collect copies of its public or non-restricted data.) So there’s some element of truth to Musk’s claims that they’re facing more demands on the site’s capacity. The key is that this only becomes a big problem if you’re running very close to the edge already. But I want to zero in a bit on this point.
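For readers who haven’t seen one, a scraping bot at its simplest is just a script that downloads pages and pulls out the data it wants, queueing up every link it finds so it can copy the next page too. Here’s a purely illustrative sketch of that core step using only Python’s standard library (the sample page and link paths are made up for the example; a real bot would be fetching this HTML over the network, which is exactly the load sites are complaining about):

```python
from html.parser import HTMLParser  # stdlib HTML parser; no scraping library needed


class LinkCollector(HTMLParser):
    """Collects every hyperlink target from a page, the way a crawling bot
    builds its list of further pages to fetch and copy."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# A real bot would download this over HTTP; here it's a hard-coded sample page.
page = '<html><body><a href="/post/1">One</a> <a href="/post/2">Two</a></body></html>'

collector = LinkCollector()
collector.feed(page)
print(collector.links)  # the next pages a crawler would queue up and copy
```

Multiply that loop across millions of pages and thousands of bots and you get the extra “demands on the site’s capacity” Musk is pointing at.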
In a discussion this morning of the changing, buckling web, Axios notes that platforms are “trying to shut their technical gates so others can’t gobble up troves of data for AI models to study.” I noticed elsewhere that Google has changed its terms of service to note that it might use gobbled-up data for its AI engines — though this might cover services like Gmail, for which you’ve already given permission as the ‘cost’ of using their free email service.
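One of the simplest of those “technical gates” is a robots.txt file: a set of published rules telling crawlers which parts of a site they may fetch. As a sketch of how that works — the bot names and rules below are hypothetical, and note that compliance is voluntary, which is much of the problem — Python’s standard library can evaluate such rules directly:

```python
from urllib.robotparser import RobotFileParser  # stdlib robots.txt evaluator

# A hypothetical robots.txt of the kind platforms now publish: turn away
# a named AI crawler entirely while still admitting everyone else.
rules = """
User-agent: HypotheticalAIBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The named AI crawler is barred from the whole site...
print(rp.can_fetch("HypotheticalAIBot", "https://example.com/archive"))  # False
# ...while any other bot is still allowed in.
print(rp.can_fetch("OrdinarySearchBot", "https://example.com/archive"))  # True
```

The catch, of course, is that robots.txt is an honor system: a scraper that ignores it faces no technical barrier, which is why platforms are now reaching for harder gates like logins and rate limits.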
These moves point to the legal grey area which is almost certainly about to become highly, highly contested — and rightly so. Should sites really have to wall off their mountains of text so that AI companies can’t gobble it up and use it to build AI? That makes no sense. If they don’t have permission they shouldn’t be able to do it at all. What legal right do the AI companies have to do that? At the moment they’re just doing it because no one knows how to stop them and the law is unsettled about just what AI “use” means.
It’s especially crazy if they are not only stealing the data but also making it harder for the sites and platforms to remain online, because servicing the data requests of the folks stealing from them eats up those sites’ capacity.
Years ago a related set of questions was hashed out with search engines. Does Google have a right to scan your data and excerpt small portions of it for the purposes of search? The upshot of that process was that yes, they do, mostly because search doesn’t diminish the value of your stuff and search is a social utility. Everybody gains from knowing where everything is. (Needless to say I’m vastly simplifying this and the path for this result was greatly smoothed by the armies of Google lawyers who helped it happen.)
But is AI the same? There are very good reasons to think it’s not. As we discussed in another post, AI engines are consuming vast quantities of visual art for the purpose of training AI engines to create new visual art. In other words, AI engines are gobbling up visual art for the purpose of putting the creators of that kind of art out of business. It’s not precisely the same with text. But it’s not that different either.
As I’ve been making these arguments, some people have assumed that, since my company owns a significant archive of writing going back 20+ years, I’m worried about being replaced by a machine or somehow want to charge for access to our archives. This is absolutely not the case, I assure you. I don’t have any expectation that TPM or I will gain some sort of usage fees or royalties off our back catalog. That thought had never even occurred to me. It’s also not a more general push to create a new path for IP rents for other people. My point is more general and politico-economic.
There’s currently a gold rush to build the big AI machines. And the folks who are doing it are collecting the building blocks for free. Literally. More generally, most of the rest of society is just a passive and, to a real extent, apprehensive observer of the workings of a few AI kingpins in Silicon Valley. I’m a bit of a skeptic about all the dystopian visions of what AI is purportedly going to create. But it’s more than a bit weird hearing the people creating AI say that it presents an existential threat to humanity itself while simultaneously insisting that there be a huge rush to bring it to life right now.
Creating AI is clearly a pretty big deal — something that few of us individually seem to have any real power to affect. It’s also clear that the unimaginable wealth gains that come from it are going to go disproportionately to a very few people. But the building blocks for it are actually owned by everyone else. Lots of it is owned by big corporations; some owned by small corporations; a significant amount is either owned by or at least created by ordinary folks like you and me. So the point about making a big deal over AI hoovering up all the data they don’t own isn’t to get some pennies or residual checks here or there. It’s to give a lot more people a seat at the table in deciding how all this unfolds. It’s to prevent a situation where a few well-placed people get to bogart the collective imaginings and intellection of our world at no cost and then spend the next century selling it back to us at top-shelf prices. Unimaginable fortunes will certainly be made regardless. But it’s worth trying to introduce a bit more complexity, and to add some seats at the table, at the beginning of the process.