Back to One of My Hobbyhorses

September 22, 2023 1:33 p.m.

Start your day with TPM.

While English-language AI is gobbling up much of the online English language almost always without permission, there’s a problem for Danish AI, reports Bloomberg. Apparently, most of the Danish web is under pretty stringent copyright protections. And Danish law makes the kind of recourse-less stealing that Silicon Valley AI companies are getting away with way too hard. Government records and legislation are in the public domain. But that formal Danish is too distant from how people really speak and write to serve the purpose. The solution turns out to be horses.

For reasons that aren’t entirely clear, the Danish web evolved in such a way that a discussion board about horses – heste-nettet.dk – ended up becoming one of the most popular and heavily used forums in the language. It’s mostly about horses, but because it’s so big, it spawned questions and answers and conversations about a whole range of topics. It seems to be kind of a Danish Reddit, only with horses always being the Big Cheeses or Big Men on Campus at the forum. If anything, that seems to understate it. When I visited the site trying to use my trivial grasp of Old English, it still seemed to be heavily about horses and equestrian stuff. The upshot of all this is that Danish AI will likely have a strong bias toward horses and horse-adjacent topics.

Back here in the anglophone world, many publishers are now taking steps to block AI companies from harvesting their content. Here’s a small anecdote about how that’s going. When I found out about this blocking, I thought we should block them too. To anticipate the churlish questions I sometimes get on this front, no, I’m not holding out for some eleven-dollar royalty check. As a practical matter of use-rights or money, I couldn’t care less. But, as a matter of principle, I think we should make some effort given how big a deal I’ve made out of it.

With most kinds of digital scraping, a website publisher can put a kind of digital note in the data that instructs bots not to slurp from that pile of content. For instance, you can tell Google not to scan your site for their search engine. Few people do that, for obvious reasons, but you can if you want. When it comes to the AI folks, though, it’s quite different.
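For the curious, here is roughly what that "digital note" looks like in practice. This is a minimal sketch, assuming the note in question is a robots.txt file, which is the usual convention for this kind of request. The bot name ExampleAIBot and the example.com URL are made up for illustration; Googlebot is Google's real crawler name. The check uses Python's standard-library robots.txt parser.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: one made-up AI crawler is asked to stay out,
# everyone else (including search engines) is welcome.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# What a well-behaved crawler would conclude before fetching a page.
print(allowed(ROBOTS_TXT, "ExampleAIBot", "https://example.com/post"))
print(allowed(ROBOTS_TXT, "Googlebot", "https://example.com/post"))
```

Note that nothing here enforces anything. The file is a request, and it only works if the bot bothers to read it and chooses to comply.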

The publishers who are blocking AI bots from harvesting their sites are having to go to some lengths to block them. Just telling them to skip your site won’t work. Why? Because no one wants to have their content stolen to build AI models that will make other people into billionaires. In other words, the idea that AI is being built on data which the AI makers don’t have permission to use is no longer notional. No one wants it to happen. And the big players are investing a non-trivial amount of effort and expense to block it. It’s the difference between posting a “no solicitation” sign and installing heavy-duty security to prevent people from coming in.

For us, the time and effort was prohibitive. Whatever … it was simply an idea to express some solidarity with the anti-AI, anti-thievery cause. But it does give you a feel for the ethics and standing of that new industry.


Josh Marshall (@joshtpm) is the founder and Editor-in-Chief of TPM.
