"It’s not realistic for a corporation, even a multinational titan with a large s...

TFNA · 2026-05-23T01:38:33 1779500313

It is precisely because Anna has such incredible breadth that corporations should use those materials to train their LLMs; it is a public good. I work in an area-studies field, and some years ago my colleagues and I resolved to scan and OCR our entire departmental libraries and upload the books to the shadow libraries, copyright be damned. When these corporations then trained their LLMs on the shadow libraries, the LLMs 1) automatically learned several minority languages, and 2) learned quite a bit about parts of the world that were little represented on the internet.

So for the first time, peoples who had generally been left out of the internet age are now able to run queries in their own languages, and people elsewhere now also get to draw on information from these parts of the world. This would never realistically have happened under any copyright-respecting project that painstakingly sought author or publisher permission; there will simply never be enough manpower or funding for that.