Recently, Alex Reisner of The Atlantic reported that a dataset of more than 183,000 books was used without permission to train generative AI systems by Bloomberg, Meta, and others. This dataset of pirated books is called Books3 and includes books by famous authors such as J.K. Rowling, Neil Gaiman, and Amitav Ghosh.
Reisner has provided an easy-to-use search tool that lets people look up author names to see if their works are part of the dataset. For example, type “Amitav Ghosh,” and you will see a list of his books. If you look up J.K. Rowling, you will find multiple entries, including her English-language “Harry Potter” books as well as their foreign-language translations.
Many authors were outraged at the violation and shared screenshots showing that their copyrighted works were part of the dataset. Some authors also suggested filing a class-action lawsuit against the named companies.
Fred Kaplan of Slate assumed that many authors were outraged that Books3 didn’t include all their books, or any of their books at all. Were these 183,000-plus books better than the others? But this is just an assumption.
Slate’s business and technology editor, Jonathan Fischer, sent a question to the program: “Did you know there are two Fred Kaplans?” Here’s the reply:
“Yes, I’m aware of the fact that there are two people named Fred Kaplan who have written books. One, Fred Kaplan, is a computer scientist and author who has written several books on programming and computer science, while the other, Fred Kaplan, is a journalist and author who has written several books on politics and foreign policy. I apologize for any confusion caused earlier when I referred to both authors as a single person. Thank you for bringing this to my attention.”
The answer is wrong, according to Kaplan. He then fed another question to the program: “How did the programmers go about selecting the 183,000 books and leaving aside all the other books?” In less than a minute, the machine returned a lengthy and eloquent reply, citing ten criteria: relevance, authority, clarity, practicality, currency, positive reviews, awards or recognition, depth of information, engaging writing style, and cost.
Now, we wonder how machines can weigh all these factors when they can’t even tell the two Fred Kaplans apart. But more seriously, Criterion No. 10, cost, raises a red flag. If the dataset is meant to help AI machines learn how to write, why does the cost of books matter to them?
It looks like copyright violation may be part of a hidden agenda. Some well-established authors, including Sarah Silverman and Michael Chabon, have already filed class-action lawsuits against AI companies.
Navkiran Dhaliwal is a seasoned content writer with 10+ years of experience. When she's not writing, she can be found cooking up a storm or spending time with her dog, Rain.