Generative AI is a technology with the potential to reshape society, but it also raises important concerns regarding copyright infringement. In a recent development, authors Paul Tremblay and Mona Awad have filed a class action lawsuit against OpenAI, the company behind ChatGPT. As TorrentFreak reported, the authors claim that their copyrighted works were used without permission to train ChatGPT, and they allege both copyright infringement and violations of the DMCA.
The evidence offered in support of the claim appears straightforward. Tremblay and Awad assert that they never granted OpenAI permission to use their works, yet ChatGPT can provide accurate summaries of their writings, which in their view suggests that the books themselves were part of its training data. Although OpenAI has not disclosed the specific datasets used to train ChatGPT, an older paper references “Books1” and “Books2” as sources, with the former containing approximately 63,000 titles and the latter around 294,000 titles.
However, the authors contend that legitimate databases with such extensive collections of books do not exist. They argue that OpenAI likely relied on pirated material from shadow library websites such as Library Genesis (LibGen), Z-Library (formerly B-ok), Sci-Hub, and Bibliotik. These sites are notorious for aggregating books and making them available for bulk download via torrents.
The complaint states, “Indeed, when ChatGPT is prompted, ChatGPT generates summaries of Plaintiffs’ copyrighted works—something only possible if ChatGPT was trained on Plaintiffs’ copyrighted works.”
Based on these observations, the complaint alleges that OpenAI infringed the authors’ copyrights. The plaintiffs are seeking statutory damages, which could amount to as much as $150,000 per work. They also seek damages for the alleged removal of copyright management information, which would violate the DMCA.
While similar copyright claims have arisen in the past, this lawsuit stands out for its allegation that OpenAI used pirate websites as a source of training data. Notably, Z-Library, a shadow library housing millions of pirated books, is currently the target of a criminal prosecution by the U.S. Department of Justice.
How copyright issues around AI will be resolved remains uncertain. Governments worldwide are adopting different approaches, and the U.S. Congress has indicated a cautious stance. Rights holders, however, are actively pursuing their interests and are unlikely to remain passive.
Although there is no direct evidence that OpenAI used pirate sites to train ChatGPT, some AI projects are known to have used pirated material in the past, as highlighted in a comprehensive summary from Search Engine Journal. Media reports have also described AI models developed by Google and Facebook that were trained on the “C4 dataset,” which included content from Z-Library and other pirate sites.
This lawsuit is likely to attract significant attention from AI enthusiasts and rights holders alike. It may also require OpenAI to disclose aspects of its training data, which would be of great interest in its own right.
Even if it is established that ChatGPT was trained using pirated books, the court would still need to determine whether such usage constitutes copyright infringement. Some experts argue that this type of AI training could be considered fair use.
Fair use protects transformative uses of copyrighted works that do not compete directly with the original content. According to several experts, this defense could potentially apply to AI training.
Overall, the outcome of this lawsuit could help shape the future of AI and copyright law, with significant implications for both technology developers and content creators.