Your Books Trained Those Large Language Models

As a number of lawsuits seeking class-action status already allege, the core datasets on which all of the major large language models have been trained rely on stolen, copyrighted books. As a new article in The Atlantic by Alex Reisner puts it, “Pirated books are being used as inputs for computer programs that are changing how we read, learn, and communicate. The future promised by AI is written with stolen words.” BookCorpus was stolen from Smashwords authors. Books3 is a collection of between 150,000 and 190,000 books from established publishers and authors. The Atlantic piece extracts the […]