With the avalanche of interest in generative AI and the large language models that power it, stories have resurfaced showing the process’s dirty roots. As a 2021 working paper by Jack Bandy and Nicholas Vincent shows, OpenAI’s GPT models, Google’s BERT and its variants, and many other foundational LLMs all owe a “documentation debt” to BookCorpus, “a popular text dataset for training large language models.” Compiled in 2014 by researchers at the University of Toronto and MIT, BookCorpus should have been called Stolen from Smashwords. The researchers apparently scraped self-published ebooks posted on Smashwords that were being offered to read […]