With the avalanche of interest in generative AI and the large language models (LLMs) that power it, stories have resurfaced about the dirty roots of the training process. As a 2021 working paper by Jack Bandy and Nicholas Vincent shows, OpenAI’s GPT models, Google’s BERT and its variants, and many other foundational LLMs all carry “documentation debt” to BookCorpus, […]