Greedy AI models

Meta said the 15 trillion tokens on which Llama 3 was trained came from “publicly available sources.” Which sources? Meta told The Verge’s Alex Heath that the training set didn’t include Meta user data, but it didn’t give much more in the way of specifics.

It did mention that the training set includes AI-generated, or synthetic, data: “we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3.” There are plenty of known issues with synthetic data, foremost of which is that it can exacerbate existing problems with AI models: a model trained on AI output is liable to spit out a more concentrated version of whatever garbage it ingested.
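To make the quoted approach concrete: Meta describes using one model's judgments to train cheap text-quality classifiers that then filter the corpus. The sketch below is a hypothetical illustration of that pattern, not Meta's actual pipeline; `llm_quality_label` is a stand-in heuristic where a real system would call Llama 2, and the document list and classifier choice are invented for the example.

```python
# Hypothetical sketch of LLM-labeled data training a text-quality classifier.
# llm_quality_label() fakes an LLM judgment with a crude heuristic so the
# example runs offline; a real pipeline would query the larger model here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def llm_quality_label(doc: str) -> int:
    """Stand-in for an LLM quality judgment: 1 = keep, 0 = discard."""
    return int(len(doc.split()) > 5 and "click here" not in doc.lower())

docs = [
    "A detailed explanation of transformer attention mechanisms.",
    "CLICK HERE to win a free prize now!!!",
    "The study measured training loss across four model sizes.",
    "buy now buy now",
    "An overview of tokenization strategies for multilingual corpora.",
    "free free free click here",
]
# The expensive model labels a small sample...
labels = [llm_quality_label(d) for d in docs]

# ...and a lightweight classifier learns to imitate those labels, so it
# can filter the full corpus far more cheaply than the LLM could.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(docs, labels)

print(clf.predict(["A thorough survey of reinforcement learning methods."]))
```

The appeal of the pattern is cost: the large model is run once over a sample, and the distilled classifier does the corpus-scale filtering. The risk the paragraph above points to is also visible here: whatever biases the labeling model has are baked directly into what the filter keeps.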

AI companies are turning to such data because there’s not enough good, public data on the entire internet to train their increasingly greedy AI models. (Meta had reportedly floated buying a publisher like Simon & Schuster to satisfy its insatiable data needs.)
