4 Comments

Clear and insightful, thanks a lot.

If synthetic data were the key to AI, then the synthesizer would already be a sufficiently advanced AI. There is no free lunch here: conservation principles rule out any gain from synthetic data beyond the information already present in the initial dataset used to train the synthesizer. This follows from basic computational principles. Any irreversible transformation applied to the source dataset can only reduce its information content.
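The conservation claim above can be illustrated with a small entropy calculation (a minimal sketch, not part of the original comment): under any many-to-one, i.e. irreversible, transformation, the Shannon entropy of a dataset's empirical distribution can only stay the same or decrease.

```python
from collections import Counter
import math

def entropy(xs):
    """Shannon entropy (in bits) of the empirical distribution of xs."""
    counts = Counter(xs)
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# A toy source dataset: four equally likely symbols, so 2 bits of entropy.
source = [0, 1, 2, 3] * 100

# An irreversible (many-to-one) transformation, e.g. x mod 2,
# standing in for any lossy processing step.
transformed = [x % 2 for x in source]

print(entropy(source))       # 2.0 bits
print(entropy(transformed))  # 1.0 bit: information only went down
```

This is the data processing inequality in miniature: no downstream transformation, including a synthetic-data generator, can add information about the source that the source did not already contain.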

This all makes sense if you assume we stay in the current LLM paradigm. However, there's no fundamental reason that building a better AI should continue to require scaling data at that rate. We have the existence proof of humans, who learn from orders of magnitude less data.

We also have, to my knowledge at least, only a limited fundamental understanding of what "higher quality" means for training data. I'm very curious to see whether we can train an AI to superhuman intelligence without access to any superhuman data, which by definition doesn't yet exist.

Maybe LLM training is really what they mean by "This conversation may be recorded for quality or training purposes"? Someday soon: OpenAI makes a free video conferencing solution available, so long as they're allowed to train off your conference....
