4 Comments

Clear and insightful, thanks a lot.

If synthetic data were the key to AI, then the synthesizer would already be a sufficiently advanced AI. There is no free lunch here: conservation principles rule out any gain from synthetic data beyond the information already present in the initial dataset used to train the synthesizer. This follows from basic computational principles. Any irreversible transformation applied to the source dataset can only reduce its information content.
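The conservation claim above can be illustrated with a small entropy calculation (a minimal sketch, not part of the original comment): under any many-to-one, i.e. irreversible, transformation, the Shannon entropy of a dataset's empirical distribution can only stay the same or decrease.

```python
from collections import Counter
import math

def entropy(xs):
    """Shannon entropy (in bits) of the empirical distribution of xs."""
    counts = Counter(xs)
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# A toy source dataset: four equally likely symbols, so 2 bits of entropy.
source = [0, 1, 2, 3] * 100

# An irreversible (many-to-one) transformation, e.g. x mod 2,
# standing in for any lossy processing step.
transformed = [x % 2 for x in source]

print(entropy(source))       # 2.0 bits
print(entropy(transformed))  # 1.0 bit: information only went down
```

This is the data processing inequality in miniature: no downstream transformation, including a synthetic-data generator, can add information about the source that the source did not already contain.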

This all makes sense if you assume we stay in the current LLM paradigm. However, there's no fundamental reason that building a better AI should continue to require scaling data at that rate. We have the existence proof of humans, who learn from orders of magnitude less data.

We also have, to my knowledge at least, only a limited fundamental understanding of what "higher quality" means for training data. I'm very curious to see whether we can train an AI to superhuman intelligence without access to any superhuman data, which by definition doesn't yet exist.

Maybe LLM training is really what they mean by "This conversation may be recorded for quality or training purposes"? Someday soon: OpenAI makes a free video conferencing solution available, so long as they're allowed to train off your conference....
