Not a rebuttal, but some thoughts for consideration
* Calling out that they all use the same underlying 'asset' (i.e. the transformer architecture) as a point of concern is a bit like calling out "they all speak the same language - English".
This point is a bit weaker than you may think; *nearly* any idea that can be represented in the spoken language can *potentially* be discovered by transformers. So, on a certain level, it's a weakness only in the absolute sense, not a practical sense.
* Your point about coming from the same training corpus is likewise weaker than you may think, since the training corpus includes *radically different points of view* that bring totally different methods of evaluation and vectors of thought to the table. These different methods of thought can be applied and alchemized together in different ways, in different orders, with different levels of emphasis, to generate wildly different final analyses, and it's the domain- / model-specific 'secret sauce' that determines how good those alchemizations actually are. In this way, I feel your point is perhaps being overstated.
* However, I think if we combine those two points of concern, they do hint (indirectly) at something real that supports your theorem!
Late 19th and early 20th century economic theory, in the English-speaking world, was a vast and robust field of research with many different schools that had a wealth of nuances and substantive differences, sort of like how I described the modern LLM space above, yes?
And yet, there were powerful and subsequently transformative schools of economic thought that were *only being expressed in German*: the Austrian school stuff of von Mises and Hayek. Those ideas *could* have been expressed in English, but no one was doing so. They had to be translated and carried to the English-speaking world before they could resonate.
(And for any ML folks reading this: yes, I’m deliberately oversimplifying.
I’m glossing over major architectural differences between models, overstating the theoretical discoverability of ideas in transformer space, and ignoring the very non-uniform representation of viewpoints in the training distribution. All true.
But none of those caveats really change the overall point. =p)
This is a brilliant nuance, especially the Austrian School analogy.
I concede that the capability space of Transformers (like the English language) is theoretically infinite. My concern is the incentive structure.
To use your metaphor: The library contains the radical German texts (the training data). But if every model is alchemized (fine-tuned) to prioritize the safety and consensus of the current establishment, the Austrian insight never surfaces.
We have a consensus-seeking mechanism that actively suppresses heterodox thought. In markets, that is the definition of a crowded trade.
Cool; glad my reasoning resonated. I think you're calling a real -thing-, but the underlying mechanics have more to do with exactly what you said: 'actively suppress heterodox thought', via a mixture of heavy-handed training biases and the radically *non*-uniform distribution of competing ideas in the training corpus.
Pulling back to the Austrian economics stuff: a *perfect* LLM, with zero training biases, but educated only on the work of the English-language economics world, would never come up with an equivalent of the Austrian school ideas, because they were all *orthogonal* to the training corpus.
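The "orthogonal to the corpus" point has a simple linear-algebra analogue. A minimal sketch (purely illustrative; the vectors, labels, and dimensions here are hypothetical, not anything from an actual model): if we treat the corpus as the subspace spanned by the ideas it contains, then any idea orthogonal to that subspace has zero recoverable component, no matter how well you combine the corpus.

```python
import numpy as np

# The "English-language corpus": ideas spanning only the first two axes.
# (Hypothetical labels, just for the analogy.)
corpus = np.array([
    [1.0, 0.0, 0.0],  # e.g. classical price theory
    [0.0, 1.0, 0.0],  # e.g. early marginalism
])

# An "Austrian school" idea living entirely along a third, orthogonal axis.
austrian = np.array([0.0, 0.0, 1.0])

# Best least-squares reconstruction of the new idea from the corpus:
# solve corpus.T @ coeffs ~= austrian for the mixing coefficients.
coeffs, *_ = np.linalg.lstsq(corpus.T, austrian, rcond=None)
reconstruction = corpus.T @ coeffs

print(reconstruction)                   # [0. 0. 0.]
print(np.allclose(reconstruction, 0))   # True: nothing of the idea survives
```

However cleverly you weight and alchemize the corpus vectors, the reconstruction of an orthogonal idea is the zero vector, which is the toy version of the claim above.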
This article is a breath of fresh air in a crowded space.
Pure joy to read those who go against the flow.
Thanks a lot; re-read it 3 times.
thanks!