Look, I get your perspective, but zooming out, there's context here that nobody's mentioning, and the thread has deteriorated into name-calling instead of looking for insight.
In theory, a training pass needs exactly one readthrough of the input data, and we know of approaches that achieve that, from well-trodden n-gram models to the (so far hypothetical) large Lempel-Ziv models. Viewed that way, most modern training methods are extremely wasteful: Transformers, Mamba, RWKV, etc. trade time for space, spending far more compute than a single counting pass would in order to end up with relatively small models, and it's an expensive tradeoff.
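To make the one-readthrough point concrete, here's a minimal sketch (mine, not anything from the thread) of a bigram count model in Python: the entire "training pass" is a single loop over the tokens, with no gradients anywhere. The function names and toy corpus are made up for illustration.

```python
# Minimal single-pass "training": a bigram count model.
# Purely illustrative; names and the toy corpus are invented for this sketch.
from collections import Counter, defaultdict

def train_bigram(tokens):
    """One readthrough of the data is the whole training pass."""
    counts = defaultdict(Counter)
    for prev, curr in zip(tokens, tokens[1:]):
        counts[prev][curr] += 1
    return counts

def predict_next(counts, prev):
    """Most frequent continuation observed after `prev`, if any."""
    return counts[prev].most_common(1)[0][0] if counts[prev] else None

tokens = "the cat sat on the mat because the cat was tired".split()
model = train_bigram(tokens)
print(predict_next(model, "the"))  # -> 'cat' (seen twice, vs. 'mat' once)
```

A real n-gram or Lempel-Ziv model would be far more sophisticated than this, but the structural point stands: the data gets read once and counted, not repeatedly chewed through by gradient descent.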
From that perspective, we should expect somebody to eventually demonstrate that the Transformer paradigm sucks. Mamba and RWKV are good examples of adapting old ideas about RNNs to take advantage of GPUs, but they're still stuck on the idea that having a GPU perform lots of gradient descent is good. If you want to critique something, critique the gradient worship!
I swear, it’s like whenever Chinese folks do anything the rest of the blogosphere goes into a panic. I’m not going to insult anybody directly but I’m so fucking tired of mathlessness.
Also, point of order: Meta open-sourced Llama so that their employees would stop using BitTorrent to leak it! Not to “keep the rabble quiet” but to appease their own developers.