{1578} revision 1 modified: 11-29-2023 23:58 gmt

God Help us, let's try to understand AI monosemanticity

Commentary: To some degree, superposition seems like a geometric "hack" invented in the process of optimization to squeeze a great many (largely mutually exclusive) sparse features into a limited number of neurons. GPT-3 has a latent dimension of only 96 * 128 = 12288, and with 96 layers this is only ~1.18 M neurons (*). A fruit fly has 100k neurons (and can't speak). All communication between layers must pass through that 12288-dimensional vector, which is passed through LayerNorm many times (**), so naturally the network learns to take advantage of locally linear subspaces.
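To illustrate the squeeze, here is a minimal numpy sketch (my own toy construction, not GPT-3's actual mechanism): far more sparse features than dimensions, each assigned a random near-orthogonal direction, recoverable with a plain dot-product readout as long as only a few are active at once.

# Toy superposition sketch: store n_features >> d sparse features in a
# d-dimensional vector and read them back by dot product.  Sparsity is
# what keeps the interference between feature directions tolerable.
import numpy as np

rng = np.random.default_rng(0)
d, n_features, k_active = 512, 2048, 4
W = rng.standard_normal((n_features, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # one unit direction per feature

active = rng.choice(n_features, size=k_active, replace=False)
x = W[active].sum(axis=0)                       # superposed d-dimensional representation

scores = W @ x                                  # interference is O(1/sqrt(d)) per feature
recovered = np.flatnonzero(scores > 0.5)        # active features score near 1
print(sorted(active.tolist()), recovered.tolist())   # should match when activity is sparse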

That said, the primate visual system does seem to use superposition, though not via local subspaces; instead, neurons seem to encode multiple variables along shared, roughly linear axes (e.g. global spaces: position and class linearly combined). Those results are a few years old, though, and I suspect that newer work may contest them. The face area seems to do a good job of disentanglement, for example.
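For concreteness, a toy sketch of what I mean by "position and class linearly combined" (my own construction, not a model of any particular dataset): each simulated neuron responds to a random linear mix of a position variable and an identity variable, and a single linear readout recovers both from the population.

# Toy linear mixed-selectivity sketch: neurons mix position and identity
# linearly, and least-squares regression recovers both factors.
import numpy as np

rng = np.random.default_rng(1)
n_neurons, n_trials = 200, 1000
position = rng.uniform(-1, 1, n_trials)           # e.g. stimulus azimuth
identity = rng.integers(0, 2, n_trials) * 2 - 1   # two object classes, coded +/-1

mix = rng.standard_normal((2, n_neurons))         # each neuron mixes both axes
rates = np.column_stack([position, identity]) @ mix
rates += 0.1 * rng.standard_normal(rates.shape)   # measurement noise

readout, *_ = np.linalg.lstsq(rates, np.column_stack([position, identity]), rcond=None)
pred = rates @ readout
print(np.corrcoef(pred[:, 0], position)[0, 1],    # both correlations near 1
      np.corrcoef(pred[:, 1], identity)[0, 1])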

Treating everything as high-dimensional vectors is great for analogy making, like the wife - husband + king = queen example. But using fixed-size vectors to represent relationships of arbitrary dimensionality inevitably leads to compression ~= superposition. Provided those subspaces are semantically meaningful, it all works out from a generalization standpoint -- but this is then equivalent to allocating an additional axis for said relationship or attribute. Additional axes would also put less decoding burden on the downstream layers and make optimization easier.
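A toy version of the analogy arithmetic, with the attribute axes allocated explicitly (the axes and word vectors here are made up for illustration; real embeddings only learn such axes approximately):

# Toy analogy sketch: word vectors built as sums of explicit attribute axes,
# so wife - husband + king lands exactly on queen.
import numpy as np

rng = np.random.default_rng(2)
d = 50
axis = {name: rng.standard_normal(d) for name in ["person", "royal", "female"]}

vec = {
    "king":    axis["person"] + axis["royal"],
    "queen":   axis["person"] + axis["royal"] + axis["female"],
    "husband": axis["person"],
    "wife":    axis["person"] + axis["female"],
}

target = vec["wife"] - vec["husband"] + vec["king"]   # = person + royal + female
best = max(vec, key=lambda w: np.dot(vec[w], target) /
           (np.linalg.norm(vec[w]) * np.linalg.norm(target)))
print(best)   # "queen": the analogy is linear because "female" has its own axis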

Google has demonstrated this sort of allocation in transformers. It's also prevalent in the cortex. The trick is getting it to work!

(*) GPT-4 is unlikely to have more than an order of magnitude more 'neurons'; by the same counting, PaLM-540B has only 18432 * 118 ~= 2.17 M. Given that GPT-4 is something like 3-4x larger than PaLM, it should have roughly 6.5-8.7 M neurons, which is still ~3 orders of magnitude fewer than the ~16 B neurons of the human neocortex (never mind the cerebellum ;-)
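The back-of-envelope arithmetic, for the record (the d_model and layer counts for GPT-3 and PaLM are published; the GPT-4 multiplier is just the guess above):

# Residual-stream "neuron" counts: d_model * n_layers.
gpt3 = 12_288 * 96                    # ~1.18 M
palm = 18_432 * 118                   # ~2.17 M
gpt4_guess = (3 * palm, 4 * palm)     # ~6.5-8.7 M, pure speculation
neocortex = 16e9                      # ~16 B neurons, rough figure
print(gpt3, palm, gpt4_guess, neocortex / max(gpt4_guess))   # ratio ~2000x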

(**) I'm of two minds on LayerNorm. PV interneurons might be seen to do something like this, but it's all local -- you don't need everything to be vector rotations. (LayerNorm effectively removes degrees of freedom: mean-centering alone drops it to a 12287-dimensional subspace, and the variance normalization pins the norm on top of that.)
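A quick numerical check of that constraint (my sketch, omitting the learned gain and bias): after mean-centering and variance normalization, every token vector has zero mean and essentially fixed norm, i.e. it lives on a sphere inside a hyperplane.

# LayerNorm constraint check: output has ~zero mean and norm ~= sqrt(d).
import numpy as np

d = 12_288
x = np.random.default_rng(3).standard_normal(d)

def layernorm(v, eps=1e-5):              # gain/bias omitted for clarity
    v = v - v.mean()                     # project onto the zero-mean hyperplane
    return v / np.sqrt(v.var() + eps)    # pin the per-token variance

y = layernorm(x)
print(y.mean(), np.linalg.norm(y), np.sqrt(d))   # ~0, and norm ~= sqrt(d)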

Update: After reading https://transformer-circuits.pub/2023/monosemantic-features/index.html, I find the idea of local manifolds / local codes to be quite appealing: why not represent sparse yet conditional features using superposition?  This also expands the possibility of pseudo-hierarchical representation, which is great.
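For reference, the extraction method in that link is (as I read it) a sparse autoencoder over MLP activations: a wide ReLU encoder plus linear decoder trained to reconstruct the activations under an L1 sparsity penalty, so superposed activations get re-expressed in an overcomplete, sparser basis. A minimal PyTorch sketch of that general recipe, with made-up sizes and hyperparameters:

# Minimal sparse-autoencoder sketch (toy sizes, not the paper's setup).
import torch
import torch.nn as nn

d_act, d_dict, l1_coeff = 512, 4096, 1e-3

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_act, d_dict)   # overcomplete dictionary of features
        self.dec = nn.Linear(d_dict, d_act)

    def forward(self, x):
        f = torch.relu(self.enc(x))           # sparse feature activations
        return self.dec(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

x = torch.randn(64, d_act)                    # stand-in for a batch of MLP activations
x_hat, f = sae(x)
loss = ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()   # reconstruction + sparsity
opt.zero_grad()
loss.backward()
opt.step()

Each column of sae.dec.weight is then a candidate feature direction in activation space, to be inspected for interpretability.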