I extended this to a 2×2 design (two languages × two content types) and the result is even starker: by layer 10, cross-language same-content pairs are more similar than same-language different-content pairs. The model cares about what you're saying, not what language you're saying it in.
This is also what makes layer duplication work — those mid-stack layers operate in a space where input and output distributions match, so you can loop through them without breaking anything. The encoding and decoding boundaries are where the blue walls show up in the heatmaps.
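The 2×2 comparison above can be sketched with toy vectors. This is purely illustrative, not the post's actual experiment: in the real setup the vectors would be mid-stack hidden states pulled from the model, but the shape of the comparison is the same — content as a large shared component, language as a smaller offset.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two hidden-state vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for layer-10 hidden states; in the real experiment these
# would come from running the model on each sentence and reading the
# residual stream at a mid-stack layer.
rng = np.random.default_rng(0)
content_a = rng.normal(size=256)      # shared "meaning" direction
content_b = rng.normal(size=256)
lang_en = 0.3 * rng.normal(size=256)  # smaller language-specific offset
lang_es = 0.3 * rng.normal(size=256)

en_a = content_a + lang_en  # English, content A
es_a = content_a + lang_es  # Spanish, content A
en_b = content_b + lang_en  # English, content B

# Cross-language same-content vs. same-language different-content:
cross_lang_same_content = cosine(en_a, es_a)
same_lang_diff_content = cosine(en_a, en_b)
print(cross_lang_same_content > same_lang_diff_content)
```

If content dominates the representation, the cross-language same-content pair wins, which is exactly the pattern the heatmaps show by layer 10.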
> The model cares about what you're saying, not what language you're saying it in.
How many languages is the model trained on? And how many training-set sentences? I believe these numbers are vastly different, and the cosine similarity is overwhelmingly biased by the number of sentences. What if we equalized the number of languages and the number of sentences in the training set? A galaxy-wise LLM, so to say.
Also, the model can't help but care about language, because your work shows a divergence of cosine similarity at the decoding (output) stage(s).
Would this be sort of like saying the way embeddings of different primitives across languages end up distributed in a vector space all follows the same principles and "laws"?
For example, if I train on a large corpus of English and, separately, a large corpus of Spanish, will the language constructs that are equivalent across the two end up represented using the same vector-space patterns in both cases?
I think this says something interesting about how transformers organise computation internally. The mid-stack reasoning circuits are coherent enough that you can loop through them twice without distribution mismatch. The encoding/decoding boundaries are not.
The work here is obviously more complex than that, but it suggests something similar is going on, with the early layers transforming the input into some sort of generalized basis functions defining a universal language representation.
The RYS (repeat yourself) hypothesis is that duplicating (the right) layers is enough to improve performance (sorry for not reading closely enough; is it really just stacking the relevant layers?).
The ERD (encoding, reasoning, decoding) layer structure is a relatively robust observation? That the middle layers of the network reason in a universal space, and this is kinda evidenced by the cosine similarities of the hidden states at each layer given similar or dissimilar inputs. And that similar inputs converge by layer 5, and you can kinda watch that happen in the cosine similarities?
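The "watch it converge" part is just a per-layer cosine between the hidden states of two inputs. A minimal sketch with synthetic states standing in for a model's `output_hidden_states` (the convergence schedule here is made up to mimic the described layer-5 effect, not measured):

```python
import numpy as np

def layerwise_similarity(states_a, states_b):
    """Cosine similarity of the hidden states at each layer, for two
    inputs run through the same model."""
    sims = []
    for ha, hb in zip(states_a, states_b):
        sims.append(float(ha @ hb / (np.linalg.norm(ha) * np.linalg.norm(hb))))
    return sims

# Synthetic stand-in for per-layer hidden states: two paraphrases that
# start in different surface forms and converge onto a shared
# representation by mid-stack.
rng = np.random.default_rng(1)
shared = rng.normal(size=128)
states_a, states_b = [], []
for layer in range(12):
    w = min(1.0, layer / 5)  # fully converged by layer 5 (by construction)
    states_a.append(w * shared + (1 - w) * rng.normal(size=128))
    states_b.append(w * shared + (1 - w) * rng.normal(size=128))

sims = layerwise_similarity(states_a, states_b)
print([round(s, 2) for s in sims])  # climbs toward ~1.0 by layer 5
```

With real models you'd get `states_a`/`states_b` from a forward pass with hidden-state outputs enabled and then plot exactly this curve per layer pair.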
This post is incredible and I'm afraid it'll drop off the front page before people engage deeply with it. (The methodology was interesting, maybe there's other big ideas I'm missing.)
But the methodology to measure it and put numbers on which layers are most involved in encoding/decoding and where the reasoning takes place is very valuable.
The finding that the phases are more cleanly separated in large-ish models is interesting. I wonder what this could mean for embedding models? Usually we take small LLMs and chop off the last couple of layers to get an embedding model. But I wonder if you could get better embedding models using something like the first five layers of Qwen3.5-27B, or the first X layers of Kimi K2.5? The methodology in the article seems to give a straightforward way to find the optimal cutting point.
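Finding that cutting point could be as simple as scanning a per-layer probe score and stopping where the gains flatten. A minimal sketch; the scores, threshold, and heuristic are all made up for illustration, not taken from the article:

```python
def pick_cut_layer(per_layer_score, tol=0.01):
    """Return the first layer after which the per-layer probe score stops
    improving by more than `tol` -- a crude 'end of encoding' detector."""
    for i in range(1, len(per_layer_score)):
        if per_layer_score[i] - per_layer_score[i - 1] < tol:
            return i - 1
    return len(per_layer_score) - 1

# Hypothetical probe scores (e.g. agreement with an STS benchmark),
# one per layer of a small model:
scores = [0.31, 0.52, 0.68, 0.79, 0.84, 0.845, 0.846, 0.84]
print(pick_cut_layer(scores))  # 4 -- gains flatten after layer 4
```

You'd then keep layers 0..4 of the base model and pool their output to get the embedding.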
Though beware that the increased scores on math and EQ could come at the cost of other areas scoring less well; I'd love to see how these models score across all the open benchmarks.
Do you remember the names of the previous experiments done on this? Would love to take a look.
Has some interesting GitHub links.
Thanks for this research. I remember being stunned when Goliath showed up and .. worked; this feels like an under-explored research area right now.
I've been thinking about the implications of this for local generation -- what's really nice about a repeated layer is that it takes up no extra memory -- and therefore it works well on the edge.
Can you suggest some exploration angles on the edge side? I've recently started looking at fixing expert layers for an entire generation run as interesting - basically you pay the memory cost once for loading in selected experts - and I think RYS type thinking is a natural extension of this. If you've got some ideas, I'm all ears.
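The "no extra memory" point is just aliasing: repeated blocks are the same object, so only the forward-pass count grows. A minimal sketch with a toy residual block (the block, the repeat range, and the dimensions are all illustrative):

```python
import numpy as np

class Layer:
    """Stand-in for a transformer block: one weight matrix in a residual."""
    def __init__(self, dim, rng):
        self.w = rng.normal(size=(dim, dim)) / np.sqrt(dim)

    def __call__(self, h):
        return h + np.tanh(h @ self.w)

rng = np.random.default_rng(0)
layers = [Layer(64, rng) for _ in range(8)]

# RYS-style stack: run (hypothetical) mid-stack layers 2..5 a second
# time, by *reference* -- the repeats share weights with the originals,
# so depth grows by four blocks with zero extra parameter memory.
stack = layers[:6] + layers[2:6] + layers[6:]

unique_params = len({id(layer) for layer in stack})
print(len(stack), unique_params)  # 12 forward passes, still 8 weight sets
```

In a real runtime (e.g. a GGUF or Exllama graph) the same idea applies: the repeated layers point at the already-loaded tensors, so the edge device pays only extra compute and KV/activation cost, not extra weights.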
It has some similarities to an MoE architecture, but instead of choosing experts, it chooses layer routes. Training this classifier network together with the LLM could condense the number of layers required for a given level of intelligence down drastically, if it works. If anyone wants to work on this, feel free to send me a message.
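One way to picture the layer-routing idea: a small router scores each layer, and high-scoring layers get looped an extra time. A minimal sketch, entirely hypothetical (random weights, a fixed sigmoid threshold, and a one-extra-pass policy; a real version would train the router jointly with the model):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_layers = 32, 6
layers = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(n_layers)]
router_w = rng.normal(size=(dim, n_layers)) / np.sqrt(dim)  # the "classifier"

def forward(h, repeat_threshold=0.5):
    """Run each layer once, plus a second pass when the router votes for
    it -- choosing layer routes instead of experts."""
    route = 1 / (1 + np.exp(-(h @ router_w)))  # per-layer repeat scores
    for i, w in enumerate(layers):
        h = h + np.tanh(h @ w)
        if route[i] > repeat_threshold:  # loop this layer once more
            h = h + np.tanh(h @ w)
    return h

out = forward(rng.normal(size=dim))
print(out.shape)
```

Like MoE routing, the per-example decision is cheap; unlike MoE, every "expert" is a layer already resident in memory.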
I have pushed basic code to GitHub (https://github.com/dnhkng/RYS)
Some interesting areas to explore might be a combination of deleting some layers and duplicating others, i.e. reduce VRAM by dropping some layers (this works and is well documented) and recover the lost performance by duplicating others (which costs no extra VRAM). I am not pursuing this, but it seems interesting!
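That combination boils down to an execution schedule over layer indices. A tiny sketch; the specific indices here are made up for illustration, not taken from the post:

```python
# Hypothetical recipe: drop two late layers to save VRAM, then duplicate
# two mid-stack "reasoning" layers (free, same weights) to recover quality.
n_layers = 32
drop = {22, 23}    # layers removed outright (saves their VRAM)
repeat = {12, 13}  # layers to run twice (costs compute, not memory)

schedule = []
for i in range(n_layers):
    if i in drop:
        continue
    schedule.append(i)
    if i in repeat:
        schedule.append(i)  # second pass over the same weights

print(len(schedule))  # 32 - 2 dropped + 2 repeats = 32 forward passes
```

The scan methodology from the post could then rank candidate (drop, repeat) pairs by probe score to find combinations that net out positive.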
He learnt Icelandic in a week and had a fluent conversation on their national TV to prove it. (This is nuts; that language is extremely difficult to pick up, with nasal sounds etc.)
Of course, I guess it's not even close to average for a human to have such abilities, but I wonder if at some point LLMs and AI algorithms and models might shed light on these kinds of abstractions (like some mentioned in the comments about image-recognition algos) that might help humans actually learn these things themselves, train on them, and perhaps even be taught such a thing as a skill.
It turns out this does not help (somewhat surprisingly).
The residual connections resemble the Euler method (this observation led to Neural ODEs, IIRC), which isn't known to be an especially accurate integrator. If the model has been trained to have a particular number of layers, adding more layers will also add a lot of noise.
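For readers who haven't seen the analogy: a pre-norm residual block updates the hidden state the same way a single explicit Euler step advances an ODE, with the layer playing the role of the vector field and an implicit step size of one:

```latex
h_{l+1} = h_l + f_l(h_l)
\qquad \text{vs.} \qquad
x_{t+\Delta t} = x_t + \Delta t \, f(x_t, t), \quad \Delta t = 1
```

Under this reading, inserting or repeating layers changes the effective step count of a discretization the weights were trained for, which is one intuition for why naive duplication can add noise.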
Ultimately, the LLM will need to be fine-tuned with the loops, or a looped architecture trained from scratch, such as <https://ouro-llm.github.io>. Unfortunately, they made the mistake of looping the entire LLM rather than just the center portion.
I am working with TurboDerp to integrate this into the Exllama v3 format.
The probes I used seem to help identify good configurations, but they are quite noisy. A small probe set was initially used to make the scan tractable, and then the higher-ranked models were retested on a set ~10x larger.
As in, this entire cloud buildout is unnecessary because it becomes like using a calculator.
Reach out to chat.