Fresh Hacker News | Show HN: Duplicate 3 layers in a 24B LLM, logical deduction .22→.76. No training

▲Show HN: Duplicate 3 layers in a 24B LLM, logical deduction .22→.76. No training(github.com)

132 points by xlayn 13 hours ago | 28 comments

▲simgt 1 hour ago

> I replicated David Ng's RYS method [...] found something I didn't expect.

> Transformers appear to have discrete "reasoning circuits" — contiguous blocks of 3-4 layers that act as indivisible cognitive units. Duplicate the right block and the model runs its reasoning pipeline twice. No weights change. No training. The model just thinks longer.

How did you not expect that if you read his post? That's literally what he discovered, two years ago.

For anyone interested, there's more meat in the post and comments from last week: https://news.ycombinator.com/item?id=47322887

▲regularfry 32 minutes ago

That's explicitly not the unexpected part. Read the rest of the post.

▲Lerc 16 minutes ago

That weird part is kind of what I was expecting.

This goes to the thing that I posted on the thread a couple of days ago. https://news.ycombinator.com/item?id=47327132

What you need is a mechanism to pick the right looping pattern, Then it really does seem to be Mixture of experts on a different level.

Break the model into input path, thinking, output path. and make the thinking phase a single looping layer of many experts. Then the router gets to decide 13,13,14,14,15,15,16.

Training the router left as an exercise to the reader.

▲4bpp 8 hours ago

Assuming the benchmarks are sound (rather than capturing a fluke), the provided explanation still does not pass the smell test. As far as I can tell, there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n, unless perhaps these layers were initialised as identity and the training process did not get to change them much. (Plausible for middle layers?)

Considering this, I think (again, assuming the benchmarks themselves are sound) the most plausible explanation for the observations is (1) the layers being duplicated are close to the identity function on most inputs; (2) something happened to the model in training (RLHF?) that forcefully degraded its reasoning performance; (3) the mechanism causing the degradation involves the duplicated layers, so their duplication has the effect of breaking the reasoning-degrading mechanism (e.g. by clobbering a "refusal" "circuit" that emerged in post-training).

More concisely, I'm positing that this is an approach that can only ever break things, and rather than boosting reasoning, it is selectively breaking things deleterious to reasoning.

▲kirill5pol 7 hours ago

Basically all of them are using residual connections so it’s not that surprising honestly

▲Karuma 9 hours ago

Wow, every single word in the original post and on that README.md is pure LLM. How sad.

In any case, this has been done at least since the very first public releases of Llama by Meta... It also works for image models. There are even a few ComfyUI nodes that let you pick layers to duplicate on the fly, so you can test as many as you want really quickly.

▲xlayn 8 hours ago

Fair point on the writing style, I used Claude extensively on this project, including drafting. The experiments and ideas are mine though.

On the prior art: you're right that layer duplication has been explored before. What I think is new here is the systematic sweep toolkit + validation on standard benchmarks (lm-eval BBH, GSM8K, MBPP) showing exactly which 3 layers matter for which model. The Devstral logical deduction result (0.22→0.76) was a surprise to me.

If there are ComfyUI nodes that do this for image models, I'd love links, the "cognitive modes" finding (different duplication patterns that leads to different capability profiles from the same weights) might be even more interesting for diffusion models.

▲abhikul0 4 hours ago

I only know of this one: https://github.com/shootthesound/comfyUI-Realtime-Lora. Haven't played with any layer manipulation though.

▲kgeist 7 hours ago

Heh, for a couple last days, I've been doing this exact kind of "neuroanatomy" on Qwen2.5/Qwen3 too. Fascinating stuff. To make it easier to fiddle with the network, I created a small inference engine that is stripped of all the framework magic, just raw matmuls and all (main inference loop is just 50 lines of code!). For example, it's trivial to remove a layer: i just skip it in code with a simple "if". I've found that removing some layers doesn't appear to change anything (based on the vibes at least). If you remove some later layers, the model forgets how to insert the EOS token and keeps chatting ad finitum (still coherently). Removing earliest layers makes the model generate random garbage. Turns out abliteration is not hard to do, 10 examples was enough to find the refusal vector and cancel most refusals. Interestingly, I've found that refusal happens in the middle layers too (I think, layer 12 out of 26)

From what I understand, transformers are resistant to network corruption (without complete collapse) thanks to residual connections.

I tried to repeat some layers too but got garbage results. I guess I need to automate finding the reasoning layers too, instead of just guessing.

▲gmerc 2 hours ago

Hook it up in autoresearch?

▲taliesinb 9 hours ago

There is an obvious implication: since the initial models were trained without loops, it is exceedingly unlikely that a single stack of consecutive N layers represents only a single, repeatable circuit that can be safely looped. It is much more likely that the loopable circuits are superposed across multiple layers and have different effective depths.

That you can profitably loop some say 3-layer stack is likely a happy accident, where the performance loss from looping 3/4 of mystery circuit X that partially overlaps that stack is more than outweighed by the performance gain from looping 3/3 of mystery circuit Y that exactly aligns with that stack.

So, if you are willing to train from scratch, just build the looping in during training and let each circuit find its place, in disentangled stacks of various depths. Middle of transformer is:

(X₁)ᴹ ⊕ (Y₁∘Y₂)ᴺ ⊕ (Z₁∘Z₂∘Z₃)ᴾ ⊕ …

Notation: Xᵢ is a layer (of very small width) in a circuit of depth 1..i..D, ⊕ is parallel composition (which sums the width up to rest of transformer), ∘ is serial composition (stacking), and ᴹ is looping. The values of ᴹ shouldnt matter as long as they are > 1, the point is to crank them up after training.

Ablating these individual circuits will tell you whether you needed them at all, but also roughly what they were for in the first place, which would be very interesting.

▲taliesinb 9 hours ago

And i bet these would be useful in initial and final parts of transformer too. Because syntactic parsing and unparsing of brackets, programming language ASTs, etc is highly recursive; no doubt current models are painfully learning "unrolled" versions of the relevant recursive circuits, unrolled to some fixed depth that must compete for layers with other circuits, since your total budget is 60 or whatever. Incredibly duplicative and by definition unable to generalize to arbitrary depth!

▲awwaiid 8 hours ago

Maybe another idea, no idea if this is a thing, you could pick your block-of-layers size (say... 6) and then during training swap those around every now and then at random. Maybe that would force the common api between blocks, specializaton of the blocks, and then post training analyze what each block is doing (maybe by deleting it while running benchmarks).

▲taliesinb 9 hours ago

Amusingly, you need only have circuits of prime depth, though you should probably adjust their widths using something principled, perhaps Euler's totient function.

▲Imanari 1 hour ago

Fascinating! I wonder if new training techniques could emerge from this. If we say layer-1=translater, layer2-5=reasoner, layer6 retranslater, could we train small 6 layer models but evaluate their performance in a 1>n*(2-5)>6 setup to directly train towards optimal middle-layers that can be looped? You'd only have to train 6 layers but get the duplication-benefit of the middle layers for free.

▲SyzygyRhythm 10 hours ago

If running twice is good, then is running N times even better? I wonder if you could even loop until some kind of convergence, say hitting a fixed point (input equals output). I wonder if there's even a sort of bifurcation property where it sometimes loops A->A->A, but other times A->B->A, or more, rather like the logistic map fractal.

▲xlayn 9 hours ago

I explored that, again with Devstral, but the execution with 4 times the same circuit lead to less score on the tests.

I chat with the model to see if the thing was still working and seemed coherent to me, I didn't notice anything off.

I need to automate testing like that, where you pick the local maxima and then iterate over that picking layers to see if it's actually better, and then leave the thing running overnight

▲smusamashah 2 hours ago

Can Karpathy's autoresearch be used on this to explore what works and what does not? That is supposed to automate research like this from what I understand.

▲imtringued 2 hours ago

That's how deep equilibrium models were discovered.

Whats's more. It was found out that you only need a single looped layer to be equivalent to a multi layer network.

▲christianqchung 8 hours ago

Why test on Qwen 2.5 when Qwen 3 has been out for about a year, and Qwen 3.5 for a month? My problem with this is ironically entirely vibes based: that for some reason, LLMs love to talk about Qwen 2.5 instead of anything newer.

▲dhsorens30 2 hours ago

the token costs are real. we switched to smaller models for 80% of tasks and barely noticed

▲snats 8 hours ago

you can also have removed layers of models and keep the same score in benchmarks [1].

i feel that sometimes a lot of the layers might just be redundant and are not fully needed once a model is trained.

[1] https://snats.xyz/pages/articles/pruningg.html

▲kristianp 7 hours ago

The method used here by David Ng, was discussed a few days ago at https://news.ycombinator.com/item?id=47322887

▲nowittyusername 9 hours ago

There's still a lot of low hanging fruit left IMO. Good find and rather funny to think about as you can have someone simply clone the various layers multiple times and instead of spending millions of dollars retraining the model increase performance significantly with "this one trick".

▲skerit 36 minutes ago

I've been running my own (admittedly naïve) experiments of new, wacky ideas for both LLMs (well, SLMs) and for Image-Super-Resolution models.

I'm just trying different kinds of attention mechanisms, different configurations of the network, adding loops, ... All kind of wacky ideas. And the real weird thing is that 99% of the ideas I try work at all.

▲xlayn 9 hours ago

The other interesting point is that right now I'm copy pasting the layers, but a patch in llama.cpp can make the same model now behave better by a fact of simply following a different "flow" without needing more vram...

if this is validated enough it can eventually lead to ship some kind of "mix" architecture with layers executed to fit some "vibe?"

Devstral was the first one I tried and optimize for math/eq, but that din't result in any better model, then I added the reason part, and that resulted in "better" model

I used the devstral with the vibe.cli and it look sharp to me, thing didn't fail, I also used the chat to "vibe" check it and look ok to me.

The other thing is that I pick a particular circuit and that was "good" but I don't know if it was a local maxima, I think I ran just like 10 sets of the "fast test harness" and pick the config that gave the most score... once I have that I use that model and run it against the llm_eval limited to only 50 tests... again for sake of speed, I didn't want to wait a week to discover the config was bad

▲woadwarrior01 10 hours ago

Reminds me of Solar 10.7B, which was a very good model for its size ~2 year ago and the "Depth Up-Scaling" technique behind it. Although, that involved continued training after repeating the layers.

https://arxiv.org/abs/2312.15166

▲rao-v 10 hours ago

I’d love to believe this is real, but I’m pretty sure you will lose performance on a “fair” mix of tasks, even after fine tuning. I know multiple teams have explored recurrent layers (great for limited VRAM) but I don’t think it’s ever been found to be optimal.

▲zhangchen 9 hours ago

this lines up with what pruning papers have been finding, the middle layers carry most of the reasoning weight and you can often drop the outer ones without much loss. cool to see the inverse also works, just stacking them for extra passes.

▲getnormality 7 hours ago

Didn't we recently see another hack, where you could get better performance by repeating the prompt?

I wonder if they work for similar reasons.

▲colejhudson 10 hours ago

Would you be able to publish the individual benchmarks for Qwen2.5-Coder-32B? GSM8K specifically would be useful to look at.

▲xlayn 9 hours ago

I published the results for devstral... results folder of the github https://github.com/alainnothere/llm-circuit-finder/tree/main...

I'm using the following configuration --tasks gsm8k_cot,ifeval,mbpp,bbh_cot_fewshot_logical_deduction_five_objects,mbpp I did also try humaneval but something in the harness is missing and failed...

notice that I'm running 50 tests for each task, mostly because of time limitation as it takes like two hours to validate the run for the base model and the modified one.

I'll also try to publish the results of the small tests harness when I'm testing the multiple layers configurations, for reference this is phi-4-Q6_K.gguf, still running, I'm now giving more importance to the Reason factor, the reason factor comes from running a small subset of all the problems in the task config above

Initially I tried the approach of the highest math/eq but in resulted in models that were less capable overall with the exception of math, and math like in the original research is basically how good was the model at giving you the answer of a really though question, say the cubic root of some really large number... but that didn't translate to the model being better at other tasks...

  Config  | Lyr | Math   | EQ    | Reas   | Math Δ  | EQ Δ  | Reas Δ  | Comb Δ
  --------|-----|--------|-------|--------|---------|-------|---------|-------
  BASE    |   0 | 0.7405 | 94.49 | 94.12% |     --- |   --- |     --- |    ---
  (6,9)   |   3 | 0.7806 | 95.70 | 94.12% | +0.0401 | +1.21 |  +0.00% |  +1.21
  (9,12)  |   3 | 0.7247 | 95.04 | 94.12% | -0.0158 | +0.55 |  +0.00% |  +0.55
  (12,15) |   3 | 0.7258 | 94.14 | 88.24% | -0.0147 | -0.35 |  -5.88% |  -6.23
  (15,18) |   3 | 0.7493 | 95.74 | 88.24% | +0.0088 | +1.25 |  -5.88% |  -4.63
  (18,21) |   3 | 0.7204 | 93.40 | 94.12% | -0.0201 | -1.09 |  +0.00% |  -1.09
  (21,24) |   3 | 0.7107 | 92.97 | 88.24% | -0.0298 | -1.52 |  -5.88% |  -7.41
  (24,27) |   3 | 0.6487 | 95.27 | 88.24% | -0.0918 | +0.78 |  -5.88% |  -5.10
  (27,30) |   3 | 0.7180 | 94.65 | 88.24% | -0.0225 | +0.16 |  -5.88% |  -5.73
  (30,33) |   3 | 0.7139 | 94.02 | 94.12% | -0.0266 | -0.47 |  +0.00% |  -0.47
  (33,36) |   3 | 0.7104 | 94.53 | 94.12% | -0.0301 | +0.04 |  +0.00% |  +0.04
  (36,39) |   3 | 0.7017 | 94.69 | 94.12% | -0.0388 | +0.20 |  +0.00% |  +0.20
  (6,10)  |   4 | 0.8125 | 96.37 | 88.24% | +0.0720 | +1.88 |  -5.88% |  -4.01
  (9,13)  |   4 | 0.7598 | 95.08 | 94.12% | +0.0193 | +0.59 |  +0.00% |  +0.59
  (12,16) |   4 | 0.7482 | 93.71 | 88.24% | +0.0076 | -0.78 |  -5.88% |  -6.66
  (15,19) |   4 | 0.7617 | 95.16 | 82.35% | +0.0212 | +0.66 | -11.76% | -11.10
  (18,22) |   4 | 0.6902 | 92.27 | 88.24% | -0.0504 | -2.23 |  -5.88% |  -8.11
  (21,25) |   4 | 0.7288 | 94.10 | 88.24% | -0.0117 | -0.39 |  -5.88% |  -6.27
  (24,28) |   4 | 0.6823 | 94.57 | 88.24% | -0.0583 | +0.08 |  -5.88% |  -5.80
  (27,31) |   4 | 0.7224 | 94.41 | 82.35% | -0.0181 | -0.08 | -11.76% | -11.84
  (30,34) |   4 | 0.7070 | 94.73 | 94.12% | -0.0335 | +0.23 |  +0.00% |  +0.23
  (33,37) |   4 | 0.7009 | 94.38 |100.00% | -0.0396 | -0.12 |  +5.88% |  +5.77
  (36,40) |   4 | 0.7057 | 94.84 | 88.24% | -0.0348 | +0.35 |  -5.88% |  -5.53
  (6,11)  |   5 | 0.8168 | 95.62 |100.00% | +0.0762 | +1.13 |  +5.88% |  +7.02
  (9,14)  |   5 | 0.7245 | 95.23 | 88.24% | -0.0160 | +0.74 |  -5.88% |  -5.14
  (12,17) |   5 | 0.7825 | 94.88 | 88.24% | +0.0420 | +0.39 |  -5.88% |  -5.49
  (15,20) |   5 | 0.7832 | 95.86 | 88.24% | +0.0427 | +1.37 |  -5.88% |  -4.52
  (18,23) |   5 | 0.7208 | 92.42 | 88.24% | -0.0197 | -2.07 |  -5.88% |  -7.95
  (21,26) |   5 | 0.7055 | 92.89 | 88.24% | -0.0350 | -1.60 |  -5.88% |  -7.48
  (24,29) |   5 | 0.5825 | 95.04 | 94.12% | -0.1580 | +0.55 |  +0.00% |  +0.55
  (27,32) |   5 | 0.7088 | 94.18 | 88.24% | -0.0317 | -0.31 |  -5.88% |  -6.19
  (30,35) |   5 | 0.6787 | 94.69 | 88.24% | -0.0618 | +0.20 |  -5.88% |  -5.69
  (33,38) |   5 | 0.6650 | 94.96 | 88.24% | -0.0755 | +0.47 |  -5.88% |  -5.41
  (6,12)  |   6 | 0.7692 | 95.39 | 94.12% | +0.0287 | +0.90 |  +0.00% |  +0.90
  (9,15)  |   6 | 0.7405 | 94.65 | 94.12% | -0.0000 | +0.16 |  +0.00% |  +0.16
  (12,18) |   6 | 0.7582 | 94.57 | 88.24% | +0.0177 | +0.08 |  -5.88% |  -5.80
  (15,21) |   6 | 0.7828 | 93.52 | 88.24% | +0.0423 | -0.98 |  -5.88% |  -6.86
  (18,24) |   6 | 0.7308 | 92.93 | 94.12% | -0.0097 | -1.56 |  +0.00% |  -1.56
  (21,27) |   6 | 0.6791 | 92.54 | 82.35% | -0.0615 | -1.95 | -11.76% | -13.72

▲XCSme 9 hours ago

But if it got worse on other tests, it doesn't do much good, right?

▲ekianjo 9 hours ago

Which tests are worse?

▲XCSme 9 hours ago

Hard to tell, they only mention a few ones that got better, not clear results on others

▲xlayn 9 hours ago

You can check here the results for Devstral, speed limits me, but these are the results for the first 50 tests of the command

  # Run lm-evaluation-harness
  lm_eval --model local-chat-completions \
      --model_args model=test,base_url=http://localhost:8089/v1/chat/completions,num_concurrent=1,max_retries=3,tokenized_requests=False \
      --tasks gsm8k_cot,ifeval,mbpp,bbh_cot_fewshot_logical_deduction_five_objects,mbpp \
      --apply_chat_template --limit 50 \
      --output_path ./eval_results

▲m3kw9 8 hours ago

What, just randomly choose some "layer" and duplicate it and give some arbitrary reasoning went from 0.2 -> 0.7, i don't know man. You need to use real benchmarks.

▲3eb7988a1663 7 hours ago

Someone recently posted the exact same idea to much acclaim: https://news.ycombinator.com/item?id=47322887

▲ 9 hours ago

▲Singlaw 8 hours ago

What does this do?

▲elonisaass 1 hour ago

[dead]

▲rafaamaral 8 hours ago

[flagged]

▲builderhq_io 2 hours ago

[dead]

▲Iamkkdasari74 2 hours ago

[dead]

▲realaliarain74 7 hours ago

[dead]

▲accesspatchh 10 hours ago

[flagged]

▲rogerrogerr 10 hours ago

> Don't post generated comments or AI-edited comments. HN is for conversation between humans.

▲dafelst 10 hours ago

Why are people running AI bots to make comments on HN?