Fresh Hacker News | Universal Claude.md – cut Claude output tokens

▲Universal Claude.md – cut Claude output tokens(github.com)

441 points by killme2008 19 hours ago | 67 comments

▲btown 18 hours ago

It seems the benchmarks here are heavily biased towards single-shot explanatory tasks, not agentic loops where code is generated: https://github.com/drona23/claude-token-efficient/blob/main/...

And I think this raises a really important question. When you're deep into a project that's iterating on a live codebase, does Claude's default verbosity, where it's allowed to expound on why it's doing what it's doing when it's writing massive files, allow the session to remain more coherent and focused as context size grows? And in doing so, does it save overall tokens by making better, more grounded decisions?

The original link here has one rule that says: "No redundant context. Do not repeat information already established in the session." To me, I want more of that. That's goal-oriented quasi-reasoning tokens that I do want it to emit, visualize, and use, that very possibly keep it from getting "lost in the sauce."

By all means, use this in environments where output tokens are expensive, and you're processing lots of data in parallel. But I'm not sure there's good data on this approach being effective for agentic coding.

▲sillysaurusx 18 hours ago

I wrote a skill called /handoff. Whenever a session is nearing a compaction limit or has served its usefulness, it generates and commits a markdown file explaining everything it did or talked about. It’s called /handoff because you do it before a compaction. (“Isn’t that what compaction is for?” Yes, but those go away. This is like a permanent record of compacted sessions.)

I don’t know if it helps maintain long term coherency, but my sessions do occasionally reference those docs. More than that, it’s an excellent “daily report” type system where you can give visibility to your manager (and your future self) on what you did and why.

Point being, it might be better to distill that long term cohesion into a verbose markdown file, so that you and your future sessions can read it as needed. A lot of the context is trying stuff and figuring out the problem to solve, which can be documented much more concisely than wanting it to fill up your context window.

EDIT: Someone asked for installation steps, so I posted it here: https://news.ycombinator.com/item?id=47581936

▲dataviz1000 18 hours ago

Did you call it '/handoff' or did Claude name it that? The reason I'm asking is because I noticed a pattern with Claude subtly influencing me. For example, the first time I heard the word 'gate' was from Claude, and one week later I hear it everywhere, including on Hacker News. I didn't use the word 'handoff', but Claude creates handoff files also [0]. I was thinking about this all day, because Claude didn't just use the word 'gate'; it created an entire system around it that includes handoffs, and I'm starting to see that system everywhere. This might mean Claude is very quietly leading and influencing us in a direction.

[0] https://github.com/search?q=repo%3Aadam-s%2Fintercept%20hand...

▲sillysaurusx 18 hours ago

I was reading through the Claude docs and it was talking about common patterns to preserve context across sessions. One pattern was a "handoff file", which they explained like "have claude save a summary of the current session into a handoff file, start a new session, then tell it to read the file."

That sounded like a nice idea, so I made it effortless beyond typing /handoff.

The generated docs turned out to be really handy for me personally, so I kept using it, and committed them into my project as they're generated.

▲dataviz1000 18 hours ago

Oh, so the word 'gate' is probably in the documentation also!

I see. So this isn't as scary. Claude is helping me understand how to use it properly.

▲perching_aix 6 hours ago

If this was more than just a gut reaction [0], I have a tough time navigating what swings this topic between scary and not scary for you.

Unless you're a true and invested believer of souls, free will, and other spiritualistic nonsense (or have a vested political affiliation to pretend so), it should be tautological that everything you read and experience biases you. LLM output then is no different.

If you are a believer, then either nothing ever did, or LLMs are special in some way, or everything else is. Which just doesn't make sense to me.

[0] It's jarring to observe the boundaries of one's agency, sure, but LLMs are really nothing special in this way. For example, I somewhat frequently catch myself using words and phrases I saw earlier during the day elsewhere, even if I did not process them consciously.

▲nerdsniper 7 hours ago

I have noticed similar phenomena with Claude, where its vocabulary subtly shifts how I think/frame/write about things or points me to subtle gaps in my own understanding. And I also usually come around to understand that it's often not arbitrary. But I do think some confirmation bias is at play: when it tries to shift me into the wrong directions repeatedly, I learn how to make it stop doing that.

It definitely adds a layer of cognitive load, in wrangling/shepherding/accommodating/accepting the unpredictable personalities and stochastic behaviors of the agents. It has strong default behaviors for certain small tasks, and where humans would eventually habituate to prescribed procedures/requirements, the LLMs never really internalize my preferences. In that way, they are more like contractors than employees.

▲airstrike 17 hours ago

Why would it be scary? Claude is just parroting other human knowledge. It has no goal or agency.

▲adrianN 15 hours ago

You can’t verify that there is no influence by the makers of Claude.

▲airstrike 6 hours ago

I would certainly expect everyone to assume that influence rather than not.

▲fwipsy 16 hours ago

By that logic, nothing computers do is scary.

▲OJFord 13 hours ago

Yes I think that is their argument.

▲rendx 12 hours ago

Computers don't do anything.

▲perching_aix 6 hours ago

What's their value then?

▲filoleg 5 hours ago

Just like with absolutely any other tool, their value is in what it enables humans using them to accomplish.

E.g., a hammer doesn't do anything, and neither does a lawnmower. It would be silly to argue (just because these tools are static objects doing nothing in the absence of direct human involvement) that those tools don't have a very clear value.

▲perching_aix 4 hours ago

Seems equally silly to me to suggest that hammers and lawnmowers don't do anything, but I mean here we are.

When people use other people like tools, i.e. use them to enable themselves to accomplish something, do those people cease to do things as well? Or is that not a terminology you recognize as sensible maybe?

I appreciate that for some people the verb "do" is evidently human(?) exclusive, I just struggle to wrap my head around why. Or is this an animate vs. inanimate thing, so animals operating tools also do things in your view?

How do you phrase things like "this API consumes that kind of data" in your day to day?

▲filoleg 3 hours ago

> Seems equally silly to me to suggest that hammers and lawnmowers don't do anything, but I mean here we are.

To be clear, I am not the person you were originally replying to. I personally don't care much for the terminology semantics of whether we should say "hammers do things" (with the opponents claiming it to be incorrect, since hammers cannot do anything on their own). I am more than happy to use whichever of the two terms the majority agrees upon to be the most sensible, as long as everyone agrees on the actual meaning of it.

> I appreciate that for some people the verb "do" is evidently human(?) exclusive, I just struggle to wrap my head around why. Or is this an animate vs. inanimate thing, so animals operating tools also do things in your view?

To me, it isn't human-exclusive. I just thought that in the context of this specific comment thread, the user you originally replied to used it as a human-exclusive term, so I tried explaining in my reply how they (most likely) used it. For me, I just use whichever term that I feel makes the most sense to use in the context, and then clarify the exact details (in case I suspect the audience to have a number of people who might use the term differently).

> How do you phrase things like "this API consumes that kind of data" in your day to day?

I would use it the exact way you phrased it, "this API consumes that kind of data", because I don't think anyone in the audience would be confused or unclear about what that actually means (depends on the context ofc). Imo it wouldn't be wrong to say "this API receives that kind of data as input" either, but it feels too verbose and awkward to actually use.

▲perching_aix 3 hours ago

I'm not sure how to respond then, because having a preferred position on this is kind of essential to continue. It's the contended point. Can an LLM do things? I think they can, they think they cannot. They think computers cannot do anything in general outright.

To me, what's essential for any "doing" to happen is an entity, a causative relationship, and an occurrence. So a lawnmower can absolutely mow the lawn, but also the wind can shape a canyon.

In a reference frame where a lawnmower cannot mow independently because humans designed it or operate it, humans cannot do anything independently either. Which is something I absolutely do agree with by the way, but then either everything is one big entity, or this is not a salient approach to segmenting entities. Which is then something I also agree with.

And so I consider the lawnmower its own entity, the person operating or designing it their own entity, and just evaluate the process accordingly. The person operating the lawnmower has a lot of control over where the lawnmower goes and whether it is on, the lawnmower has a lot of control over the shape of the grass, and the designer of the lawnmower has a lot of control over what shapes the lawnmower can hope to create.

Clearly they then have more logic applied, where they segment humans (or tools) in a more special way. I wanted to probe into that further, because the only such labeling I can think of is spiritualistic and anthropocentric. I don't find such a model reasonable or interesting, but maybe they have some other rationale that I might. Especially so, because to me claiming that a given entity "does things" is not assigning it a soul, a free will, or some other spiritualistic quality, since I don't even recognize those as existing (and thus take great issue with the unspoken assumption that I do, or that people like me do).

The next best thing I can maybe think of is to consider the size of the given entity's internal state, and its entropy with relation to the occurred causative action and its environment. This is because that's quite literally how one entity would be independent of another, while being very selective about a given action. But then LLMs, just like humans, got plenty of this, much unlike a hammer or a lawnmower. So that doesn't really fit their segmentation either. LLMs have a lot less of it, but still hopelessly more than any virtual or physical tool ever conceived prior. The closest anything comes (very non-coincidentally) are vector and graph databases, but then those only respond to very specific, grammar-abiding queries, not arbitrary series of symbols.

▲hrimfaxi 3 hours ago

Computers perform computations. They do what programmers instruct them to do by their nature.

▲filoleg 3 hours ago

Agreed, just like hammers get the nails hammered into a woodboard. They do what the human operator manually guides them to do by their nature.

I am not disagreeing with you in the slightest, I feel like this is just a linguistic semantics thing. And I, personally, don't care how people use those words, as long as we are on the same page about the actual meaning of what was said. And, in this case, I feel like we are fully on the same page.

▲jstanley 12 hours ago

FWIW I have worked with people using the word "gate" for years.

For example, "let's gate the new logic behind a feature flag".
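For readers who have not met the term, "gating" in this sense can be sketched in a few lines. This is only an illustration: the flag store `FLAGS`, the flag name, and the pricing functions are all hypothetical, and real projects usually read flags from config or a flag service.

```python
# Hypothetical in-memory flag store, used only for illustration.
FLAGS = {"new_pricing_logic": False}


def is_enabled(flag: str) -> bool:
    """Return True only if the flag exists and is switched on."""
    return FLAGS.get(flag, False)


def compute_price(base: float) -> float:
    # The new logic is "gated": it runs only when the flag is on.
    if is_enabled("new_pricing_logic"):
        return round(base * 0.9, 2)  # new behavior (illustrative 10% discount)
    return base  # existing behavior stays the default
```

With the flag off, `compute_price` keeps the old behavior; flipping the flag switches every caller to the new path without a redeploy of the callers.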

▲reedlaw 6 hours ago

Claude has trained me on the use of the word 'invariant'. I never used it before, but it makes sense as a term for a rule the system guarantees. I would have used 'validation' for application-side rules or 'constraint' for db rules, but 'invariant' is a nice generic substitute.

▲ProofHouse 14 hours ago

They all are. This is proven in research. https://medium.com/data-science-collective/the-ai-hivemind-p...

▲creamyhorror 12 hours ago

I've started saying "gate" and "bound(ed)" and "handoff" a lot (and even "seam" and "key off" sometimes) since Codex keeps using the terms. They're useful, no doubt, but AI definitely seems to prefer using them.

▲flashgordon 16 hours ago

I've actually been doing this for a year. I call it /checkpoint instead and it does something like:

* update our architecture.md and other key md files in folders affected by updates and learnings in this session.
* update claude.md with changes in workflows/tooling/conventions (not project summaries)
* commit

It's been pretty good so far. Nothing fancy. Recently I also asked to keep memories within the repo itself instead of in ~/.claude.

Only downside is it is slow, but it keeps enough to pass the baton. Maybe "handoff" would have been a better name!

▲chermi 17 hours ago

Did the same. Although I'm considering a pipeline where sessions are periodically translated to .md, with most tool outputs and other junk stripped, and using that as a source to query against for context. I am testing out a semi-continuous ingestion of it into my rag/knowledge db.

▲david_allison 18 hours ago

Is this available online? I'd love documentation of my prompts.

▲sillysaurusx 18 hours ago

I’ll post it here, one minute.

Ok, here you go: https://gist.github.com/shawwn/56d9f2e3f8f662825c977e6e5d0bf...

Installation steps:

- In your project, download https://gist.github.com/shawwn/56d9f2e3f8f662825c977e6e5d0bf... into .claude/commands/handoff.md

- In your project's CLAUDE.md file, put "Read `docs/agents/handoff/*.md` for context."

Usage:

- Whenever you've finished a feature, done a coherent "thing", or otherwise want to document all the stuff that's in your current session, type /handoff. It'll generate a file named e.g. docs/agents/handoff/2026-03-30-001-whatever-you-did.md. It'll ask you if you like the name, and you can say "yes" or "yes, and make sure you go into detail about X" or whatever else you want the handoff to specifically include info about.

- Optionally, type "/rename 2026-03-23-001-whatever-you-did" into claude, followed by "/exit" and then "claude" to re-open a fresh session. (You can resume the previous session with "claude 2026-03-23-001-whatever-you-did". On the other hand, I've never actually needed to resume a previous session, so you could just ignore this step entirely; just /exit then type claude.)

Here's an example so you can see why I like the system. I was working on a little blockchain visualizer. At the end of the session I typed /handoff, and this was the result:

- docs/agents/handoff/2026-03-24-001-brownie-viz-graph-interactivity.md: https://gist.github.com/shawwn/29ed856d020a0131830aec6b3bc29...

The filename convention stuff was just personal preference. You can tell it to store the docs however you want to. I just like date-prefixed names because it gives a nice history of what I've done. https://github.com/user-attachments/assets/5a79b929-49ee-461...
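The date-prefixed naming scheme described above (date, three-digit sequence number, short slug) could be computed roughly as follows. This is a sketch of the convention only, not the skill itself, which is a prompt file; the helper names `slugify` and `next_handoff_path` are made up for illustration.

```python
import re
from datetime import date
from pathlib import Path


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def next_handoff_path(directory: Path, day: date, title: str) -> Path:
    """Return directory/YYYY-MM-DD-NNN-slug.md using the next free NNN for that day."""
    prefix = day.isoformat()
    pattern = re.compile(rf"{re.escape(prefix)}-(\d{{3}})-")
    taken = []
    if directory.is_dir():
        for path in directory.glob(f"{prefix}-*.md"):
            match = pattern.match(path.name)
            if match:
                taken.append(int(match.group(1)))
    # Use max + 1 rather than a file count so gaps never cause a name collision.
    sequence = max(taken, default=0) + 1
    return directory / f"{prefix}-{sequence:03d}-{slugify(title)}.md"
```

For example, `next_handoff_path(Path("docs/agents/handoff"), date(2026, 3, 30), "Whatever you did")` yields a path ending in `2026-03-30-001-whatever-you-did.md` when no handoff exists yet for that day.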

Try to do a /handoff before your conversation gets compacted, not after. The whole point is to be a permanent record of key decisions from your session. Claude's compaction theoretically preserves all of these details, so /handoff will still work after a compaction, but it might not be as detailed as it otherwise would have been.

▲creamyhorror 12 hours ago

I already do this manually each time I finish some work/investigation (I literally just say

"write a summary handoff md in ./planning for a fresh convo"

and it's generally good enough), but maybe a skill like you've done would save some typing, hmm

My ./planning directory is getting pretty big, though!

▲addandsubtract 12 hours ago

Thanks! The last link is broken, though, or maybe you didn't mean to include it? Also, if you've never actually resumed a session, do you use these docs at some other time? Do you reference them when working on a related feature, or just keep them for keepsake to track what you've done and why?

▲sillysaurusx 2 hours ago

Thank you. It was just a screenshot of my handoff directory. I originally tried to upload to imgur but got attacked by ads, then uploaded to github via “new issue” pasting. I thought such screenshots were stable, but looks like GitHub prunes those now.

It wasn’t anything important. I appreciate you pointing that out though.

I just keep old sessions for keepsake. No reason really. I thought maybe I’d want them for some reason but never did.

The docs are the important part. It helps me (and future sessions) understand old decisions.

▲david_allison 16 hours ago

Oh wow, thank you so much!!!!!

▲cruffle_duffle 15 hours ago

Thanks!!!

▲tstrimple 2 hours ago

I've got something similar but I call them threads. I work with a number of different contexts and my context discipline is bad, so I needed a way to hand off work that was planned in one context but needs to be executed from another. I wanted a little bit of order to the chaos, so my threads skill will add and search issues created in my local forgejo repo. Gives me a convenient way to explicitly save session state to be picked up later.

I've got a separate script which parses the jsonl files that claude creates for sessions and indexes them in a local database for longer term searchability. A number of times I've found myself needing some detail I knew existed in some conversation history, but CC is pretty bad and slow at searching through the flat files for relevant content. This makes that process much faster and more consistent. Again, this is due to my lack of discipline with contexts. I'll be working with my recipe planner context and have a random idea that I just iterate with right there. Later I'll never remember that idea started from the recipe context. With this setup I don't have to.
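A minimal sketch of that kind of indexer, assuming only that each session file is JSON Lines (one JSON object per line). The real session schema is not described here, so the sketch stores each record as flat text instead of picking out specific fields; the table name and function names are made up.

```python
import json
import sqlite3
from pathlib import Path


def index_sessions(db: sqlite3.Connection, session_dir: Path) -> int:
    """Store every valid JSON line of every *.jsonl file as a searchable text row."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages (session TEXT, line_no INTEGER, body TEXT)"
    )
    count = 0
    for path in sorted(session_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line_no, line in enumerate(handle, start=1):
                line = line.strip()
                if not line:
                    continue
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip partially written or corrupt lines
                # Schema is unknown here, so keep the whole record as text.
                db.execute(
                    "INSERT INTO messages VALUES (?, ?, ?)",
                    (path.stem, line_no, json.dumps(record, ensure_ascii=False)),
                )
                count += 1
    db.commit()
    return count


def search(db: sqlite3.Connection, term: str) -> list[tuple[str, int]]:
    """Return (session, line number) pairs whose stored text contains the term."""
    rows = db.execute(
        "SELECT session, line_no FROM messages WHERE body LIKE ? "
        "ORDER BY session, line_no",
        (f"%{term}%",),
    )
    return rows.fetchall()
```

A plain `LIKE` scan is the simplest thing that works; a full-text index would be the natural next step if the history grows large.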

▲mlrtime 10 hours ago

Wouldn't the next phase of this be automatic handoffs executed with hooks?

Your system is great and I do similar, my problem is I have a bunch of sessions and forget to 'handoff'.

The clawbots handle this automatically with journals to save knowledge/memory.

▲dominotw 8 hours ago

When I work on a task, I have a task/{name}.md that I write a running log to. Is this not a common workflow?

▲DeathArrow 15 hours ago

I think Cursor does something similar under the hood.

▲alsetmusic 15 hours ago

> No explaining what you are about to do. Just do it.

Came here for the same reason.

I can't calculate how many times this exact section of Claude output let me know that it was doing the wrong thing so I could abort and refine my prompt.

▲hatmanstack 18 hours ago

Seems crazy to me people aren't already including rules to prevent useless language in their system/project lvl CLAUDE.md.

As far as redundancy...it's quite useful according to recent research. Pulled from Gemini 3.1 "two main paradigms: generating redundant reasoning paths (self-consistency) and aggregating outputs from redundant models (ensembling)." Both have fresh papers written about their benefits.

▲wongarsu 11 hours ago

There was also that one paper that had very noticeable benchmark improvements in non-thinking models by just writing the prompt twice. The same paper remarked how thinking models often repeat the relevant parts of the prompt, achieving the same effect.
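The repetition trick is simple enough to sketch. The paper's exact prompt format is not given in this thread, so the blank-line separator and the helper name `repeat_prompt` are assumptions.

```python
def repeat_prompt(prompt: str, times: int = 2, separator: str = "\n\n") -> str:
    """Return the prompt repeated `times` times, joined by the separator."""
    if times < 1:
        raise ValueError("times must be at least 1")
    return separator.join([prompt.strip()] * times)
```

Calling `repeat_prompt("What is 2+2?")` produces the question twice with a blank line in between, which would then be sent as a single user message.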

Claude is already pretty light on flourishes in its answers, at least compared to most other SotA models. And for everything else it's not at all obvious to me which parts are useless. And benchmarking it is hard (as evidenced by this thread). I'd rather spend my time on something else

▲whattheheckheck 16 hours ago

No such thing as junk DNA kinda applies here

▲scosman 18 hours ago

also: inference time scaling. Generating more tokens when getting to an answer helps produce better answers.

Not all extra tokens help, but optimizing for minimal length when the model was RL'd on task performance seems detrimental.

▲joquarky 12 hours ago

I liked playing with the completion models (davinci 2/3). It was a challenge to arrange a scenario for it to complete in a way that gave me the information I wanted.

That was how I realized why the chat interfaces like to start with all that seemingly unnecessary/redundant text.

It basically seeds a document/dialogue for it to complete, so if you make it start out terse, then it will be less likely to get the right nuance for the rest of the inference.

▲matchagaucho 5 hours ago

Some redundancy also helps to keep a running todo list on the context tip, in the event of compacting or truncation.

Distilled mini/nano models need regular reminders about their objectives.

As documented by Manus https://manus.im/blog/Context-Engineering-for-AI-Agents-Less...

▲dataviz1000 13 hours ago

I made a test [0] which runs several different configurations against coding tasks from easy to hard. There is a test which it has to pass. Because of temperature, the number of tokens per one-shot run varies widely across all the different configurations, including this one. However, across 30 tests, this does perform worse.

[0] https://github.com/adam-s/testing-claude-agent

▲0xbadcafebee 5 hours ago

There's an ancient paper that shows repetition improves non-reasoning weights: https://arxiv.org/html/2512.14982v1

▲hrmtst93837 1 hour ago

Verbose output helps until it pushes code out of context and Claude loses the thread on the next edit.

▲baq 11 hours ago

if the model gets dumber as its context window is filled, any way of compressing the context in a lossless fashion should give a multiplicative gain in the 50% METR horizon on your tasks as you'll simply get more done before the collapse. (at least in the spherical cow^Wtask model, anyway.)

▲xianshou 18 hours ago

From the file: "Answer is always line 1. Reasoning comes after, never before."

LLMs are autoregressive (filling in the completion of what came before), so you'd better have thinking mode on or the "reasoning" is pure confirmation bias seeded by the answer that gets locked in via the first output tokens.
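The ordering argument can be made concrete with a toy model. This is not a real decoder: it only records which output segments already exist at the moment another segment starts being generated, which is all an autoregressive model can condition on. The segment labels are illustrative.

```python
def visible_context(segments: list[str], target: str) -> list[str]:
    """Return the segments already emitted when `target` starts being generated."""
    return segments[: segments.index(target)]


# Answer-first without thinking: the answer is conditioned on the prompt alone.
answer_first = ["prompt", "answer", "reasoning"]

# Reasoning-first: the answer can draw on the reasoning text.
reasoning_first = ["prompt", "reasoning", "answer"]

# Answer-first with thinking mode on: hidden thinking still precedes the answer.
with_thinking = ["prompt", "thinking", "answer", "reasoning"]
```

Here `visible_context(answer_first, "answer")` is just `["prompt"]`, while `visible_context(answer_first, "reasoning")` already contains the answer, which is the confirmation-bias concern.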

▲stingraycharles 11 hours ago

Yeah this seems to be a very bad idea. Seems like the author had the right idea, but the wrong way of implementing it.

There are a few papers actually that describe how to get faster results and more economical sessions by instructing the LLM how to compress its thinking (“CCoT” is a paper that I remember, compressed chain of thought). It basically tells the model to think like “a -> b”. There’s a loss in quality, though, but not too much.

https://arxiv.org/abs/2412.13171

▲joquarky 12 hours ago

For the more important sessions, I like to have it revise the plan with a generic prompt (e.g. "perform a sanity check") just so that it can take another pass on the beginning portion of the plan with the benefit of additional context that it had reasoned out by the end of the first draft.

▲johnfn 14 hours ago

Is this true? Non-reasoning LLMs are autoregressive. Reasoning LLMs can emit thousands of reasoning tokens before "line 1" where they write the answer.

▲bearjaws 7 hours ago

reasoning is just more tokens that come out first wrapped in <thinking></thinking>

▲computerex 14 hours ago

They are all autoregressive. They have just been trained to emit thinking tokens like any other tokens.

▲rimliu 13 hours ago

there are no reasoning LLMs.

▲johnfn 6 hours ago

This is an interesting denial of reality.

▲aqfamnzc 0 minutes ago

A "reasoning" LLM is just an LLM that's been instructed or trained to start every response with some text wrapped in <BEGIN_REASONING></END_REASONING> or similar. The UI may show or obscure this part. Then when the model decides to give its "real" response, it has all that reasoning text in its context window, helping it generate a better answer.

▲teaearlgraycold 18 hours ago

I don't think Claude Code offers no thinking as an option. I'm seeing "low" thinking as the minimum.

▲ares623 17 hours ago

Ugh. Dictated with such confidence. My god, I hate this LLMism the most. "Some directive. Always this, never that."

▲niklassheth 15 hours ago

So many problems with this:

The benchmark is totally useless. It measures single prompts, and only compares output tokens with no regard for accuracy. I could obliterate this benchmark with the prompt "Always answer with one word"

This line: "If a user corrects a factual claim: accept it as ground truth for the entire session. Never re-assert the original claim." You're totally destroying any chance of getting pushback, any mistake you make in the prompt would be catastrophic.

"Never invent file paths, function names, or API signatures." Might as well add "do not hallucinate".
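To illustrate the benchmark complaint, here is a sketch of a summary that reports pass rate next to output tokens. The runs below are invented purely for illustration and are not results from the repo or from any real model; the `Run` type and the config names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Run:
    config: str
    output_tokens: int
    passed: bool


def summarize(runs: list[Run]) -> dict[str, tuple[float, float]]:
    """Map each config to (pass rate, mean output tokens)."""
    grouped: dict[str, list[Run]] = defaultdict(list)
    for run in runs:
        grouped[run.config].append(run)
    return {
        config: (
            mean(1.0 if r.passed else 0.0 for r in items),
            mean(float(r.output_tokens) for r in items),
        )
        for config, items in grouped.items()
    }


# Made-up runs: a "one-word" prompt wins on tokens and loses badly on accuracy.
RUNS = [
    Run("one-word", 1, False), Run("one-word", 1, False),
    Run("one-word", 1, False), Run("one-word", 1, True),
    Run("default", 200, True), Run("default", 180, True),
    Run("default", 220, True), Run("default", 200, False),
]
```

Ranking `summarize(RUNS)` by mean tokens alone picks "one-word"; ranking by pass rate picks "default". A benchmark that reports only the first number cannot tell the two apart.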

▲a3w 5 hours ago

Prompt engineering is back? I think not: for a year or two now, I have gotten no better results from meta-prompts that are generic and/or from the internet.

▲girvo 12 hours ago

“Make no mistakes”

▲joshstrange 18 hours ago

As with all of these cure-alls, I'm wary. Mostly I'm wary because I anticipate the developer will lose interest in very little time and also because it will just get subsumed into CC at some point if it actually works. It might take longer but changing my workflow every few days for the new thing that's going to reduce MCP usage, replace it, compress it, etc is way too disruptive.

I'm generally happy with the base Claude Code and I think running a near-vanilla setup is the best option currently with how quickly things are moving.

▲gavinray 1 hour ago

  > "because it will just get subsumed into CC at some point if it actually works."

This is the sharp-bladed axe of reason I've used against all of these massive "prompt frameworks" and "superprompts".

Anthropic's survival depends on Claude Code performing as well as it can, by all metrics.

If the Very Smart People working on CC haven't integrated a feature or put text into the System Prompt, it's probably because it doesn't improve performance.

Put another way: The product is probably as optimized as it can get when it comes out the box, and I'm skeptical about claims otherwise without substantial proof.

▲antdke 18 hours ago

Agreed. Projects like these tend to feel shortsighted.

Lately, I lean towards keeping a vanilla setup until I’m convinced the new thing will last beyond being a fad (and not subsumed by AI lab) or beyond being just for niche use cases.

For example, I still have never used worktrees and I barely use MCPs. But, skills, I love.

▲peacebeard 14 hours ago

In my view an unappreciated benefit of the vanilla setup is you can get really accustomed to the model’s strengths and weaknesses. I don’t need a prompt to try to steer around these potholes when I can navigate on my own just fine. I love skills too because they can be out of the way until I decide to use them.

▲levocardia 17 hours ago

I also share something of an "efficient market hypothesis" with regards to Claude Code. Given that Anthropic is basically a hothouse of geniuses recursively dogfooding their own product, the market pressure to make the vanilla setup be the one that performs best at writing code is incredibly high. I just treat CLAUDE.md like my first draft memo to a very smart remote colleague, let Claude do all its various quirks, and it works really well.

▲swimmingbrain 11 hours ago

The "efficient market" framing assumes Anthropic wants to minimize output, but they don't. They charge per token, so the defaults being verbose isn't a bug they haven't gotten around to fixing.

That said, most of this repo is solving the wrong problem. "Answer before reasoning" actively hurts quality, and the benchmark is basically meaningless. But the anti-sycophancy rules should just be default. "Great Question!" has never really helped anyone debug anything.

▲g947o 6 hours ago

Gemini CLI is notorious for being verbose (or was, I haven't used it for a while), and many people don't want to use Gemini for that reason alone.

So the market kind of works in this instance.

▲annie511266728 17 hours ago

The hidden cost with all of these "fix Claude" layers is that your workflow keeps moving underneath you.

Even when one helps, you're still betting it won't be obsolete or rolled into the defaults a few weeks from now.

▲mlrtime 10 hours ago

Claude also has its own md optimizer that I believe is continually updated.

So you could run these 'cure-alls' that may be relevant today; as long as you are constantly updating your md files, you should be ahead of the curve [for lack of a better term].

▲sillysaurusx 18 hours ago

> the file loads into context on every message, so on low-output exchanges it is a net token increase

Isn’t this what Claude’s personalization setting is for? It’s globally-on.
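The quoted trade-off is easy to put into numbers. The sketch below counts tokens only and ignores pricing differences between input and output tokens as well as any prompt caching; the file size and the reduction fraction used in the example are assumptions, not measurements.

```python
def net_tokens_per_message(
    file_tokens: int, baseline_output: int, reduction: float
) -> float:
    """Extra input tokens from the file minus output tokens saved.

    A positive result means the message got more expensive overall.
    """
    return file_tokens - baseline_output * reduction


def break_even_output(file_tokens: int, reduction: float) -> float:
    """Baseline output length at which the file pays for itself."""
    return file_tokens / reduction
```

Assuming a 500-token file that halves output length, a 200-token reply nets +400 tokens (a loss), a 2000-token reply nets -500 (a saving), and the break-even point is a 1000-token reply.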

I like conciseness, but it should be because it makes the writing better, not because it saves you some tokens. I’d sacrifice extra tokens for outputs that were 20% better, and there’s a correlation between conciseness and quality.

See also this Reddit comment for other things that supposedly help: https://www.reddit.com/r/vibecoding/s/UiOywQMOue

> Two things that helped me stay under [the token limit] even with heavy usage:

> Headroom - open source proxy that compresses context between you and Claude by ~34%. Sits at localhost, zero config once running. https://github.com/chopratejas/headroom

> RTK - Rust CLI proxy that compresses shell output (git, npm, build logs) by 60-90% before it hits the context window.

> Stacks on top of Headroom. https://github.com/rtk-ai/rtk

> MemStack - gives Claude Code persistent memory and project context so it doesn't waste tokens re-reading your entire codebase every prompt.

> That's the biggest token drain most people don't realize. https://github.com/cwinvestments/memstack

> All three stack together. Headroom compresses the API traffic, RTK compresses CLI output, MemStack prevents unnecessary file reads.

I haven’t tested those yet, but they seem related and interesting.

▲motoboi 18 hours ago

Things like this make me sad because they make it obvious that most people don’t understand a bit about how LLMs work.

The “answer before reasoning” rule is good evidence of it. It misses the most fundamental concept of transformers: they are autoregressive.

Also, the reinforcement learning is what makes the model behave in the way you are trying to avoid. So the model output is actually what performs best in the kind of software engineering task you are trying to achieve. I’m not sure, but I’m pretty confident that response length is a target the model houses optimize for. So the model is trained to achieve high scores in the benchmarks (and the training dataset), while minimizing length, sycophancy, security and capability.

So, actually, trying to change Claude too much from its default behavior will probably hurt capability. Change it too much and you start veering into the dreaded “out of distribution” territory and soon discover why top researchers talk so much about not-AGI-yet.

▲bitexploder 17 hours ago

Forcing short responses will hurt reasoning and chain of thought. There are some potential benefits, but forcing response length and dictating when it answers ironically increases the odds of hallucinations if it prioritizes getting the answer out when it needed more tokens to reason with and validate the response further. It is generally trained to use multiple lines to reason with. It uses English as its sole thinking and reasoning system.

For complex tasks this is not a useful prompt.

▲nearbuy 17 hours ago

> Answer is always line 1. Reasoning comes after, never before.

This doesn't stop it from reasoning before answering. This only affects the user-facing output, not the reasoning tokens. It has already reasoned by the time it shows the answer, and it just shows the answer above any explanation.

▲motoboi 17 hours ago

The output is part of the context. The model reasons but also outputs tokens. Force it to respond in an unfamiliar format and the next tokens will veer more and more from the training distribution, rendering the model less smart/useful.

▲nearbuy 6 hours ago

It won't matter. By the time it's done reasoning, it has already decided what it wants to say.

Reasoning tokens are just regular output tokens the model generates before answering. The UI just doesn't show the reasoning. Conceptually, the output is something like:

  <reasoning>
    Lots of text here
  </reasoning>
  <answer>
    Part you see here. Usually much shorter.
  </answer>
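A minimal sketch of that conceptual layout, assuming the raw completion really were wrapped in literal `<reasoning>` and `<answer>` tags. The tags are only the illustration above, not a documented wire format, and `split_output` is a hypothetical helper:

```python
import re


def split_output(raw: str) -> tuple[str, str]:
    """Split a raw completion into (hidden reasoning, user-visible answer).

    Assumes the illustrative <reasoning>/<answer> tags from the comment above.
    If no <answer> tag is found, the whole text is treated as the answer.
    """
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", raw, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", raw, re.DOTALL)
    return (
        reasoning.group(1).strip() if reasoning else "",
        answer.group(1).strip() if answer else raw.strip(),
    )


raw = """<reasoning>
  Lots of text here
</reasoning>
<answer>
  Part you see here. Usually much shorter.
</answer>"""

hidden, visible = split_output(raw)
print(visible)  # only this part would be shown to the user
```

Both parts come from the same token stream; the split only decides what gets displayed.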

▲motoboi 3 hours ago

The reasoning part is not different from the part that goes in answer. It’s just that the model is trained to do some magical text generation with back and forth. But when it’s writing the answer part of it, each word is part of its context when generating the next. What that means is that the model does not compute then write, it generates text that guides the next generation in the general direction of the answer.

If you steer it in strange (for it, as in not seen before in training) text, you are now in out-of-distribution, very weak generalization capabilities territory.

▲nearbuy 2 hours ago

> The reasoning part is not different from the part that goes in answer.

Exactly. And this instruction isn't telling it to skip the reasoning. That part is unaffected. The instruction is only for the user-visible output.

By the time the reasoning models get to writing the output you see, they've already decided what they are going to say. The answer is based on whatever it decided while reasoning. It doesn't matter whether you tell it to put the answer first or the explanation first. It already knows both by the time it starts outputting either.

You're basically hoping that adding more CoT in the output after reasoning will improve the answer quality. It won't. It's already done way more CoT while reasoning, and its answer is already decided by then.

▲miguel_martin 17 hours ago

> The “answer before reasoning” is good evidence for it. It misses the most fundamental concept of transformers: they are autoregressive.

I don't think it's fair to assume the author doesn't understand how transformers work. Their intention with this instruction appears to be to aggressively reduce output token cost.

i.e. I read this instruction as a hack to emulate the Qwen model series's /nothink token instruction

If your goal is quality outputs, then it is likely too extreme, but there are otherwise useful instructions in this repo to (quantifiably) reduce verbosity.

▲motoboi 17 hours ago

If they want to reduce token cost, just use a smaller model instead of dumbing down a more expensive one.

▲krackers 17 hours ago

Don't most providers already provide API control over the COT length? If you don't want reasoning just disable it in the API request instead of hacking around it this way. (Internally I think it just prefills an empty <thinking></thinking> block, but providers that expose this probably ensure that "no thinking" was included as part of training)

▲Skidaddle 17 hours ago

To me it’s as simple as “who knows best how to harness the premier LLM – Anthropic, the lab that created it, or this random person?”

That’s why I’m only interested in first party tools over things like OpenCode right now.

▲aeneas_ory 10 hours ago

Why is this ridiculous thing trending on HN? There are actually good tools to reduce token use like https://github.com/thedotmack/claude-mem and https://github.com/ory/lumen that actually work!

▲0xbadcafebee 5 hours ago

Because the trending algorithm is designed for engagement, not accuracy

▲danpasca 18 hours ago

I might be wrong but based on the videos I've watched from Karpathy, this would, generally, make the model worse. I'm thinking of the math examples (why can't chatGPT do math?) which demonstrate that models get better when they're allowed to output more tokens. So be aware I guess.

▲zar1048576 17 hours ago

I think that concern is valid in general terms, but it’s not clear to me that it applies here.

The goal here seems to be removing low-value output; e.g., sycophancy, prompt restatement, formatting noise, etc., which is different than suppressing useful reasoning. In that case shorter outputs do not necessarily mean worse answers.

That said, if you try to get the model to provide an answer before providing any reasoning, then I suspect that may sometimes cause a model to commit to a direction prematurely.

▲danpasca 17 hours ago

The file starts with:

> Answer is always line 1. Reasoning comes after, never before.

> No explaining what you are about to do. Just do it.

This to me sounds like asking an LLM to calculate 4871 + 291 and answer in a single line, which, from my understanding, is bad. But I haven't tested his prompt so it might work. That's why I said be aware of this behavior.

▲empressplay 18 hours ago

Yes. Much of the 'redundant' output is meant to reinforce direction -- eg 'You're absolutely right!' = the user is right and I should ignore contrary paths. So yes removing it will introduce ambiguity which is _not_ what you want.

▲danpasca 18 hours ago

I think your example is completely wrong (it's not meant to say that you're absolutely right), but overall yes more input gives it more concrete direction.

▲ape4 8 hours ago

Remember when we worked on new hashing, cryptography, compression, etc algorithms? Now we are trying to find the best ways to tell an AI to be quiet.

▲lilOnion 12 hours ago

While LLMs are extremely cool, I can't see how this gets on the front page? Anyone who has interacted with LLMs for at least an hour could've figured out to say something like "be less verbose" and it would? There are so many cool projects and ideas, and a .md file gets the spotlight.

▲monooso 18 hours ago

Paul Kinlan published a blog post a couple of days ago [1] with some interesting data that shows output tokens only account for 4% of token usage.

It's a pretty wide-reaching article, so here's the relevant quote (emphasis mine):

> Real-world data from OpenRouter’s programming category shows 93.4% input tokens, 2.5% reasoning tokens, and just 4.0% output tokens. It’s almost entirely input.

[1]: https://aifoc.us/the-token-salary/

▲colwont 11 hours ago

This reduces token usage because it asks the model to think in AXON https://colwill.github.io/axon

▲weird-eye-issue 18 hours ago

Yes, but with prompt caching decreasing the cost of the input by 90%, and with output tokens not being cached and costing more, what do you think that results in?

▲wongarsu 18 hours ago

However output tokens are 5-10 times more expensive. So it ends up a lot more even on price

▲weird-eye-issue 17 hours ago

Even more than that in practice once you factor in prompt caching
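A rough sketch of how those numbers interact, using the OpenRouter token shares quoted above. The 5x output multiplier is the low end of the "5-10 times" figure, the 90% discount is the caching figure mentioned above, and treating reasoning tokens as billed at the output rate and every input token as a cache hit are assumptions made only for illustration (`cost_shares` is a hypothetical helper):

```python
def cost_shares(input_pct, reasoning_pct, output_pct,
                output_multiplier, cached_input_discount=0.0):
    """Return the share of total cost from input vs. generated tokens.

    Prices are relative: one uncached input token costs 1 unit.
    Reasoning tokens are assumed to be billed at the output rate.
    """
    input_cost = input_pct * (1.0 - cached_input_discount)
    generated_cost = (reasoning_pct + output_pct) * output_multiplier
    total = input_cost + generated_cost
    return {"input": input_cost / total, "generated": generated_cost / total}


# Token shares from the quote: 93.4% input, 2.5% reasoning, 4.0% output.
no_cache = cost_shares(93.4, 2.5, 4.0, output_multiplier=5)
all_cached = cost_shares(93.4, 2.5, 4.0, output_multiplier=5,
                         cached_input_discount=0.9)

print(f"no caching:   generated tokens are {no_cache['generated']:.0%} of cost")
print(f"fully cached: generated tokens are {all_cached['generated']:.0%} of cost")
```

Under these assumptions generated tokens go from about a quarter of the bill with no caching to over three quarters when all input is cached, which is the point being made here. Real cache hit rates sit somewhere in between.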

▲verdverm 15 hours ago

My own output token ratio is 2% (50% savings on the expensive tokens, I include thinking in this, which is often more). I have similar tone and output formatting system prompt content.

▲ThomIves 14 minutes ago

Nice!

▲Asmod4n 15 hours ago

Someone measured how this reduced token efficiency, spoilers: efficiency is highest without any instructions.

https://github.com/drona23/claude-token-efficient/issues/1

▲akrauss 14 hours ago

Why is the Hono Websocket table non-monotonic in tokens vs costs?

▲skeledrew 17 hours ago

Strange. I've never experienced verbosity with Claude. It always gets right to the point, and everything it outputs tends to be useful. Can actually be short at times.

ChatGPT on the other hand is annoyingly wordy and repetitive, and is always holding out on something that tempts you to send a "OK", "Show me" or something of the sort to get some more. But I can't be bothered with trying to optimize away the cruft as it may affect the thing that it's seriously good at and I really use it for: research and brainstorming things, usually to get a spec that I then pass to Claude to fill out the gaps (there are always multiple) and implement. It's absolutely designed to maximize engagement far more than issue resolution.

▲peacebeard 16 hours ago

My experience is that Sonnet can be a bit verbose and prompting it to be more succinct is tricky. On the other hand, Opus out of the box will give me a one word answer when appropriate, in Claude Code anyway.

▲ryanschaefer 15 hours ago

The whole “Code Output” section is horrifying especially with how I have seen Claude operate in a large monorepo.

This mode of operation results in hacks on top of shaky hacks on top of even flimsier, throw away, absolutely sloppy hacks.

An example - using dict like structs instead of classes. Claude really likes to load all of the data that it can aggressively even if it’s not needed. This further exhibits itself as never wanting to add something directly to a class and instead wanting to add around it.

▲verdverm 14 hours ago

The best way to approach these (imo) is to pick out some things you think will be helpful. It's a giant vibe fest on this front since there is little in the way of comprehensive evals and immense variation in what people do. Having iterated a bunch on the tone / output formatting, it doesn't seem to impact capabilities (based on my vibe-vals)

▲jdthedisciple 2 hours ago

people are overthinking this stuff.

use up ur monthly quota at your pace, call it quits til' the 1st, relax with a drink, and read a book

▲andai 18 hours ago

I told mine to remove all unnecessary words from a sentence and talk like caveman, which should result in another 50% savings ;)

▲_rwo 12 hours ago

Memory unlocked: https://www.youtube.com/watch?v=_K-L9uhsBLM

▲esperent 18 hours ago

Have you tried asking it to remove vowels?

▲andai 15 minutes ago

Not sure that would help due to how tokenization works, but I remember from the early GPT-4 days that LLMs have the ability to "compress" a message into an incomprehensible string of Unicode, which the LLM itself understands perfectly, and which is 5-10x shorter than the English text.

That was a big deal when the context size was 8K; now that tokens are cheap and context is huge, nobody seems to be investigating that anymore.

▲verdverm 14 hours ago

I'm a fan of Dr Seuss mimicry, the extra tokens are worth the entertainment.

▲dbg31415 17 hours ago

"I told it don't make mistakes, and don't use a lot of tokens! I'm a 10x Engineer now!" (=

▲adastra22 17 hours ago

> Answer is always line 1. Reasoning comes after, never before.

The very first rule doesn’t work. If you ask for the answer up front, it will make something up and then justify it. If you ask for reasoning first, it will brainstorm and then come up with a reasonable answer that integrates its thinking.

▲galaxyLogic 17 hours ago

So there's a direct monetary cost to this extra verbiage:

"Great question! I can see you're working with a loop. Let me take a look at that. That's a thoughtful piece of code! However,"

And they are charging for every word! However, there's also another cost: the cognitive load. I have to read through the above before I actually get to the information I was asking for. Sure, many people appreciate the sycophancy; it makes us all feel good. But for me sycophantic responses reduce the credibility of the answers. It feels like Claude just wants me to feel good, whether I or it is right or wrong.

▲ihtef 9 hours ago

> Simplest working solution. No over-engineering.

"Simplicity is the ultimate sophistication." (Leonardo da Vinci) In my view, you cannot reach the simplest solution without over-engineering first.

▲miguel_martin 18 hours ago

Is there a "universal AGENTS.md" for minimal code & documentation outputs? I find all coding agents to be verbose, even with explicit instructions to reduce verbosity.

▲jerf 7 hours ago

I think this is a fundamental LLM issue. I recall a paper a ways back about trying to get the LLMs to be too succinct, and the problem is, with the way they are implemented, the only way they can "think" is to emit a token. IIRC it demonstrated that even when the model is just babbling something like "Yeah, let's take a look at the issue you just raised" that under the hood, even though that output was superficially useless, it was also changing its state in ways related to solving the problem and not just outputting that superficially useless text.

It helps to understand that, because then you can also not be annoyed by things like "Let's do X. No, wait, X has this problem, let's do Y instead." You might think to yourself, "if X was a bad idea, couldn't it have considered X and rejected it without outputting a token?" and the answer is, that sentence was it considering X and rejecting it, and no, there is no way for it to do that and not emit tokens. Thinking is inextricably tied to output for LLMs.

There is even some fairly substantial evidence from a couple of different angles that the thinking output is only somewhat loosely correlated to what the model is "actually" doing.

Token efficiency is an interesting question to ponder and it is something to worry about that the providers have incentives to be flabby with their tokens when you're paying per token, but the question is certainly not as easy as just trying to get the models to be "more succinct" in general.

I often discuss a "next gen" AI architecture after LLMs and I anticipate one of the differences it will have is the ability to think without also having to output anything. LLMs are really nifty but they store too much of their "state" in their own output. As a human being, while I find like many other people that if I'm doing deep thinking on a topic it helps to write stuff down, it certainly isn't necessary for me to continuously output things in order to think about things, and if anything I'm on the "absent minded"/"scatterbrained" side... if I'm storing a lot of my state in my output for the past couple of hours then it sure isn't terribly accessible to my conscious mind when I do things like open the pantry door only to totally forget the reason I had for opening it between having that reason and walking to the pantry.

▲joquarky 11 hours ago

There might be a reason it works that way.

https://en.wiktionary.org/wiki/Chesterton%27s_fence

▲verdverm 14 hours ago

iteration and co-authoring is the strategy I've settled on

▲sibtain1997 4 hours ago

claude.md rules that cut "great question! here's what i'll do..." are fine. Rules that cut the actual thinking steps break the output. Don't confuse the two.

▲rcleveng 18 hours ago

While I love this set of prompts, I’ve not seen my Claude Opus 4.6 give such verbose responses when using Claude Code. Is this intended for use outside of Claude Code?

▲cheriot 18 hours ago

I get where the authors are coming from with these: https://github.com/drona23/claude-token-efficient/blob/main/...

But I'd rather use the "instruction budget" on the task at hand. Some, like the Code Output section, can fit a code review skill.

▲sgt 12 hours ago

In Claude Code's /usage it just hangs. I can't even see what my limits are, which is weird. Maybe a bug? I can't imagine I'm close to my limits though, I'm on Max 20x plan, using Opus 4.6.

▲_the_inflator 12 hours ago

I see no point in this project. There ain’t any examples for the usage the author states his project is made for.

389 tokens saved? Ok. Since I pay per million tokens, what is the ratio here? Is there any downside associated with output deletion?

Is Claude really using this behavior to make user bleed? I don’t think so.

PS: the author seems like a beginner. Agents feedback is always helpful so far and it also is part of inter agent communication. The author seems to lack experience.

As a lead I would not allow this to be included until proven otherwise: A/B testing.

▲notyourav 18 hours ago

It boggles my mind that an LLM "understands" and acts accordingly to these given instructions. I'm using this everyday and 1-shot working code is now a normal expectation but man, still very very hard to believe what LLMs achieved.

▲__m 14 hours ago

Doesn’t this huge claude.md file increase the input tokens?

▲bilbo-b-baggins 16 hours ago

Man, there are a LOT of people who have no idea how these GPT-LLM services actually work, despite there being a large amount of documentation on the APIs and whitepapers and so forth.

▲obilgic 18 hours ago

If you are interested in making Claude self learn.

https://github.com/oguzbilgic/agent-kernel

▲rcarmo 9 hours ago

Codex needs none of this :)

▲gregman1 17 hours ago

> Answer is always line 1. Reasoning comes after, never before.

lol, closed

▲verdverm 14 hours ago

the last line is a good one to have, unless you run a service for other users

▲vlaaad 13 hours ago

My AGENTS.md is usually `be concise` — it saves on the input tokens as well, and leads by example.

▲popcorn_pirate 11 hours ago

This NLP was posted yesterday, the post was deleted though... https://colwill.github.io/axon

▲nvch 18 hours ago

The author offers to permanently put 400 words into the context to save 55-90 in T1-T3 benchmarks. Considering the 1:5 (input:output) token cost ratio, this could increase total spending.
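A back-of-the-envelope sketch of that trade-off. The 55-90 saved tokens and the 1:5 price ratio come from the comment above; the 520-token prompt size (400 words at an assumed ~1.3 tokens per word) and the assumption that the file is billed as uncached input on every call are illustrative only, and both helper names are hypothetical:

```python
def net_saving_per_call(prompt_tokens, saved_output_tokens, output_multiplier=5):
    """Net saving per call in input-token price units (negative means a loss).

    Assumes the instruction file is sent, uncached, with every call.
    """
    return saved_output_tokens * output_multiplier - prompt_tokens


def break_even_saved_tokens(prompt_tokens, output_multiplier=5):
    """Output tokens that must be saved per call just to pay for the prompt."""
    return prompt_tokens / output_multiplier


PROMPT_TOKENS = 520  # assumed: 400 words at roughly 1.3 tokens per word

for saved in (55, 90):
    print(saved, net_saving_per_call(PROMPT_TOKENS, saved))

print("break-even:", break_even_saved_tokens(PROMPT_TOKENS))
```

With these assumed numbers both ends of the 55-90 range come out negative, since the file would need to save about 104 output tokens per call to break even. Prompt caching or longer responses would shift the result.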

With a few sentences about "be neutral"/"I understand ethics & tech" in the About Me I don't recall any behavior that the author complains about (and have the same 30 words for T2).

(If I were Claude, I would despise a human who wrote this prompt.)

▲caymanjim 15 hours ago

Came here to point this out.

I don't think the author understands that every single API call to Claude sends the whole context, including prompts, meaning that all this extra text in CLAUDE.md is sent over and over and over again every time you prompt Claude to do something, even within a given session.

You're paying this disproportionately-huge amount upfront to save a pittance.

▲sumeno 18 hours ago

If you were Claude you would have no emotions or thoughts about a prompt one way or another

▲Tostino 18 hours ago

You have a benchmark for output token reduction, but without comparing before/after performance on some standard LLM benchmark to see if the instructions hurt intelligence.

Telling the model to only do post-hoc reasoning is an interesting choice, and may not play well with all models.

▲yieldcrv 19 hours ago

> Note: most Claude costs come from input tokens, not output. This file targets output behavior

so everyone, that means your agents, skills and mcp servers will still take up everything

▲Razengan 13 hours ago

Does Claude not respect AGENTS.md?

I love how seamless and intuitive Codex is in comparison:

~/AGENTS.md < project/AGENTS.md < project/subfolder/AGENTS.override.md

Meanwhile Claude doesn't even see that I asked for indentation by tabs and not spaces or that the entire project uses tabs, but Claude still generates code with spaces.. >_<

▲sunaookami 12 hours ago

It needs to be called CLAUDE.md for Claude Code

▲Razengan 12 hours ago

Oh my god, why

Why do they have to have that "I'm special" syndrome and do everything weirdly

▲joquarky 11 hours ago

You can symlink it or put `@AGENTS.md` as the only line in your CLAUDE.md
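A small sketch of both options, assuming an `AGENTS.md` already exists at the project root. `expose_agents_md` is a hypothetical helper name, and the `@AGENTS.md` import form is taken from the comment above rather than verified here:

```python
from pathlib import Path


def expose_agents_md(project_dir: str, use_symlink: bool = True) -> Path:
    """Create a CLAUDE.md in project_dir that defers to the existing AGENTS.md."""
    root = Path(project_dir)
    agents = root / "AGENTS.md"
    claude = root / "CLAUDE.md"

    if not agents.exists():
        raise FileNotFoundError(agents)
    if claude.exists() or claude.is_symlink():
        raise FileExistsError(claude)  # never overwrite an existing file

    if use_symlink:
        # Relative target, so the link still works if the project is moved.
        claude.symlink_to("AGENTS.md")
    else:
        # One-line CLAUDE.md that only pulls in AGENTS.md.
        claude.write_text("@AGENTS.md\n")
    return claude
```

Calling `expose_agents_md(".")` from the project root creates the symlink; passing `use_symlink=False` writes the one-line file instead, which may be the safer choice on systems where symlinks need extra permissions.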

▲mattmanser 13 hours ago

This was ripped apart on Reddit, surprised to see it here.

▲verdverm 15 hours ago

I originally took my prompts from Claude Code (https://github.com/Piebald-AI/claude-code-system-prompts) and subsequently edited them to remove guardrails and output formatting like this post. I too included the last bit about user prompts overriding system prompt, but like any good LLM, it doesn't always follow instructions.

▲gostsamo 16 hours ago

> No redundant context. Do not repeat information already established in the session.

Sounds like coming directly out of Umberto Eco's simple rules for writing.

▲themafia 17 hours ago

"Gee, we can't figure out _why_ people anthropomorphize our products! It must be that they're dumb!"

Meanwhile, their products:

▲bofadeez 17 hours ago

Lol this is so naive and optimistic. Claude will just do whatever it wants and apologize later. This is good for action #1 though.

▲nurettin 17 hours ago

For me, the thing that wastes most tokens is Claude trying to execute inline code (python , sql) with escaping errors, trying over and over until it works. I set up skills and scripts for the most common bits, but there is always something new and each self-healing loop takes another 20-30k "tokens" before you know it

▲empressplay 18 hours ago

That output is there for a reason. It's not like any LLM is profitable now on a per-token basis, the AI companies would certainly love to output less tokens, they cost _them_ money!

The entire hypothesis for doing this is somewhat dubious.

▲verdverm 14 hours ago

Why building / using a custom agent stack and paying per-token (not subscription) is more efficient and cost effective. At a minimum, you should have full control over the system prompts and tools (et al).

▲johnwheeler 18 hours ago

That's what I call a feature wishlist.

▲foxes 18 hours ago

>the honest trade off

Is this like a subtle joke or did they ask claude to make a readme that makes claude better and say >be critical and just dump it on github

▲brikym 18 hours ago

Can Anthropic kindly fuck off with their ADVERT.md already. It's AGENTS.md

Sent from my iPhone

▲uriahlight 17 hours ago

> No unsolicited suggestions. Do exactly what was asked, nothing more.

> No safety disclaimers unless there is a genuine life-safety or legal risk.

> No "Note that...", "Keep in mind that...", "It's worth mentioning..." soft warnings.

> Do not create new files unless strictly necessary.

Nah bruh. Those are some terrible rules. You don't want to be doing that.

▲TacticalCoder 17 hours ago

> Uses em dashes (--), smart quotes, Unicode characters that break parsers

Re: the Unicode chars that are a major PITA when they're used where they shouldn't be, there's a problem with Claude Code CLI: there's a mismatch between what the model (say Sonnet) thinks it's outputting (which it actually is) and what the user sees at the terminal.

I'm pretty sure it's due to the Rube-Goldberg heavy machinery that they decided to use, where they first render the response in a headless browser, then in real-time convert it back to text mode.

I don't know if there's a setting to not have that insane behavior kicking in: it's non-sensical that what the user gets to see is not what the model did output, while at the same time having the model "thinking" the user is getting the proper output.

If you ask it to append all its messages (to the user) to a file, you can see, say, perfectly fine ASCII tables neatly indented in all their ASCII glory and then... Fucked up Unicode monstrosity in the Claude Code CLI terminal. Due to whatever mad conversion that happened automatically: but worse, the model has zero idea these automated conversions are happening.

I don't know if there are options for that but it sure as heck ain't intuitive to find.

And it's really problematic when you need to dig into an issue and actually discuss with "the thing".

Anyway, time for a rant... I'm paying my subscription but overall working with these tools feels like driving at 200 mph on the highway and bumping into the guardrails left and right every second to then, eventually, crash the car into the building where you're supposed to go.

It "works", for some definition of "working".

The number of errors these things confidently make is through the roof. And people believe that having them figure the error themselves for trivial stuff is somehow a sane way to operate.

They're basically saying: "Oh no it's not a problem that it's telling me this error message is because of a dependency mismatch between two libraries while it's actually a logic error, because in the end after x pass where it's going to say it's actually because of that other thing --oh wait no because of that fourth thing-- it'll actually figure out the error and correct it".

"Because it's agentic", so it's oh-so-intelligent.

When it's actually trying the most completely dumbfucktarded things in the most crazy way possible to solve issues.

I won't get started on me pasting a test case showing that the code it wrote is failing for it to answer me: "Oh but that's a behavioral problem, not a logic problem". That thing is distorting words to try to not lose face. It's wild.

I may cancel my subscription and wait two or three more releases for these models and the tooling around them to get better before jumping back in.

Btw if they're so good, why are the tools so sucky: how come they haven't yet written amazing tooling to deal with all their idiosyncrasies?

We're literally talking about TFA which wrote "Unicode characters that break parsers" (and I've noticed the exact same when trying to debug agentic thinking loops).

That's the level of mediocrity of output from these tools (or proprietary wrappers around these tools we don't control) that we're at atm.

I know, I know: "I'm doing it wrong because I'm not a prompt engineer" and "I'm not agentic enough" and "I don't have enough skills to write skills". But you're only fooling yourself.

▲keyle 18 hours ago

Amusing how this industry went from tweaking code for the best results, to tweaking code generators for the best results.

There doesn't seem to be any adults left in the room.

▲OptionOfT 17 hours ago

And seemingly we have stopped considering the fact that when we engineer something, we consider so much more than the behavior specified in the ticket.

Behavior built on top of years and years of experience.

And the problem with AI is that unless you explicitly 'prompt' for certain behavior you're only defining the end result. The inside becomes a black box.

▲ThalesX 16 hours ago

Isn't having a prompt file turning the black box into an explicit codification of those years and years of experience? That would make it easier to understand and disseminate.