It's a massive red flag to me when someone could get decent data to see if their thing actually works, and they don't even attempt to...
Have the LLM use your tool, run it on several of the coding benchmarks. If you're stingy, run it on the ones that don't cost much.
Otherwise, I'm going to assume it doesn't actually work. If it did - Claude, Antigravity, Codex, Pi, or some major player would bundle tools like this into the CLI / harness.
AFAIK, none of the major players do. That's a sign to me these don't work in general.
I've tried building some tools specific to bug fixing. Intelligently feeding context massively helps smaller models. But what I've found, surprisingly, is that a smaller, much better focused context, even one that includes a lot of helpful data, has almost no impact on larger models compared to what they do by default.
You do save some tokens, though, which is what they're claiming - but not ~99%...
VS Code launched it as a feature in their bundled AI functionality last month: https://code.visualstudio.com/updates/v1_121
Your suggestion to use coding benchmarks doesn't really capture the whole picture. I haven't seen a benchmark using kubectl.
> AFAIK, none of the major players do. That's a sign to me these don't work in general.
It's a lose/lose for major players. If it works well, it will lower their revenue. Also there's a high risk it'll significantly worsen results for some people, even if it improves results for others.
I don't know about cost saving, but if the goal is keeping the context size down, I've had much better results using subagents to keep the higher-level conversation clean for longer.
What would be useful:
- examples of text that can be filtered, and why that would be valuable
- a data flow diagram of runtime behavior, showing how filtering removes unnecessary context

But the one thing I expected to see in the README was an example of: takes this tool run output: XXXXXX and converts it to: XX for a savings of 40% of tokens.
This looks like a nice (and useful) project, so thanks for sharing!
Pro tip that worked well for me with response truncation: in the truncated output, say that the full text is available in /tmp/whereever.txt. That way, the LLM can query and read more using its built-in tools without reissuing the big tool call.
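A minimal sketch of that truncation tip, in case it helps to see it concretely. The function name `truncate_with_pointer`, the `max_chars` limit, and the wording of the notice are all made up for illustration; this is not how any particular harness or the project under discussion does it.

```python
import os
import tempfile
from typing import Optional


def truncate_with_pointer(output: str, max_chars: int = 2000,
                          spill_dir: Optional[str] = None) -> str:
    """Return short output unchanged; otherwise save the full text to a
    file and return a truncated view that says where the rest lives.

    Hypothetical helper: the name, the limit, and the notice format are
    assumptions, not part of any real tool's API.
    """
    if len(output) <= max_chars:
        return output

    # Write the full output to a temp file so the model can read it later
    # with ordinary file tools instead of rerunning the expensive call.
    fd, path = tempfile.mkstemp(prefix="tool_output_", suffix=".txt",
                                dir=spill_dir, text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(output)

    omitted = len(output) - max_chars
    return (
        f"{output[:max_chars]}\n"
        f"[truncated: {omitted} more characters omitted; "
        f"full output saved to {path}]"
    )
```

The point of putting the path in the returned text is that the model sees it in context and can grep or page through the file on demand, so the truncation loses nothing permanently.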
A proper benchmark will compare a large sample of identical prompting with and without the tool, against a specific harness. Once you apply Amdahl’s law, there is no way this saves 91% of tokens holistically, which the title implies.
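To make the Amdahl's law point concrete, here is a rough sketch. It assumes the filter only touches tool output, so the whole-session saving is the share of tokens that are tool output multiplied by the fraction of those the filter removes. The shares in the loop are invented for illustration, not measurements of any real session.

```python
def overall_token_savings(tool_output_share: float,
                          filter_reduction: float) -> float:
    """Fraction of ALL tokens saved when only tool output is filtered.

    tool_output_share: fraction of total tokens that are tool output (0..1).
    filter_reduction:  fraction of tool-output tokens the filter removes (0..1).
    Both parameter names are illustrative assumptions.
    """
    if not (0.0 <= tool_output_share <= 1.0):
        raise ValueError("tool_output_share must be between 0 and 1")
    if not (0.0 <= filter_reduction <= 1.0):
        raise ValueError("filter_reduction must be between 0 and 1")

    # Amdahl-style accounting: the untouched part stays at full size,
    # only the filtered part shrinks.
    untouched = 1.0 - tool_output_share
    filtered = tool_output_share * (1.0 - filter_reduction)
    return 1.0 - (untouched + filtered)


if __name__ == "__main__":
    # Hypothetical shares of tool output in a session; the filter is
    # assumed to remove 99% of whatever it touches.
    for share in (0.2, 0.5, 0.8):
        saved = overall_token_savings(share, 0.99)
        print(f"tool output share {share:.0%}: overall saving {saved:.1%}")
```

Under this model the overall saving can never exceed the tool output share, so a 91% holistic saving would need at least 91% of all tokens in the session to be filterable tool output.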
I work in a non-tech company and these sorts of things keep going viral, with no understanding and with no comprehension of what is actually going on. Engineering is gone and cargo cult magical incantations are in.