You give it a problem, then refine it: a fast, cheaper model asks you questions, and your answers produce a better input prompt. You then choose an MA strategy. For example, break the problem into sections and have a final judge conclude, or run a multi-turn debate between agents and have a judge summarise it.
The best approach is what I call 'all angles', where all these strategies run in parallel and a final meta-judge synthesises the response. The most useful part, which I recently added, is a view showing the variance in each strategy.
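A rough sketch of what 'all angles' could look like, to make the shape concrete. Everything here is an assumption for illustration: the two strategies are toy stand-ins rather than real model calls, each agent is assumed to return an answer plus a numeric score, "variance" is taken to mean the spread of those scores within a strategy, and the meta-judge just keeps the top-scoring answer from each strategy.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import pvariance
from typing import Callable

# A strategy maps a refined prompt to (answer, score) pairs, one per agent.
# Both strategies below are hypothetical stubs, not real LLM calls.
Strategy = Callable[[str], list[tuple[str, float]]]

def split_and_judge(prompt: str) -> list[tuple[str, float]]:
    return [(f"section view of: {prompt}", 0.6), (f"judge view of: {prompt}", 0.8)]

def debate(prompt: str) -> list[tuple[str, float]]:
    return [(f"pro: {prompt}", 0.9), (f"con: {prompt}", 0.3)]

def all_angles(prompt: str, strategies: dict[str, Strategy]) -> dict:
    # Run every strategy in parallel and collect the results by name.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in strategies.items()}
        results = {name: f.result() for name, f in futures.items()}
    # Per-strategy variance of agent scores: how much the agents disagree.
    variance = {name: pvariance([s for _, s in r]) for name, r in results.items()}
    # Toy meta-judge: keep the highest-scoring answer from each strategy.
    synthesis = [max(r, key=lambda pair: pair[1])[0] for r in results.values()]
    return {"variance": variance, "synthesis": synthesis}

report = all_angles("which school?", {"split": split_and_judge, "debate": debate})
print(report["variance"])
print(report["synthesis"])
```

In this toy run the debate strategy shows the larger variance, which is the kind of signal the variance view is meant to surface.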
Been using this for life stuff - housing search, schools, family challenges!
Perhaps I should make a video of it in action. If people in the HN community are interested, let me know.
I don't think a specific harness is even necessary to get a boost from 'Refine'. Even a simple custom agent is portable enough... it's easy enough to take the existing 'Plan' agent definition present in VS Code and tweak it to be 'Refine' instead.
Such drastic changes tell me that the pricing of tokens is arbitrary, and the AI business is running out of money fast.
Taking SpaceX as an example, they have increased prices across all their consumer products over the past six months. But they definitely aren't short on money with Alphabet and Anthropic combined paying them over $2 billion per month.
Microsoft/GitHub lost out here as they were just repackaging other people's products.
Rumors are worth squat when they’re most likely put in motion by the people with a vested interest in this industry.
Let’s talk about profits when there’s real data from the IPO documentation.
You can make some educated guesses and find out some limits on inferencing cost by looking at 3rd party providers on platforms like openrouter. You can get some median cost /tok for a given model size. Then make some educated guesses on SotA model sizes, and you can get an estimate on pure cost of serving a model. Error bars and all that, of course. But still a range, with some limits.
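A back-of-the-envelope version of that estimate, as a sketch only. The prices are placeholder numbers I made up, not real quotes from any provider, and scaling cost linearly with a guessed model-size ratio is a crude assumption of mine.

```python
from statistics import median

# Hypothetical third-party prices in USD per million tokens for open-weight
# models of one size class. Placeholder values, not real quotes.
provider_prices = [0.90, 1.10, 1.20, 1.50, 2.40]

def serving_cost_range(prices, size_ratio_low, size_ratio_high):
    """Scale the median open-model price by a guessed SotA-to-open size ratio.

    Assumes serving cost grows linearly with model size, which is crude.
    """
    base = median(prices)
    return base * size_ratio_low, base * size_ratio_high

# Guess: the SotA model is somewhere between 2x and 5x the open model's size.
low, high = serving_cost_range(provider_prices, 2.0, 5.0)
print(f"estimated serving cost: ${low:.2f} to ${high:.2f} per million tokens")
```

The output is a range rather than a point estimate, which matches the "error bars" caveat.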
Also, I mean, prices in general for all things are based on underlying factors; that doesn't make them arbitrary (i.e. GitHub executives using a random number generator for token pricing would be arbitrary).
I'm seeing a ratio of around 10:1 in my usage. A vast majority of the tokens consumed are on the input side. The agent will often read a million tokens just to patch one line of code.
I think if you are seeing something closer to 1:1 or more on the output side, there is either a problem with the agent or the codebase is new / empty.
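To put a number on how much the input side dominates spend at a 10:1 ratio, here is a small sketch. The prices are hypothetical (output assumed to cost 5x input per token), so treat the result as illustrative only.

```python
def input_cost_share(input_tokens, output_tokens, input_price, output_price):
    """Fraction of total spend that goes to input tokens."""
    input_cost = input_tokens * input_price
    output_cost = output_tokens * output_price
    return input_cost / (input_cost + output_cost)

# 10:1 input-to-output token ratio. Prices are hypothetical, in USD per token,
# with output priced at 5x input.
share = input_cost_share(10_000_000, 1_000_000, 3.0 / 1e6, 15.0 / 1e6)
print(f"{share:.0%} of spend is input")
```

Even with output tokens priced well above input tokens, most of the bill comes from reading under these assumptions.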
A million tokens (not cached) sounds like a lot.
I still don't understand how caching helps me very much. I must be misunderstanding it because I thought the user's prompt (which is the biggest variable) necessarily sits prior to all of these token intensive tool calls. How can we cache the reading of codebase if the prefix is always moving?
A new instruction by the user will be appended at the end if it is done in the same conversation. Thus it only influences the cacheability of the original agent prompt, not of subsequent tool calls.
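A toy illustration of why appending keeps the prefix stable. This is a sketch under simplifying assumptions: messages stand in for token blocks, and "cacheable" is taken to mean the longest shared prefix between two requests. Real providers cache at token granularity with their own rules.

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading messages that are identical in both requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# First request: the agent prompt, the user's task, and the tool calls it triggered.
turn1 = [
    "system prompt",
    "user: fix the bug",
    "tool: read file A",
    "tool: read file B",
    "assistant: patched",
]

# Same conversation: the new instruction is appended, so all of turn1 is reusable.
turn2 = turn1 + ["user: now add a test"]
reusable = common_prefix_len(turn1, turn2)

# New conversation with a different opening prompt: only the system prompt matches.
restarted = ["system prompt", "user: fix the bug and add a test"] + turn1[2:]
reusable_if_restarted = common_prefix_len(turn1, restarted)

print(reusable, reusable_if_restarted)
```

The expensive file reads sit inside the shared prefix in the first case and fall outside it in the second.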
Has AI forgotten about high-level design? Surely all it needs to know is what the methods, objects or functions in the codebase actually do, plus the actual code it is meant to be fixing?
I wonder if half the issue is that the LLM tries to change too much?
But, does every prompt need the entire codebase?
If you want a different kind of dynamic testing besides unit tests, have you tried writing it in as a requirement during the planning/PRD phase?
Their interests are often not your interests. In this case they want you to spend unnecessary money on useless work (let's stop the euphemism of "tokens" btw).
/s
[1] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Stee...
I wonder what hyperscaled compute farms and models will be good for at that running cost when most AI needs can be fulfilled by on-prem and on-device hardware and models. Probably the only customers left are big governments. So in the end the taxpayer has to pay for those billions of investments by the AI cartel.
Also, smaller models can obviously be used but a smaller model will be a lot weaker in real-world knowledge and this tends to limit their smarts in a way that can't be compensated by more thinking.
Was in a meeting reviewing a potential new product, it was going well until they showed us that they had added AI to it (of course they have). It was pretty obviously just shoehorned in, and one part of that obviousness was that they had a column that showed how many tokens it took to make each query.
I asked who is paying for the tokens; they said it's included in the license. I said, so is there a budget or is it all you can eat? They said good question, they didn't know and would get back to me. I said the reason I asked was that just one query there had a 250k token burn on it, and it was a fairly simple query about one device.
Then one of the execs on their side was heard saying out loud, "Why are we even showing this to the customers?"
It gave us quite a chuckle. But lesson learned... the cost of adding AI to anything isn't really being accounted for, let alone the true cost of actually running the AI.
All things AI are going to get more expensive, even if you don't want the AI aspect.
Maybe soon companies will look at how engineers can optimize the token efficiency of AI.
Now that we have pretty decent open source models, anyone can create a new business to supply more tokens. Sure there’s short term scarcity: energy, GPUs, cooling, but this is a scale up problem. More token demand = more data center build = more energy plant build. This downward pressure will also keep frontier private model prices in check.
Differentiation seems to be happening at the harness level, whereby we can expect token spend to be a metric to compete on and drive down for the customer (at least hoping tools in the application space don’t continue token based billing as their primary revenue stream).
These are not short term hyper growth forces, but a fundamental alignment of incentives.
But we’re seeing lots of open weight models that are either pretty close to SotA, or more importantly, perfectly capable of doing all the low-level, token-intensive grunt work when writing code. Pair them with SotA models for long-horizon task management, and you’ve got a very cost-effective system.
The frontier labs have put little effort into cost-efficient inference; they don’t need to. But folks like DeepSeek clearly are, and have achieved some impressive cost improvements. Given DeepSeek’s models give you 70% of the capabilities for 30% of the cost, expect people to start moving lots of workloads to providers that offer cheap inference for open models, and huge competition to appear to provide that cheap inference. It’s truly commodity LLM inference.
In turn, expect more companies to focus on building inference-efficient models, because someone who can build a model that provides 70% of SotA capabilities for 10% of the token cost immediately eats up huge amounts of the available inference market.
Another factor in all this is that it’s becoming increasingly clear that building custom agents/workflows for LLMs to operate in is required to get the best out of these models. That means people are implicitly building the infra needed to use multiple model types and evaluate workflow performance end-to-end. Which in turn means they have everything they need to plug in future, cheaper inference providers and quickly evaluate whether they can change their model provider.
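A minimal sketch of what that swap-and-evaluate loop might look like. The providers here are toy functions I invented in place of real model calls, and the pass-rate metric is just one possible choice of end-to-end check.

```python
from typing import Callable

# A provider is anything that maps a prompt to a completion.
# Both providers below are hypothetical stubs, not real APIs.
Provider = Callable[[str], str]

def current_provider(prompt: str) -> str:
    return prompt.upper()

def cheaper_provider(prompt: str) -> str:
    # Toy weakness: only handles short prompts correctly.
    return prompt.upper() if len(prompt) < 6 else prompt

def pass_rate(provider: Provider, cases: list[tuple[str, Callable[[str], bool]]]) -> float:
    """Run every (prompt, check) case through the provider and score it."""
    passed = sum(1 for prompt, check in cases if check(provider(prompt)))
    return passed / len(cases)

cases = [
    ("hello", lambda out: out == "HELLO"),
    ("goodbye", lambda out: out == "GOODBYE"),
]

current_score = pass_rate(current_provider, cases)
cheaper_score = pass_rate(cheaper_provider, cases)
print(current_score, cheaper_score)
```

Because the workflow only depends on the `Provider` shape, trying a new provider is a one-line change followed by a rerun of the same cases.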
In the other direction models continue to grow larger, new customers continue to arrive, and existing customers continue to find ever more creative ways to burn large quantities of tokens as the prices fall.
I doubt anyone can say with certainty where the equilibrium will be 1 or 5 years from now largely because (among many other things) it's impossible to predict how much of the current economy AI will end up eating. In general though the third party providers of open weights models are probably the most reliable data source available since they have little to no incentive to subsidize usage.
To bet against that, you need to assume exponentially more expensive models every year.
I don't think a lot of people know this, but a cluster of GPUs can serve multiple clients without much of a drop in performance. I.e., worst case, you band together with 6-16 people to run a 2-3 H100 server hosting DeepSeek V4 Flash, or 4-6 people to run Pro, and you get the same performance as if you ran it alone. This means a lot of companies can afford to throw 50-100k into their own LLM server cluster.
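Quick arithmetic on the split, as a sketch. Combining the 50-100k hardware figure with the 6-16 person group size is my own assumption, and this ignores power, hosting and maintenance.

```python
def per_person_cost(cluster_cost_usd: float, members: int) -> float:
    """Even split of the up-front hardware cost across the group."""
    return cluster_cost_usd / members

# Cheapest case: 50k cluster shared by 16 people.
best = per_person_cost(50_000, 16)
# Priciest case: 100k cluster shared by 6 people.
worst = per_person_cost(100_000, 6)

print(f"per person: ${best:,.0f} to ${worst:,.0f} up front")
```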
We're at a price point where if you push it further people will move. There's no real vendor lock-in: your agent config, skills, MCP servers etc. are all reusable with other models and harnesses, so unless you get all providers to collude on a price hike, you risk an exodus of customers.
You're welcome! =)
So what? Terms are reused in different contexts all the time. And most people have moved on from cryptocurrencies anyway, so there’s little chance it’ll confuse anyone.