I've found that adopting RFC Keywords (e.g. RFC 2119 [1]; MUST, SHOULD, MAY) at least makes the LLM report satisfaction. I'd love to see a proper study on the usage of RFC keywords and their effect on compliance and effectiveness.
You are missing another dimension how easy it would be to migrate if adding new feature hits a ceiling and LLM keeps breaking the system.
Imagine all tests are passing and code is confirming the spec, but everything is denormalized because LLM thought this was a nice idea at the beginning since no one mentioned that requirement in the spec. After a while you want to add a feature which requires normalized table and LLM keeps failing, but you also have no idea how this complex system works.
Don't forget that very very detailed spec is actually the code
The constant urge I have today is for some sort of spec or simpler facts to be continuously verified at any point in the development process; Something agents would need to be aware of. I agree with the blog and think it's going to become a team sport to manage these requirements. I'm going to try this out by evolving my open source tool [1] (used to review specs and code) into a bit more of a collaborative & integrated plane for product specs/facts - https://plannotator.ai/workspaces/
I also tend to find especially that there's a lot of cruft in human written spec languages - which makes them overly verbose once you really get into the details of how all of this works, so you could chop a lot of that out with a good spec language
I nominate that we call this completely novel, evolving discipline: 'programming'
Shame us all for moving away from something so perfect, precise, and that "doesn't have edge cases."
Hey - if you invent a programming language that can be used in such a way and create guaranteed deterministic behavior based on expressed desires as simple as natural language - ill pay a $200/m subscription for it.
In ancient times we had tech to do exactly that: Programming languages and tests.
It all failed. For a simple reason, popularized by Joel Spolsky: if you want to create specification that describes precisely what software is doing and how it is doing its job, then, well, you need to write that damn program using MS Word or Markdown, which is neither practical nor easy.
The new buzzword is "spec driven development", maybe it will work this time, but I would not bet on that right now.
BTW: when we will be at this point, it does not make sense anymore to generate code in programming languages we have today, LLM can simply generate binaries or at least some AST that will be directly translated to binary. In this way LISP would, eventually, take over the world!.
In the new world of mostly-AI code that is mostly not going to be properly reviewed or understood by humans, having a more and more robust manifestation and enforcement, and regeneration of the specs via the coding harness configuration combined with good old fashioned deterministic checks is one potential answer.
Taken to an extreme, the code doesn’t matter, it’s just another artifact generated by the specs, made manifest through the coding harness configuration and CI. If cost didn’t matter, you could re-generate code from scratch every time the specs/config change, and treat the specs/config as the new thing that you need to understand and maintain.
“Clean room code generation-compiler-thing.”
Is it? All the electricity and capital investment in computing hardware costs real money. Is this properly reflected in the fees that AI companies charge or is venture capital propping each one up in the hope that they will kill off the competition before they run out of (usually other people's) money?
So something which must be true if this author is right is that whatever the new language is—the thing people are typing into markdown—must be able to express the same rigor in less words than existing source code.
Otherwise the result is just legacy coding in a new programming language.
And this is why starting with COBOL and through various implementations of CASE tools, "software through pictures" or flowcharts or UML, etc, which were supposed to let business SMEs write software without needing programmers, have all failed to achieve that goal.
I think it's an open question of whether we achieve the holy grail language as the submission describes. My guess is that we inch towards the submission's direction, even if we never achieve it. It won't surprise me if new languages take LLMs into account just like some languages now take the IDE experience into account.
Yes but also no. Writing source means rigorously specifying the implementation itself in deep detail. Most of the time, the implementation does not need to be specified with this sort of rigor. Instead the observable behavior needs to be specified rigorously.
I suppose all the money floating around AI helps dummify everything, as people glom on to narratives, regardless of merit, that might position them to partake.
What we actually have now is the ability to bang out decent quality code really fast and cheaply.
This is massive, a huge change, one which upends numerous assumptions about the business of software development.
...and it only leaves us to work through every other aspect of software development.
The approach this article advocates is to essentially pretend none of this exists. Simple, but will rarely produce anything of value.
This paragraph from the post gives you the gist of it:
> ...we need to remove humans-in-the-loop, reduce coordination, friction, bureaucracy, and gate-keeping. We need a virtually infinite supply of requirements, engineers acting as pseudo-product designers, owning entire streams of work, with the purview to make autonomous decisions. Rework is almost free so we shouldn’t make an effort to prevent incorrect work from happening.
As if the only reason we ever had POs or designers or business teams, or built consensus between multiple people, or communicated with others, or reviewed designs and code, or tested software, was because it took individual engineers too long to bang out decent code.
AI has just gotten people completely lost. Or I guess just made it apparent they were lost the whole time?
To me what AI is doing is changing the economics of human thought, but the change is happening way faster than individuals, let along organizations can absorb the implications. What I've seen is that AI magnifies the judgment of individuals who know how to use it, and so far it's mostly software engineers who have learned to use it most effectively because they are the ones able to develop an intuition about its limitations.
The idea of removing the human from the loop is nonsense. The question is more what loops matter, and how can AI speed them up. For instance, building more prototypes and one-off hacky tools is a great use of vibe coding, changing the core architecture of your critical business apps is not. AI has simultaneously increased my ability to call bullshit, while amplifying the amount of bullshit I have to sift through.
When the dust settles I don't really see that the value or importance of reading code has changed much. The whole reason agentic coding is successful is because code provides a precise specification that is both human and machine readable. The idea that we'll move from code to some new magical form of specification is just recycling the promise of COBOL, visual programming, Microsoft Access, ColdFusion, no-code tools, etc, to simplify programming. But actually the innovations that have moved the state of the art of professional programming forward, are the same ones that make agentic coding successful.
The point I’m making is that we give the spotlight to people who are making absurd claims. We have not achieved the ability to remove the human from the loop and continually produce value-able outputs. Until we do, I don’t see how any of the claims made in this article are even close to anything more than simply gate-keeping slop.
React team seems to really have set a precedent with their "dangerouslySetInnerHTML" idea.
Or did they borrow it somewhere?
I'm just curious about that etymology, of course the idea is not universally helpful: for example, for dd CLI parameters, it would only make a mess.
But when there's a flag/option that really requires you to be vigilant and undesired the input and output and all edge cases, calling it "dangerous" is quite a feat!
Technology, implementation may change, but general point of "why!?" stays.
I might even start my own blog to write about things I've found.
1. Always get the agent to create a plan file (spec). Whatever prompt you were going to yolo into the agent, do it in Plan Mode first so it creates a plan file.
2. Get agents to iterate on the plan file until it's complete and thorough. You want some sort of "/review-plan <file>" skill. You extend it over time so that the review output is better and better. For example, every finding should come with a recommended fix.
3. Once the plan is final, have an agent implement it.
4. Check the plan in with the impl commit.
The plan is the unit of work really since it encodes intent. Impl derives from it, and bugs then become a desync from intent or intent that was omitted. It's a nicer plane to work at.
From this extends more things: PRs should be plan files, not code. Impl is trivial. The hard part is the plan. The old way of deriving intent from code sucked. Why even PR code when we haven't agreed on a plan/intent?
This process also makes me think about how code implementation is just a more specific specification about what the computer should do. A plan is a higher level specification. A one-line prompt into an LLM is the highest level specification. It's kinda weird to think about.
Finally, this is why I don't have to read code anymore. Over time, my human review of the code unearthed fewer and fewer issues and corrections to the point where it felt unnecessary. I only read code these days so I can impose my preferences on it and get a feel for the system, but one day you realize that you can accumulate your preferences (like, use TDD and sum types) in your static prompt/instructions. And you're back to watching this thing write amazing code, often better than what you would have written unless you have maximum time + attention + energy + focus no matter how uninteresting the task, which you don't.
As I understand, this is an unsolved problem.
But that aside, it's such a shame that many drinking the AI Kool-Aid aren't even aware of the theoretical limits of a computer's capabilities.
Computers are finite machines. There is a theorem that although a machine with finite memory can add, multiplication requires unbounded memory. Somehow we muddle along and use computers for multiplication anyway.
More to your point there is a whole field of people who write useful programs using languages in which every program must be accompanied by a proof that it halts on all inputs.
(See for example https://lean-lang.org/ or David Turner's work on Total Functional Programming from about 20 years ago.)
Other examples are easy to find. The simplex algorithm for linear optimization requires exponential time in general, and the problem it solves is NP-hard, but in practice works well on problems of interest and is widely used. Or consider the dynamic programming algorithms for problems like subset-sum.
Theory is important, but engineering is also important.
What theorem is that?
The multiplication of any two integers below a certain size (called "words") fits in a "double word" and the naive multiplication algorithm needs to store the inputs, an accumulator and at most another temporary for a grand total of 6*word_size
Sure, you can technically "stream" carry-addition (which is obvious from the way adders are chained in ALU-101) and thus in a strict sense addition is O(1) memory but towards your final point:
> Theory is important, but engineering is also important.
In practice, addition requires unbounded memory as well (the inputs). And it's definitely compute-unbounded, if your inputs are unbounded.
I dislike the term "we muddle along". IEEE 754 has well specified error bars and cases, and so does all good data science. LLMs do not, or at least they do not expose them to the end user
So then, how exactly do we go about proving that the result of chaining prompts is within a controllable margin of error of the intended result? Because despite all the specs, numerical stability is the reason people don't write their own LAPACK.
LLMs address this problem by just making things up (and they don't do a great job of comprehending the natural language, either), which I think qualifies as "hoping for the best", but I'm not sure there is another way, unless you reframe the problem to allow the algorithm to request the information it's missing.
"Somewhat easier than an impossible task" is not a particularly strong claim about when (or whether) this problem will be solved, though.
Those were written by humans, and don't involve unsolved mathematics.
Is your claim tht you just need to solve comprehensibility of LLMs?
Figuring out epistemology and cognition to have a chance to reason about the outputs of a LLM seems to me way harder that traditional attempts to reason directly about algorithms.
"is this implementation/code actually aligned with what i want to do?"
humanic responsibility's focus will move entirely from implementing code to deciding whether it should be implemented or not.
u probably mean unsolved as in "not yet able to be automated", and that's true.
if pull-request checks verifying that tests are conforming to the spec are automated, then we'd have AGI.
Having the code-writing part automated would have a negligible impact on the total project time.
No, thank you
LLMs do not understand prose or code in the same way humans do (such that "understand" is misleading terminology), but they understand them in a way that's way closer to fuzzy natural language interpretation than pedantic programming language interpretation. (An LLM will be confused if you rename all the variables: a compiler won't even notice.)
So we've built a machine that makes the kinds of mistakes that humans struggle to spot, used RLHF to optimise it for persuasiveness, and now we're expecting humans to do a good job reviewing its output. And, per Kernighan's law:
> Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?
And that's the ideal situation where you're the one who's written it: reading other people's code is generally harder than reading your own. So how do you expect to fare when you're reading nobody's code at all?
say: human wants to make a search engine that money for them.
1. for a task, ask several agents to make their own implementation and a super agent to evaluate each one and interrogate each agent and find the best implementation/variable names, and then explain to the human what exactly it does. or just mythos
2. the feature is something like "let videos be in search results, along with links"
3. human's job "is it worth putting videos in this search engine? will it really drive profits higher? i guess people will stay on teh search engine longer, but hmmm maybe not. maybe let's do some a/b testing and see whether it's worth implementing???" etc...
this is where the developer has to start thinking like a product manager. meaning his position is abolished and the product manager can do the "coding" part directly.
now this should be basic knowledge in 2026. i am just reading and writing back the same thing on HN omds.
LOL. I had to check if this was published on April 1st.
user experience/what the app actually does >>> actually implementing it.
elon musk said this a looong time ago. we move from layer 1 (coding, how do we implement this?) to layer 2 thinking (what should the code do? what do we code? should we implement this? (what to code to get the most money?))
this is basic knowledge
We need the pragmatic engineer more than ever.