Seriously, it's just not
Write your code like it's your spec and your software will be more stable, more maintainable, and clearer to read.
Code is not transient, it is your friggin spec itself
And if your code isn't structured like it's a spec, then your code is garbage from the perspective of LLM driven development
To me saying "the code is the spec" is like saying "the business wants it this way because that's how the code is written". Which is obviously backwards.
Does the business mandate we use a cache for this hot path? No, but the business set performance targets, and the cache was a sensible way to satisfy them. See the difference?
I believe that the 'musts' and 'must nots' deserve special attention, and need to be recorded well before I decide on the 'how'. Every team does this differently. I find that writing itemized, functional acceptance criteria is a practical way to marry the two domains. I also think the process matters a lot more now, because the temptation to let an agent ship it is increasing and the tedium of maintaining these specs is decreasing.
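For example (an invented illustration, not from any real project):

- MUST lock the account after five consecutive failed login attempts
- MUST NOT store credentials in plaintext
- SHOULD return a retry-after hint when rate limited

The musts capture what the business needs; the how (lockout table, hashing library, rate limiter) stays a separate, revisable decision.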
Remember that a computer program is nothing but a spec for how the machine should operate
When you want to build a bridge, you finalize all the blueprints and then someone goes and actually pours concrete. In software, the blueprint is the code, and the code is also the bridge.
However, there are different levels of abstraction for writing specs, and code is just the most explicit form. With LLMs, more of our time can be spent at those higher levels of abstraction, freeing us from work that is often repetitive and mundane.
I think the (distant) future of software engineering is not code writing but mostly requirements writing, and so it makes sense to build frameworks, “IDEs”, etc. around this new form of “programming”.
I don’t know if ACAI is the right one but the direction is interesting.
Construction has plans "as designed" and "as built".
> "we never said this would work for everyone, it depends on the effort you put into it!"
it's perfect for extending the lifespan of a scam, because yes, obviously it's true that in general you can improve results by working harder, being more disciplined, whatever...
same here, functionally. of course there are better and worse ways of writing a spec, a prompt, etc. but i honestly think a lot of the focus on this is a way to divert attention from the overall ceiling of these tools in general.
in other words yes, it's not all bullshit, but there's a huge aspect of this "prompt/spec engineering" that in my view is a way to unconsciously buttress the "LLMs are going to 10x our GDP" mindset -- if those high expectations aren't panning out, there's always:
> "we never said this would work for everyone, it depends on the effort you put into the spec!"
> My point is, the spec must live somewhere, even if you don’t write it down. The spec is what you want the software to be. It often exists only in your head or in conversations. You and your team and your business will always care what the spec says, and that’s never going to change. So you’re better off writing it down now! And I think that a plain old list of acceptance criteria is a good place to start. (That’s really all that `feature.yaml` is.)
The most common form of what you'd call a "spec" is the acceptance criteria on a work ticket, which is an accretive spec, i.e. a description of desired change: "given what already exists, change it as follows". If you somehow layered and summarized and condensed all the tickets that have been created since the product started, you'd have your "spec".
But it's the devs who were doing that condensing, by understanding each desired spec addition vs the reality of the existing codebase.
So the gap between what people are currently calling "specs" and what the code was already doing is not big and will not stay big, but for the fact that you're effectively adding another (quasi) compile step underneath - and in this case it's a non-deterministic one.
Checking the compiled artefact into the codebase without checking in its source code has always been a risky move!
We iterate feature by feature through this process, and occasionally circle back on the original product manual to identify drift.
After the original documentation is drafted, I have the agent write up placeholder files and define all of the interfaces we expect to need (we will end up adding a lot later, but that's ok). Every file should reflect a clear separation of concerns, and can only be reached into through its defined interface; all else is private. I end up with more individual files than I would by hand, but by constraining scope at file granularity, and defining an inviolate interface per file, I avoid the LLM tendency to take shortcuts that create unmaintainable code.
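A minimal sketch of the shape I mean (Rust here, since a file is a module and file-level privacy falls out of the language; all names are invented):

    // orders.rs - one concern per file; the `pub` items are the interface.

    pub struct OrderId(pub u64);

    #[derive(Debug)]
    pub enum OrderError {
        EmptyOrder,
    }

    // The only sanctioned way for other files to reach in.
    pub fn place_order(customer_id: u64, items: &[u64]) -> Result<OrderId, OrderError> {
        validate(items)?;
        Ok(persist(customer_id, items))
    }

    // Private helpers: callers can't depend on these, so the agent is
    // free to refactor them without breaking the rest of the codebase.
    fn validate(items: &[u64]) -> Result<(), OrderError> {
        if items.is_empty() {
            Err(OrderError::EmptyOrder)
        } else {
            Ok(())
        }
    }

    fn persist(customer_id: u64, _items: &[u64]) -> OrderId {
        OrderId(customer_id) // stand-in for real storage
    }

In languages without file-level visibility you'd enforce the same rule by convention or a linter.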
I also open each new context with an onboarding process that briefly describes the logos and the ethos of the project, why the agent should be deeply invested in the success of the project, as well as learnings.md which the agent writes as it comes across notable gotchas or strong preferences of mine.
Needless to say, I use the one-million-token context, and it's a token fire… but the results are solid and my productivity is 5-10x.
“Specsmaxxing” is basically the right response to this. When you can't rely on authorial memory, you have to put the intent somewhere durable. Specs become the source of truth by default if we continue down the road of AI generated code.
1: https://ossature.dev/blog/ai-generated-code-has-no-author/
It allows Claude to look back into the session where a change was made and see the decisions made, tradeoffs discussed, and other history not captured by code or tests.
Did I miss something or is everyone back in 1970s, working in waterfall processes now?
You don't plan to follow the plan. You plan in order to understand the whole problem space. Obviously no plan survives contact with reality.
Another point of view is that LLMs perform, to an extent, on the same level as outsourcing does. This interface requires a bit more contract mass than doing everything within a single team.
"We do agile," they all said.
Guess what? Every single one of them was doing waterfall.
Their agile included preplanning and pre-specifying the full spec and each task before the project kicked off. We'd have meetings where we'd drill down into tasks, and folks would write them down in such detail that there was no other way to do them. Agile would be claimed, but the start date, end date, final spec, and number of developers were always fixed.
Sometimes the end date was too late, so a panic would ensue. Most of the time, the date was too late because developers had "unknowns", which then had to be "drilled down and specced so they wouldn't be unknowns". Sometimes nearly 50% of the workweek was spent in meetings.
A few times, a project was running late - so to make sure we were _really_ doing it agile, we'd have morning standups, evening standups, weekly plannings, retrospectives, and backlog refinement. It would waste time, and the "unknowns", aka "tickets to refine", were again, as always, dependent upon the PM/PO/CEO's wishes, which wouldn't get crystallized until the _really last minute_.
One customer wanted us to do a 2 year agile plan on building their product. We had gigantic calls with 20+ people in them, out of which at least half had some kind of "Agile SCRUM Level 3 Black belt Jirajitsu" certificates.
To them, Agile was just a thing you say before you plan things. Agile was an excuse to deal with a project being late by pinning it on Agile. Agile was a cop-out for "the PM didn't know what to do here, so he didn't write anything down". Agile was a "we are modern and cool" sticker for a company.
And unfortunately, to most of them, agile was just a thing you say for the job: their minds worked in waterfall mode, their obligations worked in waterfall mode, their companies worked in waterfall mode, and if they failed their obligation to the waterfall, their job would be on the line.
So while we were doing the Agile ceremonies, prancing around with our Scrum master hats, using the right words to fit into the Agile™ worldview - we were doing waterfall all along.
And after 15 years, I'm not even sure - did agile really ever exist?
When rewriting the entire codebase is very quick and cheap, why bother iterating on small components?
We are nowhere near this scenario, tbh. Token cost is very high and is currently heavily subsidized by VC money to gain market share. Also, this realistically only applies to small projects and codebases, mostly greenfield ones. No way you can rewrite the whole codebase quickly and cheaply in any mid-sized+ project.
But even assuming token cost plummets, any non-trivial piece of software that is valuable enough to generate income for the company is also big, complex, and interconnected enough that it cannot be rewritten quickly even by AI - and for business reasons too. If a piece of code works, is stable, and is tested, then rewriting it will always bring a high degree of risk and uncertainty that in a lot of business-critical applications is just not worth it. A stable system can stay untouched for years besides minor dependency updates.
distributed teams do well when proposals, decisions, etc., are written down, and can be easily found and referenced
it doesn't mean docs are frozen in time and can't be patched like code
I've been doing "specmaxxing" for a few months now. Unlike the author, I don't use YAML; I use a mix of Markdown and Gherkin. If you haven't encountered Gherkin before, it's not new and you might know it under the name Cucumber or BDD.
Gherkin is basically a structured form of English that can be fed into a unit testing framework to match against methods.
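A (made-up) scenario looks roughly like this:

    Feature: Account lockout
      Scenario: Too many failed logins
        Given a user with 4 recent failed login attempts
        When the user submits a wrong password
        Then the account is locked

Each Given/When/Then line is matched to a step method in the test code, which is what makes the spec executable.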
The nice thing about writing acceptance criteria this way is that they become executable and analyzable. You write some Gherkin and then ask the model to make the tests execute and pass. Now in a good IDE (IntelliJ has good support) you can run the acceptance criteria to ensure they pass, navigate from any specific acceptance criteria to the code which tests it (and from there to the code that implements it), you can generate reports, integrate it into CI and so on.
And when writing out acceptance tests that are quite similar, the IDE will help you with features like auto-complete. But if you need something that isn't implemented in the test-side code yet, no big deal. Just write it anyway and the model will write the mapping code.
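To make the "mapping code" concrete, here's a minimal sketch of what the step bindings can look like, using the Rust `cucumber` crate as one example (the same mechanism exists for the JVM, JS, etc.; the login domain and names are invented, and it assumes `cucumber` and `tokio` as dev-dependencies):

    use cucumber::{given, then, when, World};

    // Shared state threaded through the steps of one scenario.
    #[derive(Debug, Default, World)]
    struct LoginWorld {
        failed_attempts: u32,
        locked: bool,
    }

    #[given(expr = "a user with {int} recent failed login attempts")]
    async fn user_with_failures(w: &mut LoginWorld, n: u32) {
        w.failed_attempts = n;
    }

    #[when("the user submits a wrong password")]
    async fn wrong_password(w: &mut LoginWorld) {
        w.failed_attempts += 1;
        w.locked = w.failed_attempts >= 5; // stand-in for the real login service
    }

    #[then("the account is locked")]
    async fn account_locked(w: &mut LoginWorld) {
        assert!(w.locked);
    }

    #[tokio::main]
    async fn main() {
        // Runs every .feature file found under tests/features.
        LoginWorld::run("tests/features").await;
    }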
There's a variant of Gherkin specifically designed for writing UI tests for web apps that also looks quite interesting. And because it's an old ecosystem there's lots of tooling around it.
Another thing I've found works well is asking the models to review every spec simultaneously and find contradictions. I've built myself a tool that does this and highlights the problems as errors in IntelliJ, like compiler errors. So I can click a button in the toolbar and then navigate between paragraphs that contradict each other. It's like a word processor but for writing specs.
Once you're doing spec driven development, you don't need to write prompts anymore. Every prompt can just be "Update the code and tests to match the changes to the specs."
> I use a mix of Markdown and Gherkin
Gherkin also has a Markdown based syntax that is not well known:
https://github.com/cucumber/gherkin/blob/main/MARKDOWN_WITH_...
I prefer that to the 'verbose' original syntax. MDG also renders nicely in code forges.
The general idea of "readable specification language" was an inspired one but it failed on execution - it has gnarly syntax, no typing and bad abstractions.
This results in poor tests which are hard to maintain and diverge between being either too repetitive to be useful or too vague to be useful.
The ecosystem is big but it's built on crumbling foundations which is why when most people used it most of them got frustrated and gave up on it.
Annoyingly there's a certain amount of gaslighting around it too ("it didnt work for you coz you werent using it correctly") which is eleven different kinds of wrong.
Unlike you, I wish for the LLM to do as much of the work as possible -- but "as possible" is doing a lot of work in that sentence. I'm still trying to get clear on exactly where I am needed and where Opus and iterations will get there eventually.
It has really challenged me to get clearer on what a requirement is vs a constraint (e.g., "you don't get to reinvent the database schema, we're building part of a larger system"). And I still battle with when and how to specify UI behaviours: so much UI is implicit, and it seems quite daunting to have to specify so much to get it working. I have new respect for whoever wrote the undoubtedly bajillion tests for Flutter and other UI toolkits.
1. Specifications that live outside the code. We have a lot of code for which "what should this do?" is a subjective answer, because "what was this written to do?" is either oral legend or lost in time. As future Claude sessions add new features, this is how Claude can remember what was intentional in the existing code and what were accidents of implementation. And they're useful for documenters, support, etc.
2. Specifications that stay up to date as code is written. No spec survives first contact with the enemy (implementation in the real world). "Huh, there are TWO statuses for Missing orders, but we wrote this assuming just one. How do we display them? Which are we setting or is it configurable?" etc. Implementer finds things the specifier got wrong about reality, things the specifier missed that need to be specified/decided, and testing finds what they both missed.
I have a colleague working on saving architecture decisions, and his description of it feels like a higher-abstraction version of my saving and maintaining requirements.
My recursive-mode workflow handles all of that and more and gives you full traceability: https://recursive-mode.dev/introduction
I am also stealing the idea of talking to LLMs as if it's an email. So funny, we need to be joymaxxing a bit more I think :)
You probably don't want people associating your work with abusing crystal meth and hitting yourself in the face with a hammer.
For anyone missing the reference, SNL has a pretty good explainer:
I don't remember how much he made selling the "quickly-killed startup bought by now failed internet giant".
But I'm fairly sure how much money he made building something other people used is peanuts compared to what he made investing early in companies where others built things, instead of him.
It's why, famously, programmers always say the code is the documentation: writing detailed docs is very tedious and nobody wants to do it.
Behaviour Driven Development or Spec Driven Development are, loosely, forms of Test Driven Development where you encode the specification into the code base. No impedance, full insight, formality through code.
I think people get really dogmatic about “test” projects, but with a touch of effort a unit test harness can be split up into integration tests, acceptance tests, and specification compliance tests. Pull the data out as human readable reports and you have a living, verifiable, specification.
Particularly using something comparable to back-ticks in F#, which let test names be defined with spaces and punctuation (i.e. "fulfills requirement 5.A.1, timeouts fail gracefully on mobile"), you can create specific layers of compiled, versioned, and verifiable specification baked into the codebase and available to PMs, testers, clients, and approval committees.
One thing though: I loved the "AUTH-1" numbering, but the YAML breaks that into an Auth section with a "1." subsection, which I don't like nearly as much; the codification "AUTH-1" is more referenceable/searchable.
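Roughly the difference between these two shapes (both snippets invented, not the article's exact schema):

    # flat codes: grep "AUTH-1" finds the spec and every code tag that cites it
    - AUTH-1: lock the account after 5 failed login attempts
    - AUTH-2: never store plaintext passwords

versus

    # nested sections: there is no single token to search for
    auth:
      1: lock the account after 5 failed login attempts
      2: never store plaintext passwords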
Second is that I'm doing a lot less "seat of my pants prompting" and doing more engineering and ideating, which was a big goal of mine. So I'm feeling less psychotic there too.
And sort of tangentially to that, I think a significant subset of devs actually are willing to just prompt their way to nirvana, day in and day out. I'm not. I think the spec will carry a lot of weight for a long time. Maybe they will get further than I give them credit for? Maybe the whole digital world becomes a single chat box?
> Nothing beats an organic, pasture-raised, hand-written spec.
Hah, I strongly empathize with the wording. I’ve been starting my design docs for fellow humans with “100% hand-written, organic content”, I might steal a part of yours.
Overall, cool idea. I don’t see myself using your SaaS, but the approach of tagging the requirements and constraints to make them easier to find sounds good.
One project you didn’t mention, which I think is also a cool perspective on this, is codespeak.dev, but I haven’t given it a go yet.
All in all, I feel like maintaining specs, and having agents translate spec diffs into code diffs is a promising area for the future. Good thing I enjoy writing!
Also, I mainly pursue these tools so that I can have AI accelerate this process and broker an agreement after negotiating specs with the agent.
The one thing I like that OP brings is to tie specs and code together. The openspec flow does help a lot in keeping code synced with specs, but when a spec changes, AI needs to find the relevant code to change it. It's pretty easy to miss something in large codebase (especially when there is lots of legacy stuff).
Being able to search for numbered spec tags to find relevant bits of code makes it much more likely to find what needs to be changed (and probably with less token use too).
https://haskellforall.com/2026/03/a-sufficiently-detailed-sp...
also, i wonder if people who did MDD (model driven development) have embedded AI in their methodology
If the specification is written in such a strict format as YAML, I would expect it to be executable, something like this https://blog.fooqux.com/blog/executable-specification/
But as far as I understood, for acai that is not the case.
An executable spec like gherkin or hitchstory is config - it has no loops or conditionals. There are a number of rarely recognized benefits to this.
People do that? Actual professionals?
If you're genuinely confused, and haven't tried Opus for coding, then it's not surprising you're confused!
It is also okay for you to just not like the idea of LLMs for coding (but say that!).
This seems like the answer to that thought!
A full-blown event model facilitates all communication, human (management, devs, ops) and agentic. But maybe I’m missing something - maybe the dashboard can have this function; I didn’t dig into it too much.
I have seen the same idea with processes, pipelines, lists, bullet points, jsons, yamls, trees, prioritization queues all for LLM context and instruction alignment. It's like the authors take the structure they are familiar with, and go 100% in on it until it provides value for them and then they think it's the best thing since sliced bread.
I would like, for once, to see some kind of exploration/ablation against other methods. Or even better, a tool that uses your data to figure out your personal bias and structure preference for writing specs, so that you can have a way of providing yourself value.
"Don't write prompts like that, do it like this! I swear it's better. Claude says so!"
Otherwise, I like the idea of machine-readable specs.
I wanted to star the project to track the progress, but it feels a bit weird… Which repo should I track? Server? CLI? It sounds like a grab bag of miscellaneous repos.
    @spec LINT_COMMAND.ORPHAN_VERIFIES
    linter reports blocks that do not attach to a supported owned item.

Then:

    #[test]
    // @verifies SPECIAL.LINT_COMMAND.ORPHAN_VERIFIES
    fn rejects_orphan_verifies_blocks() {
        let block = block_with_path("src/example.rs", &["@verifies EXPORT.ORPHAN"]);
        let parsed = parse_current(&block);
        assert!(parsed.verifies.is_empty());
        assert_eq!(parsed.diagnostics.len(), 1);
        assert!(
            parsed.diagnostics[0]
                .message
                .contains("@verifies must attach to the next supported item")
        );
    }

And then the CLI command “special specs” pulls your specs and all attached verification + test code so you (or your LLM) can analyze whether the (hopefully passing!) test actually supports the product claim.
There’s also a bunch of other code quality commands and source annotations in there for architectural design & analysis, fuzzy-checking for DRY opportunities, and general codebase health. But on the overall principle, this article is dead-on: when developing with LLMs, your source of truth should be in your code, or at least co-located with it.
Don't we just love the hard fact conclusions based on sample size N=1 and hand-waving arguments?
1. Don’t write in yaml. It’s really hard for humans. Write in markdown and use a standard means to convert to lists / yaml.
2. Think beyond writing your own specs - how does this expand to teams of tens or more? The ticketing system you have (Jira? Bugzilla?) is not designed for discussion of the acceptance criteria. I think we are heading into a world of waterfall again, where we have discussions around the acceptance criteria. This is not a bad thing - it used to be called product management, and they would write an upfront spec.
If, in this new world, a tech and a business user lead the writing of a new spec (like a PEP), and then AI implements it and it’s put into a UAT harness for larger review with a daily cycle, we might have something.
Good luck
Dear Claude,

I hope this email finds you well.

I am writing to ask if you could please do another task for me.

Start by running `npx @acai.sh/cli skill`.

This will teach you everything you need to know about our process for spec-driven development. Then, proceed to plan and implement the features specified in our spec files.

Love,
[your-name]
Honestly, I can no longer tell parody from reality. Whether in politics or AI.

FYI, language alone can’t define/describe requirements, which is why UML existed.
You could deterministically process any UML diagram into a prose equivalent.
And in fact you couldn't go the other way (any prose -> UML), because UML is less powerful than natural language and actually can't express everything that natural language can.
Can it also fully describe a composition by Bach or a Rembrandt painting? In some weird, overly complex way it probably 'could', but it would be very painful. That's why we pick other forms of expression: to compact and optimise information delivery. Another benefit is that we cut out the noise. So yes, UML cannot describe everything natural language can, but then again, why should it - it was designed as a specific framework for designing relations between objects. Not more and not less. Similar for sequence diagrams or other forms of communicating ideas efficiently.
This industry has become a parody of itself, and people are celebrating.
First it was choice of editor: people were micro-optimizing every aspect of their typing experience - editor wars where people would all but slaughter each other over a suggestion from the other camp.
Editor wars v2: IDEs arrived and second editor war began.
Revenge of the note taking apps: Obsidian/Roam/Joplin/Apple Notes/Logseq. Just one plugin, just one more knowledge graph, bro, and I’ll have peak productivity. 10x is almost here.
AI: you’re witnessing it now.
Do people NOT have anything else in life? How are y’all finding time to do all of this shit? Are you doing it on company time? Do you have hobbies, do you learn foreign languages, travel, have kids or spouses, drive a car, other thousand “normie” things outside of staring at the freaking monitor or thinking about this shit 24/7? Did I miss the invention of a Time Machine?
Also, a lot of folks don't write code anymore, and barely have the time to read the volume of code that AI produces. This may just be one of the most profound changes in an industry, and some folks are excited about it and want to get better at building with it.
I think the person who wrote this post made a good faith effort to share his learnings while promoting his tool.
How are any of those things even remotely as interesting as arguing with people about an Emacs config?
People are people.
This industry is just getting more and more bonkers.
In other words: specs can be as detailed as it gets, and this is why developers have a hard time when, as seniors, they face an NDA'd, regulated environment. It ain’t software craftsmanship but data flow, hardware components, compliance at the lowest level (often including supply chains), and information architecture - a simple app needs to comply with specs that amount to thousands of pages.
Context window: circular reference. A year ago? Specsmaxxing by really weeding out any redundant words. Today? Yawn - like 8 MB of RAM vs 512 GB.
AI wants to be easy on us so what is a spec anyway then?
To put it this way: the spec for the spec is constantly evolving.
Last year’s prompts lead to extremely different results today no matter how maxed out.
The author was on point with his introduction: AI is just as junior in many ways when it comes to any sort of efficiency and optimization.
This is my re-evaluation after years of experimenting with AI: beautiful, sophisticated code, but performance-wise and architecturally it is laughable at best.
AI is not trained on optimization. Not in the slightest, and juniors have no clue about algorithms and Big O.
In fact, Google used Big O as a basic entry-level interview question for a very long time. They have to, but the simple fact that, in my experience, 99% of devs have never heard of or considered it speaks volumes.
AI cannot compensate for that (yet).
I went the opposite way, and my specs focus heavily on architecture and the obvious dumb performance drains noobs fall into.
Google was mocked for those Big O questions. And yes, failing to understand that Big O can thankfully be neglected in 99% of cases is part of that logic.
AI bloats your code, and a year-long single-dev project gets pumped out in hours. In short: a home run for Big O, because it looks at results that change depending on the variables. A function, in mathematical terms.
So I think the author did a funny and great job: focus on Big O if needed. Everything else is not that important, because it stays open to change and extension.
Big numbers need great architecture.
It screams loudly. And also think about leaks: before AI I had virtually no memory leaks at all. Since AI, my NodeJS and React code leaks worse than IE 6 and 8 did. I mean it.
Big O thinking reduces them significantly, so don’t work around the elephant in the room.
Architecture and optimization are brutally hard. Google blew my mind in this regard, but that’s another story - of squeezing even milliseconds out of a build tool used by everyone. A single dev laughs at that, but fails at both the calculation and the abstraction.
There's a fair amount of talk right now about the value being in the verification layer - once there's a hard verification loop, the agents can do amazing things without getting (permanently) sidetracked. I think what you're working on is halfway there - in essence, you're probably relying on the LLM's notion of what a spec is and what it should be for the codebase.
What's not currently solved, and what I think is very interesting is how much automation can be added to the creation of verification. We all would unlock a lot more speed and productivity for even moderate gains on that side.
Disagree on the bit about it "never going to work" though.
Failure-prone stochastic ML systems produce testable, auditable code... just like failure-prone human brains can produce testable, auditable code. And in fact, in both cases, changes to our process can reduce the number of failures that slip past testing and audit, or can reap other rewards. Finding a better process is what I'm interested in right now.
You're missing the bigger picture here. Yeah, they produce code. But "producing" code was never the bottleneck. Yes, you can pop out a webapp within a couple of hours, but now you have no clue how it works, even if it's a language and framework you are competent in, because you skipped the part where you understand how the parts fit together architecturally. So you wrote an elaborate spec, but the LLM "decides" to do something else. Maybe it doesn't make that PK autoincrement, or it throws in those nice empty "catch" blocks it ingested from various beginner tutorials, which will be very "helpful" when your application silently deviates from the happy path you spec'ed the hell out of in your virulent spec-driven workflow. So it "kinda" works - it generates the code. It works the way your kid's toy car works: it "drives", but it cannot be driven to work, can it? So it does not work in the big picture. It's not a reliable, enterprise-ready system. It's a toy, and should be treated like one.