How I use Claude Code: Separation of planning and execution

chaboud · 2026-02-22T07:26:49 1771745209

The author seems to think they've hit upon something revolutionary...

They've actually hit upon something that several of us have evolved to naturally.

LLM's are like unreliable interns with boundless energy. They make silly mistakes, wander into annoying structural traps, and have to be unwound if left to their own devices. It's like the genie that almost pathologically misinterprets your wishes.

So, how do you solve that? Exactly how an experienced lead or software manager does: you have systems write it down before executing, explain things back to you, and ground all of their thinking in the code and documentation, avoiding making assumptions about code after superficial review.

When it was early ChatGPT, this meant function-level thinking and clearly described jobs. When it was Cline it meant cline rules files that forced writing architecture.md files and vibe-code.log histories, demanding grounding in research and code reading.

Maybe nine months ago, another engineer said two things to me, less than a day apart:

- "I don't understand why your clinerules file is so large. You have the LLM jumping through so many hoops and doing so much extra work. It's crazy."

- The next morning: "It's basically like a lottery. I can't get the LLM to generate what I want reliably. I just have to settle for whatever it comes up with and then try again."

These systems have to deal with minimal context, ambiguous guidance, and extreme isolation. Operate with a little empathy for the energetic interns, and they'll uncork levels of output worth fighting for. We're Software Managers now. For some of us, that's working out great.

vishnugupta · 2026-02-22T08:18:52 1771748332

Revolutionary or not it was very nice of the author to make time and effort to share throat workflow.

For those starting out using Claude Code it gives a structured way to get things done bypassing the time/energy needed to “hit upon something that several of us have evolved to naturally”.

CodeBit26 · 2026-02-22T08:19:52 1771748392

I really like your analogy of LLMs as 'unreliable interns'. The shift from being a 'coder' to a 'software manager' who enforces documentation and grounding is the only way to scale these tools. Without an architecture.md or similar grounding, the context drift eventually makes the AI-generated code a liability rather than an asset. It's about moving the complexity from the syntax to the specification.

marc_g · 2026-02-22T08:07:52 1771747672

I’ve also found that a bigger focus on expanding my agents.md as the project rolls on has led to less headaches overall and more consistency (non-surprisingly). It’s the same as asking juniors to reflect on the work they’ve completed and to document important things that can help them in the future. Software Manger is a good way to put this.

jeffreygoesto · 2026-02-22T08:11:03 1771747863

Oh no, maybe the V-Model was right all the time? And right sizing increments with control stops after them. No wonder these matrix multiplications start to behave like humans, that is what we wanted them to do.

w10-1 · 2026-02-22T08:22:00 1771748520

I try these staging-document patterns, but suspect they have 2 fundamental flaws that stem mostly from our own biases.

First, Claude evolves. The original post work pattern evolved over 9 months, before claude's recent step changes. It's likely claude's present plan mode is better than this workaround, but if you stick to the workaround, you'd never know.

Second, the staging docs that represent some context - whether a library skills or current session design and implementation plans - are not the model Claude works with. At best they are shaping it, but I've found it does ignore and forget even what's written (even when I shout with emphasis), and the overall session influences the code. (Most often this happens when a peripheral adjustment ends up populating half the context.)

Indeed the biggest benefit from the OP might be to squeeze within 1 session, omitting peripheral features and investigations at the plan stage. So the mechanism of action might be the combination of getting our own plan clear and avoiding confusing excursions. (A test for that would be to redo the session with the final plan and implementation, to see if the iteration process itself is shaping the model.)

Our bias is to believe that we're getting better at managing this thing, and that we can control and direct it. It's uncomfortable to realize you can only really influence it - much like giving direction to a junior, but they can still go off track. And even if you found a pattern that works, it might work for reasons you're not understanding -- and thus fail you eventually. So, yes, try some patterns, but always hang on to the newbie senses of wonder and terror that make you curious, alert, and experimental.

mvkel · 2026-02-22T06:53:17 1771743197

> the workflow I’ve settled into is radically different from what most people do with AI coding tools

This looks exactly like what anthropic recommends as the best practice for using Claude Code. Textbook.

It also exposes a major downside of this approach: if you don't plan perfectly, you'll have to start over from scratch if anything goes wrong.

I've found a much better approach in doing a design -> plan -> execute in batches, where the plan is no more than 1,500 lines, used as a proxy for complexity.

My 30,000 LOC app has about 100,000 lines of plan behind it. Can't build something that big as a one-shot.

onion2k · 2026-02-22T07:25:41 1771745141

if you don't plan perfectly, you'll have to start over from scratch if anything goes wrong

This is my experience too, but it's pushed me to make much smaller plans and to commit things to a feature branch far more atomically so I can revert a step to the previous commit, or bin the entire feature by going back to main. I do this far more now than I ever did when I was writing the code by hand.

This is how developers should work regardless of how the code is being developed. I think this is a small but very real way AI has actually made me a better developer (unless I stop doing it when I don't use AI... not tried that yet.)

AstroBen · 2026-02-22T07:54:25 1771746865

100,000 lines is approx. one million words. The average person reads at 250wpm. The entire thing would take 66 hours just to read, assuming you were approaching it like a fiction book, not thinking anything over

dakolli · 2026-02-22T07:36:35 1771745795

wtf, why would you write 100k lines of plan to produce 30k loc.. JUST WRITE THE CODE!!!

Bishonen88 · 2026-02-22T07:45:17 1771746317

They didn't write 100k plan lines. The llm did (99.9% of it at least or more). Writing 30k by hand would take weeks if not months. Llms do it in an afternoon.

AstroBen · 2026-02-22T08:01:21 1771747281

Just reading that plan would take weeks or months

Bishonen88 · 2026-02-22T07:07:17 1771744037

Dunno. My 80k+ LOC personal life planner, with a native android app, eink display view still one shots most features/bugs I encounter. I just open a new instance let it know what I want and 5min later it's done.

makeramen · 2026-02-22T07:11:14 1771744274

Both can be true. I have personally experienced both.

Some problems AI surprised me immensely with fast, elegant efficient solutions and problem solving. I've also experienced AI doing totally absurd things that ended up taking multiple times longer than if I did it manually. Sometimes in the same project.

therealdrag0 · 2026-02-22T07:35:12 1771745712

In 5 min you are one shotting smaller changes to the larger code base right? Not the entire 80k likes which was the other comments point afaict.

Bishonen88 · 2026-02-22T07:52:59 1771746779

Yeah, then I guess I misunderstood the post. Its smaller features one by one ofc.

vasco · 2026-02-22T07:19:37 1771744777

What is a personal life planner?

Bishonen88 · 2026-02-22T07:26:26 1771745186

Todos, habits, goals, calendar, meals, notes, bookmarks, shopping lists, finances. More or less that with Google cal integration, garmin Integration (Auto updates workout habits, weight goals) family sharing/gamification, daily/weekly reviews, ai summaries and more. All built by just prompting Claude for feature after feature, with me writing 0 lines.

vasco · 2026-02-22T08:18:48 1771748328

Ah, I imagined actual life planning as in asking AI what to do, I was morbidly curious.

Prompting basic notes apps is not as exciting but I can see how people who care about that also care about it being exactly a certain way, so I think get your excitement.

puchatek · 2026-02-22T07:35:42 1771745742

Is it on GH?

Bishonen88 · 2026-02-22T07:49:44 1771746584

It was when I mvp'd it 3 weeks ago. Then I removed it as I was toying with the idea of somehow monetizing it. Then I added a few features which would make monetization impossible (e.g. How the app obtains etf/stock prices live and some other things). I reckon I could remove those and put in gh during the week if I don't forget. The quality of the Web app is SaaS grade IMO. Keyboard shortcuts, cmd+k, natural language parsing, great ui that doesn't look like made by ai in 5min. Might post here the link.

haolez · 2026-02-22T01:25:29 1771723529

> Notice the language: “deeply”, “in great details”, “intricacies”, “go through everything”. This isn’t fluff. Without these words, Claude will skim. It’ll read a file, see what a function does at the signature level, and move on. You need to signal that surface-level reading is not acceptable.

This makes no sense to my intuition of how an LLM works. It's not that I don't believe this works, but my mental model doesn't capture why asking the model to read the content "more deeply" will have any impact on whatever output the LLM generates.

nostrademons · 2026-02-22T02:48:53 1771728533

It's the attention mechanism at work, along with a fair bit of Internet one-up-manship. The LLM has ingested all of the text on the Internet, as well as Github code repositories, pull requests, StackOverflow posts, code reviews, mailing lists, etc. In a number of those content sources, there will be people saying "Actually, if you go into the details of..." or "If you look at the intricacies of the problem" or "If you understood the problem deeply" followed by a very deep, expert-level explication of exactly what you should've done differently. You want the model to use the code in the correction, not the one in the original StackOverflow question.

Same reason that "Pretend you are an MIT professor" or "You are a leading Python expert" or similar works in prompts. It tells the model to pay attention to the part of the corpus that has those terms, weighting them more highly than all the other programming samples that it's run across.

manmal · 2026-02-22T06:03:02 1771740182

I don’t think this is a result of the base training data („the internet“). It’s a post training behavior, created during reinforcement learning. Codex has a totally different behavior in that regard. Codex reads per default a lot of potentially relevant files before it goes and writes files.

Maybe you remember that, without reinforcement learning, the models of 2019 just completed the sentences you gave them. There were no tool calls like reading files. Tool calling behavior is company specific and highly tuned to their harnesses. How often they call a tool, is not part of the base training data.

spagettnet · 2026-02-22T06:11:38 1771740698

Modern LLM are certainly fine tuned on data that includes examples of tool use, mostly the tools built into their respective harnesses, but also external/mock tools so they dont overfit on only using the toolset they expect to see in their harnesses.

xscott · 2026-02-22T04:31:55 1771734715

Of course I can't be certain, but I think the "mixture of experts" design plays into it too. Metaphorically, there's a mid-level manager who looks at your prompt and tries to decide which experts it should be sent to. If he thinks you won't notice, he saves money by sending it to the undergraduate intern.

Just a theory.

victorbjorklund · 2026-02-22T05:01:04 1771736464

Notice that MOE isn’t different experts for different types of problems. It’s per token and not really connect to problem type.

So if you send a python code then the first one in function can be one expert, second another expert and so on.

dotancohen · 2026-02-22T07:36:42 1771745802

Can you back this up with documentation? I don't believe that this is the case.

hbarka · 2026-02-22T07:47:17 1771746437

>> Same reason that "Pretend you are an MIT professor" or "You are a leading Python expert" or similar works in prompts.

This pretend-you-are-a-[persona] is cargo cult prompting at this point. The persona framing is just decoration.

A brief purpose statement describing what the skill [skill.md] does is more honest and just as effective.

r0b05 · 2026-02-22T03:58:56 1771732736

This is such a good explanation. Thanks

FuckButtons · 2026-02-22T04:49:40 1771735780

That’s because it’s superstition.

Unless someone can come up with some kind of rigorous statistics on what the effect of this kind of priming is it seems no better than claiming that sacrificing your first born will please the sun god into giving us a bountiful harvest next year.

Sure, maybe this supposed deity really is this insecure and needs a jolly good pep talk every time he wakes up. or maybe you’re just suffering from magical thinking that your incantations had any effect on the random variable word machine.

The thing is, you could actually prove it, it’s an optimization problem, you have a model, you can generate the statistics, but no one as far as I can tell has been terribly forthcoming with that , either because those that have tried have decided to try to keep their magic spells secret, or because it doesn’t really work.

If it did work, well, the oldest trick in computer science is writing compilers, i suppose we will just have to write an English to pedantry compiler.

majormajor · 2026-02-22T05:39:35 1771738775

> If it did work, well, the oldest trick in computer science is writing compilers, i suppose we will just have to write an English to pedantry compiler.

"Add tests to this function" for GPT-3.5-era models was much less effective than "you are a senior engineer. add tests for this function. as a good engineer, you should follow the patterns used in these other three function+test examples, using this framework and mocking lib." In today's tools, "add tests to this function" results in a bunch of initial steps to look in common places to see if that additional context already exists, and then pull it in based on what it finds. You can see it in the output the tools spit out while "thinking."

So I'm 90% sure this is already happening on some level.

stingraycharles · 2026-02-22T07:46:51 1771746411

I actually have a prompt optimizer skill that does exactly this.

https://github.com/solatis/claude-config

It’s based entirely off academic research, and a LOT of research has been done in this area.

One of the papers you may be interested in is “emotion prompting”, eg “it is super important for me that you do X” etc actually works.

“Large Language Models Understand and Can be Enhanced by Emotional Stimuli”

https://arxiv.org/abs/2307.11760

onion2k · 2026-02-22T07:30:57 1771745457

i suppose we will just have to write an English to pedantry compiler.

A common technique is to prompt in your chosen AI to write a longer prompt to get it to do what you want. It's used a lot in image generation. This is called 'prompt enhancing'.

rzmmm · 2026-02-22T06:35:12 1771742112

I think "understand this directory deeply" just gives more focus for the instruction. So it's like "burn more tokens for this phase than you normally would".

imiric · 2026-02-22T07:14:59 1771744499

> That’s because it’s superstition.

This field is full of it. Practices are promoted by those who tie their personal or commercial brand to it for increased exposure, and adopted by those who are easily influenced and don't bother verifying if they actually work.

This is why we see a new Markdown format every week, "skills", "benchmarks", and other useless ideas, practices, and measurements. Consider just how many "how I use AI" articles are created and promoted. Most of the field runs on anecdata.

It's not until someone actually takes the time to evaluate some of these memes, that they find little to no practical value in them.[1]

[1]: https://news.ycombinator.com/item?id=47034087

jcdavis · 2026-02-22T01:37:23 1771724243

Its a wild time to be in software development. Nobody(1) actually knows what causes LLMs to do certain things, we just pray the prompt moves the probabilities the right way enough such that it mostly does what we want. This used to be a field that prided itself on deterministic behavior and reproducibility.

Now? We have AGENTS.md files that look like a parent talking to a child with all the bold all-caps, double emphasis, just praying that's enough to be sure they run the commands you want them to be running

(1 Outside of some core ML developers at the big model companies)

harrall · 2026-02-22T03:38:31 1771731511

It’s like playing a fretless instrument to me.

Practice playing songs by ear and after 2 weeks, my brain has developed an inference model of where my fingers should go to hit any given pitch.

Do I have any idea how my brain’s model works? No! But it tickles a different part of my brain and I like it.

klipt · 2026-02-22T04:18:05 1771733885

Sufficiently advanced technology has become like magic: you have to prompt the electronic genie with the right words or it will twist your wishes.

silversmith · 2026-02-22T06:07:02 1771740422

Light some incense, and you too can be a dystopian space tech support, today! Praise Omnissiah!

overfeed · 2026-02-22T07:08:00 1771744080

are we the orks?

chickensong · 2026-02-22T02:12:40 1771726360

For Claude at least, the more recent guidance from Anthropic is to not yell at it. Just clear, calm, and concise instructions.

glerk · 2026-02-22T04:03:13 1771732993

Yep, with Claude saying "please" and "thank you" actually works. If you build rapport with Claude, you get rewarded with intuition and creativity. Codex, on the other hand, you have to slap it around like a slave gollum and it will do exactly what you tell it to do, no more, no less.

whateveracct · 2026-02-22T06:25:15 1771741515

this is psychotic why is this how this works lol

joshmn · 2026-02-22T02:56:39 1771728999

Sometimes I daydream about people screaming at their LLM as if it was a TV they were playing video games on.

trueno · 2026-02-22T02:40:10 1771728010

wait seriously? lmfao

thats hilarious. i definitely treat claude like shit and ive noticed the falloff in results.

if there's a source for that i'd love to read about it.

chickensong · 2026-02-22T06:12:34 1771740754

I don't have a source offhand, but I think it may have been part of the 4.5 release? Older models definitely needed caps and words like critical, important, never, etc... but Anthropic published something that said don't do that anymore.

basch · 2026-02-22T06:03:51 1771740231

If you think about where in the training data there is positivity vs negativity it really becomes equivalent to having a positive or negative mindset regarding a standing and outcome in life.

whateveracct · 2026-02-22T06:25:43 1771741543

i make claude grovel at my feet and tell me in detail why my code is better than its code

xmcp123 · 2026-02-22T03:29:15 1771730955

For awhile(maybe a year ago?) it seemed like verbal abuse was the best way to make Claude pay attention. In my head, it was impacting how important it deemed the instruction. And it definitely did seem that way.

defrost · 2026-02-22T02:49:08 1771728548

Consciousness is off the table but they absolutely respond to environmental stimulus and vibes.

See, uhhh, https://pmc.ncbi.nlm.nih.gov/articles/PMC8052213/ and maybe have a shot at running claude while playing Enya albums on loop.

/s (??)

trueno · 2026-02-22T03:48:59 1771732139

i have like the faintest vague thread of "maybe this actually checks out" in a way that has shit all to do with consciousness

sometimes internet arguments get messy, people die on their hills and double / triple down on internet message boards. since historic internet data composes a bit of what goes into an llm, would it make sense that bad-juju prompting sends it to some dark corners of its training model if implementations don't properly sanitize certain negative words/phrases ?

in some ways llm stuff is a very odd mirror that haphazardly regurgitates things resulting from the many shades of gray we find in human qualities.... but presents results as matter of fact. the amount of internet posts with possible code solutions and more where people egotistically die on their respective hills that have made it into these models is probably off the charts, even if the original content was a far cry from a sensible solution.

all in all llm's really do introduce quite a bit of a black box. lot of benefits, but a ton of unknowns and one must be hyperviligant to the possible pitfalls of these things... but more importantly be self aware enough to understand the possible pitfalls that these things introduce to the person using them. they really possibly dangerously capitalize on everyones innate need to want to be a valued contributor. it's really common now to see so many people biting off more than they can chew, often times lacking the foundations that would've normally had a competent engineer pumping the brakes. i have a lot of respect/appreciation for people who might be doing a bit of claude here and there but are flat out forward about it in their readme and very plainly state to not have any high expectations because _they_ are aware of the risks involved here. i also want to commend everyone who writes their own damn readme.md.

these things are for better or for worse great at causing people to barrel forward through 'problem solving', which is presenting quite a bit of gray area on whether or not the problem is actually solved / how can you be sure / do you understand how the fix/solution/implementation works (in many cases, no). this is why exceptional software engineers can use this technology insanely proficiently as a supplementary worker of sorts but others find themselves in a design/architect seat for the first time and call tons of terrible shots throughout the course of what it is they are building. i'd at least like to call out that people who feel like they "can do everything on their own and don't need to rely on anyone" anymore seem to have lost the plot entirely. there are facets of that statement that might be true, but less collaboration especially in organizations is quite frankly the first steps some people take towards becoming delusional. and that is always a really sad state of affairs to watch unfold. doing stuff in a vaccuum is fun on your own time, but forcing others to just accept things you built in a vaccuum when you're in any sort of team structure is insanely immature and honestly very destructive/risky. i would like to think absolutely no one here is surprised that some sub-orgs at Microsoft force people to use copilot or be fired, very dangerous path they tread there as they bodyslam into place solutions that are not well understood. suddenly all the leadership decisions at many companies that have made to once again bring back a before-times era of offshoring work makes sense: they think with these technologies existing the subordinate culture of overseas workers combined with these techs will deliver solutions no one can push back on. great savings and also no one will say no.

hashmap · 2026-02-22T01:57:09 1771725429

these sort-of-lies might help:

think of the latent space inside the model like a topological map, and when you give it a prompt, you're dropping a ball at a certain point above the ground, and gravity pulls it along the surface until it settles.

caveat though, thats nice per-token, but the signal gets messed up by picking a token from a distribution, so each token you're regenerating and re-distorting the signal. leaning on language that places that ball deep in a region that you want to be makes it less likely that those distortions will kick it out of the basin or valley you may want to end up in.

if the response you get is 1000 tokens long, the initial trajectory needed to survive 1000 probabilistic filters to get there.

or maybe none of that is right lol but thinking that it is has worked for me, which has been good enough

noduerme · 2026-02-22T03:57:57 1771732677

Hah! Reading this, my mind inverted it a bit, and I realized ... it's like the claw machine theory of gradient descent. Do you drop the claw into the deepest part of the pile, or where there's the thinnest layer, the best chance of grabbing something specific? Everyone in everu bar has a theory about claw machines. But the really funny thing that unites LLMs with claw machines is that the biggest question is always whether they dropped the ball on purpose.

The claw machine is also a sort-of-lie, of course. Its main appeal is that it offers the illusion of control. As a former designer and coder of online slot machines... totally spin off into pages on this analogy, about how that illusion gets you to keep pulling the lever... but the geographic rendition you gave is sort of priceless when you start making the comparison.

basch · 2026-02-22T06:07:47 1771740467

My mental model for them is plinko boards. Your prompt changes the spacing between the nails to increase the probability in certain directions as your chip falls down.

scuff3d · 2026-02-22T03:20:50 1771730450

How anybody can read stuff like this and still take all this seriously is beyond me. This is becoming the engineering equivalent of astrology.

energy123 · 2026-02-22T07:13:28 1771744408

Anthropic recommends doing magic invocations: https://simonwillison.net/2025/Apr/19/claude-code-best-pract...

It's easy to know why they work. The magic invocation increases test-time compute (easy to verify yourself - try!). And an increase in test-time compute is demonstrated to increase answer correctness (see any benchmark).

It might surprise you to know that the only different between GPT 5.2-low and GPT 5.2-xhigh is one of these magic invocations. But that's not supposed to be public knowledge.

fragmede · 2026-02-22T03:28:58 1771730938

Feel free to run your own tests and see if the magic phrases do or do not influence the output. Have it make a Todo webapp with and without those phrases and see what happens!

scuff3d · 2026-02-22T04:29:55 1771734595

That's not how it works. It's not on everyone else to prove claims false, it's on you (or the people who argue any of this had a measurable impact) to prove it actually works. I've seen a bunch of articles like this, and more comments. Nobody I've ever seen has produced any kind of measurable metrics of quality based on one approach vs another. It's all just vibes.

Without something quantifiable it's not much better then someone who always wears the same jersey when their favorite team plays, and swears they play better because of it.

yaku_brang_ja · 2026-02-22T06:28:30 1771741710

These coding agents are literally Language Models. The way you structure your prompting language affect the actual output.

guiambros · 2026-02-22T05:52:17 1771739537

If you read the transformer paper, or get any book on NLP, you will see that this is not magic incantation; it's purely the attention mechanism at work. Or you can just ask Gemini or Claude why these prompts work.

But I get the impression from your comment that you have a fixed idea, and you're not really interested in understanding how or why it works.

If you think like a hammer, everything will look like a nail.

scuff3d · 2026-02-22T06:34:20 1771742060

I know why it works, to varying and unmeasurable degrees of success. Just like if I poke a bull with a sharp stick, I know it's gonna get it's attention. It might choose to run away from me in one of any number of directions, or it might decide to turn around and gore me to death. I can't answer that question with any certainty then you can.

The system is inherently non-deterministic. Just because you can guide it a bit, doesn't mean you can predict outcomes.

guiambros · 2026-02-22T07:36:07 1771745767

> The system is inherently non-deterministic.

The system isn't randomly non-deterministic; it is statistically probabilistic.

The next-token prediction and the attention mechanism is actually a rigorous deterministic mathematical process. The variation in output comes from how we sample from that curve, and the temperature used to calibrate the model. Because the underlying probabilities are mathematically calculated, the system's behavior remains highly predictable within statistical bounds.

Yes, it's a departure from the fully deterministic systems we're used to. But that's not different than the many real world systems: weather, biology, robotics, quantum mechanics. Even the computer you're reading this right now is full of probabilistic processes, abstracted away through sigmoid-like functions that push the extremes to 0s and 1s.

winrid · 2026-02-22T06:43:04 1771742584

But we can predict the outcomes, though. That's what we're saying, and it's true. Maybe not 100% of the time, but maybe it helps a significant amount of the time and that's what matters.

Is it engineering? Maybe not. But neither is knowing how to talk to junior developers so they're productive and don't feel bad. The engineering is at other levels.

tokioyoyo · 2026-02-22T05:18:31 1771737511

Do you actively use LLMs to do semi-complex coding work? Because if not, it will sound mumbo-jumbo to you. Everyone else can nod along and read on, as they’ve experienced all of it first hand.

scuff3d · 2026-02-22T06:03:56 1771740236

You've missed the point. This isn't engineering, it's gambling.

You could take the exact same documents, prompts, and whatever other bullshit, run it on the exact same agent backed by the exact same model, and get different results every single time. Just like you can roll dice the exact same way on the exact same table and you'll get two totally different results. People are doing their best to constrain that behavior by layering stuff on top, but the foundational tech is flawed (or at least ill suited for this use case).

That's not to say that AI isn't helpful. It certainly is. But when you are basically begging your tools to please do what you want with magic incantations, we've lost the fucking plot somewhere.

gf000 · 2026-02-22T06:46:37 1771742797

> You could take the exact same documents, prompts, and whatever other bullshit, run it on the exact same agent backed by the exact same model, and get different results every single time

This is more of an implementation detail/done this way to get better results. A neural network with fixed weights (and deterministic floating point operations) returning a probability distribution, where you use a pseudorandom generator with a fixed seed called recursively will always return the same output for the same input.

Betelbuddy · 2026-02-22T02:27:25 1771727245

Its very logical and pretty obvious when you do code generation. If you ask the same model, to generate code by starting with:

- You are a Python Developer... or - You are a Professional Python Developer... or - You are one of the World most renowned Python Experts, with several books written on the subject, and 15 years of experience in creating highly reliable production quality code...

You will notice a clear improvement in the quality of the generated artifacts.

obiefernandez · 2026-02-22T02:43:21 1771728201

My colleague swears by his DHH claude skill https://danieltenner.com/dhh-is-immortal-and-costs-200-m/

haolez · 2026-02-22T03:01:19 1771729279

That's different. You are pulling the model, semantically, closer to the problem domain you want it to attack.

That's very different from "think deeper". I'm just curious about this case in specific :)

argee · 2026-02-22T05:35:36 1771738536

I don't know about some of those "incantations", but it's pretty clear that an LLM can respond to "generate twenty sentences" vs. "generate one word". That means you can indeed coax it into more verbosity ("in great detail"), and that can help align the output by having more relevant context (inserting irrelevant context or something entirely improbable into LLM output and forcing it to continue from there makes it clear how detrimental that can be).

Of course, that doesn't mean it'll definitely be better, but if you're making an LLM chain it seems prudent to preserve whatever info you can at each step.

computomatic · 2026-02-22T06:06:07 1771740367

If I say “you are our domain expert for X, plan this task out in great detail” to a human engineer when delegating a task, 9 times out of 10 they will do a more thorough job. It’s not that this is voodoo that unlocks some secret part of their brain. It simply establishes my expectations and they act accordingly.

To the extent that LLMs mimic human behaviour, it shouldn’t be a surprise that setting clear expectations works there too.

giancarlostoro · 2026-02-22T03:26:39 1771730799

The LLM will do what you ask it to unless you don't get nuanced about it. Myself and others have noticed that LLM's work better when your codebase is not full of code smells like massive godclass files, if your codebase is discrete and broken up in a way that makes sense, and fits in your head, it will fit in the models head.

computerex · 2026-02-22T05:34:44 1771738484

It is as the author said, it'll skim the content unless otherwise prompted to do so. It can read partial file fragments; it can emit commands to search for patterns in the files. As opposed to carefully reading each file and reasoning through the implementation. By asking it to go through in detail you are telling it to not take shortcuts and actually read the actual code in full.

ambicapter · 2026-02-22T03:17:07 1771730227

Maybe the training data that included the words like "skim" also provided shallower analysis than training that was close to the words "in great detail", so the LLM is just reproducing those respective words distribution when prompted with directions to do either.

winwang · 2026-02-22T03:32:24 1771731144

Apparently LLM quality is sensitive to emotional stimuli?

"Large Language Models Understand and Can be Enhanced by Emotional Stimuli": https://arxiv.org/abs/2307.11760

Affric · 2026-02-22T04:56:17 1771736177

My guess would be that there’s a greater absolute magnitude of the vectors to get to the same point in the knowledge model.

stingraycharles · 2026-02-22T02:07:11 1771726031

It’s actually really common. If you look at Claude Code’s own system prompts written by Anthropic, they’re littered with “CRITICAL (RULE 0):” type of statements, and other similar prompting styles.

Scrapemist · 2026-02-22T04:31:04 1771734664

Where can I find those?

ChadNauseam · 2026-02-22T01:52:37 1771725157

The disconnect might be that there is a separation between "generating the final answer for the user" and "researching/thinking to get information needed for that answer". Saying "deeply" prompts it to read more of the file (as in, actually use the `read` tool to grab more parts of the file into context), and generate more "thinking" tokens (as in, tokens that are not shown to the user but that the model writes to refine its thoughts and improve the quality of its answer).

wrs · 2026-02-22T05:34:56 1771738496

The original “chain of thought” breakthrough was literally to insert words like “Wait” and “Let’s think step by step”.

DemocracyFTW2 · 2026-02-22T07:21:30 1771744890

—HAL, open the shuttle bay doors.

(chirp)

—HAL, please open the shuttle bay doors.

(pause)

—HAL!

—I'm afraid I can't do that, Dave.

wilkystyle · 2026-02-22T01:44:32 1771724672

The author is referring to how the framing of your prompt informs the attention mechanism. You are essentially hinting to the attention mechanism that the function's implementation details have important context as well.

joseangel_sc · 2026-02-22T06:43:25 1771742605

if it’s so smart, why do i need to learn to use it?

fragmede · 2026-02-22T01:30:57 1771723857

Yeah, it's definitely a strange new world we're in, where I have to "trick" the computer into cooperating. The other day I told Claude "Yes you can", and it went off and did something it just said it couldn't do!

itypecode · 2026-02-22T01:34:49 1771724089

Solid dad move. XD

wilkystyle · 2026-02-22T01:45:27 1771724727

Is parenting making us better at prompt engineering, or is it the other way around?

fragmede · 2026-02-22T03:57:20 1771732640

Better yet, I have Codex, Gemini, and Claude as my kids, running around in my code playground. How do I be a good parent and not play favorites?

itypecode · 2026-02-22T07:12:18 1771744338

We all know Gemini is your artsy, Claude is your smartypants, and Codex is your nerd.

bpodgursky · 2026-02-22T01:59:20 1771725560

You bumped the token predictor into the latent space where it knew what it was doing : )

nazgul17 · 2026-02-22T04:03:07 1771732987

It's very much believable, to me.

In image generation, it's fairly common to add "masterpiece", for example.

I don't think of the LLM as a smart assistant that knows what I want. When I tell it to write some code, how does it know I want it to write the code like a world renowned expert would, rather than a junior dev?

I mean, certainly Anthropic has tried hard to make the former the case, but the Titanic inertia from internet scale data bias is hard to overcome. You can help the model with these hints.

Anyway, luckily this is something you can empirically verify. This way, you don't have to take anyone's word. If anything, if you find I'm wrong in your experiments, please share it!

MattGaiser · 2026-02-22T01:47:00 1771724820

One of the well defined failure modes for AI agents/models is "laziness." Yes, models can be "lazy" and that is an actual term used when reviewing them.

I am not sure if we know why really, but they are that way and you need to explicitly prompt around it.

kannanvijayan · 2026-02-22T02:41:28 1771728088

I've encountered this failure mode, and the opposite of it: thinking too much. A behaviour I've come to see as some sort of pseudo-neuroticism.

Lazy thinking makes LLMs do surface analysis and then produce things that are wrong. Neurotic thinking will see them over-analyze, and then repeatedly second-guess themselves, repeatedly re-derive conclusions.

Something very similar to an anxiety loop in humans, where problems without solutions are obsessed about in circles.

denimnerd42 · 2026-02-22T03:07:57 1771729677

yeah i experienced this the other day when asking claude code to build an http proxy using an afsk modem software to communicate over the computers sound card. it had an absolute fit tuning the system and would loop for hours trying and doubling back. eventually after some change in prompt direction to think more deeply and test more comprehensively it figured it out. i certainly had no idea how to build a afsk modem.

popalchemist · 2026-02-22T02:45:17 1771728317

Strings of tokens are vectors. Vectors are directions. When you use a phrase like that you are orienting the vector of the overall prompt toward the direction of depth, in its map of conceptual space.

zmmmmm · 2026-02-22T03:57:49 1771732669

I actually don't really like a few of things about this approach.

First, the "big bang" write it all at once. You are going to end up with thousands of lines of code that were monolithically produced. I think it is much better to have it write the plan and formulate it as sensible technical steps that can be completed one at a time. Then you can work through them. I get that this is not very "vibe"ish but that is kind of the point. I want the AI to help me get to the same point I would be at with produced code AND understanding of it, just accelerate that process. I'm not really interested in just generating thousands of lines of code that nobody understands.

Second, the author keeps refering to adjusting the behaviour, but never incorporating that into long lived guidance. To me, integral with the planning process is building an overarching knowledge base. Every time you're telling it there's something wrong, you need to tell it to update the knowledge base about why so it doesn't do it again.

Finally, no mention of tests? Just quick checks? To me, you have to end up with comprehensive tests. Maybe to the author it goes without saying, but I find it is integral to build this into the planning. Certain stages you will want certain types of tests. Some times in advance of the code (so TDD style) other times built alongside it or after.

It's definitely going to be interesting to see how software methodology evolves to incorporate AI support and where it ultimately lands.

girvo · 2026-02-22T04:09:59 1771733399

The articles approach matches mine, but I've learned from exactly the things you're pointing out.

I get the PLAN.md (or equivalent) to be separated into "phases" or stages, then carefully prompt (because Claude and Codex both love to "keep going") it to only implement that stage, and update the PLAN.md

Tests are crucial too, and form another part of the plan really. Though my current workflow begins to build them later in the process than I would prefer...

red_hare · 2026-02-22T02:36:29 1771727789

I use Claude Code for lecture prep.

I craft a detailed and ordered set of lecture notes in a Quarto file and then have a dedicated claude code skill for translating those notes into Slidev slides, in the style that I like.

Once that's done, much like the author, I go through the slides and make commented annotations like "this should be broken into two slides" or "this should be a side-by-side" or "use your generate clipart skill to throw an image here alongside these bullets" and "pull in the code example from ../examples/foo." It works brilliantly.

And then I do one final pass of tweaking after that's done.

But yeah, annotations are super powerful. Token distance in-context and all that jazz.

saxelsen · 2026-02-22T02:56:21 1771728981

Can I ask how you annotate the feedback for it? Just with inline comments like `# This should be changed to X`?

The author mentions annotations but doesn't go into detail about how to feed the annotations to Claude.

red_hare · 2026-02-22T03:27:58 1771730878

Slidev is markdown, so i do it in html comments. Usually something like:

    <!-- TODOCLAUDE: Split this into a two-cols-title, divide the examples between -->

or

    <!-- TODOCLAUDE: Use clipart skill to make an image for this slide -->

And then, when I finish annotating I just say: "Address all the TODOCLAUDEs"

ramoz · 2026-02-22T02:42:19 1771728139

is your skill open source

red_hare · 2026-02-22T03:37:54 1771731474

Not yet... but also I'm not sure it makes a lot of sense to be open source. It's super specific to how I like to build slide decks and to my personal lecture style.

But it's not hard to build one. The key for me was describing, in great detail:

1. How I want it to read the source material (e.g., H1 means new section, H2 means at least one slide, a link to an example means I want code in the slide)

2. How to connect material to layouts (e.g., "comparison between two ideas should be a two-cols-title," "walkthrough of code should be two-cols with code on right," "learning objectives should be side-title align:left," "recall should be side-title align:right")

Then the workflow is:

1. Give all those details and have it do a first pass.

2. Give tons of feedback.

3. At the end of the session, ask it to "make a skill."

4. Manually edit the skill so that you're happy with the examples.

duttish · 2026-02-22T06:47:28 1771742848

This is quite close to what I've arrived at, but with two modifications

1) anything larger I work on in layers of docs. Architecture and requirements -> design -> implementation plan -> code. Partly it helps me think and nail the larger things first, and partly helps claude. Iterate on each level until I'm satisfied.

2) when doing reviews of each doc I sometimes restart the session and clear context, it often finds new issues and things to clear up before starting the next phase.

jamesmcq · 2026-02-22T01:23:31 1771723411

This all looks fine for someone who can't code, but for anyone with even a moderate amount of experience as a developer all this planning and checking and prompting and orchestrating is far more work than just writing the code yourself.

There's no winner for "least amount of code written regardless of productivity outcomes.", except for maybe Anthropic's bank account.

shepherdjerred · 2026-02-22T01:31:57 1771723917

I really don't understand why there are so many comments like this.

Yesterday I had Claude write an audit logging feature to track all changes made to entities in my app. Yeah you get this for free with many frameworks, but my company's custom setup doesn't have it.

It took maybe 5-10 minutes of wall-time to come up with a good plan, and then ~20-30 min for Claude implement, test, etc.

That would've taken me at least a day, maybe two. I had 4-5 other tasks going on in other tabs while I waited the 20-30 min for Claude to generate the feature.

After Claude generated, I needed to manually test that it worked, and it did. I then needed to review the code before making a PR. In all, maybe 30-45 minutes of my actual time to add a small feature.

All I can really say is... are you sure you're using it right? Have you _really_ invested time into learning how to use AI tools?

tyleo · 2026-02-22T01:39:45 1771724385

Same here. I did bounce off these tools a year ago. They just didn't work for me 60% of the time. I learned a bit in that initial experience though and walked away with some tasks ChatGPT could replace in my workflow. Mainly replacing scripts and reviewing single files or functions.

Fast forward to today and I tried the tools again--specifically Claude Code--about a week ago. I'm blown away. I've reproduced some tools that took me weeks at full-time roles in a single day. This is while reviewing every line of code. The output is more or less what I'd be writing as a principal engineer.

jamesmcq · 2026-02-22T01:48:26 1771724906

Trust me I'm very impressed at the progress AI has made, and maybe we'll get to the point where everything is 100% correct all the time and better than any human could write. I'm skeptical we can get there with the LLM approach though.

The problem is LLMs are great at simple implementation, even large amounts of simple implementation, but I've never seen it develop something more than trivial correctly. The larger problem is it's very often subtly but hugely wrong. It makes bad architecture decisions, it breaks things in pursuit of fixing or implementing other things. You can tell it has no concept of the "right" way to implement something. It very obviously lacks the "senior developer insight".

Maybe you can resolve some of these with large amounts of planning or specs, but that's the point of my original comment - at what point is it easier/faster/better to just write the code yourself? You don't get a prize for writing the least amount of code when you're just writing specs instead.

fourthark · 2026-02-22T02:04:21 1771725861

This is exactly what the article is about. The tradeoff is that you have to throughly review the plans and iterate on them, which is tiring. But the LLM will write good code faster than you, if you tell it what good code is.

reg_dunlop · 2026-02-22T03:02:48 1771729368

Exactly; the original commenter seems determined to write-off AI as "just not as good as me".

The original article is, to me, seemingly not that novel. Not because it's a trite example, but because I've begun to experience massive gains from following the same basic premise as the article. And I can't believe there's others who aren't using like this.

I iterate the plan until it's seemingly deterministic, then I strip the plan of implementation, and re-write it following a TDD approach. Then I read all specs, and generate all the code to red->green the tests.

If this commenter is too good for that, then it's that attitude that'll keep him stuck. I already feel like my projects backlog is achievable, this year.

fourthark · 2026-02-22T03:30:10 1771731010

Strongly agree about the deterministic part. Even more important than a good design, the plan must not show any doubt, whether it's in the form of open questions or weasel words. 95% of the time those vague words mean I didn't think something through, and it will do something hideous in order to make the plan work

nojito · 2026-02-22T01:52:09 1771725129

>I've never seen it develop something more than trivial correctly.

This is 100% incorrect, but the real issue is that the people who are using these llms for non-trivial work tend to be extremely secretive about it.

For example, I view my use of LLMs to be a competitive advantage and I will hold on to this for as long as possible.

jamesmcq · 2026-02-22T01:57:14 1771725434

The key part of my comment is "correctly".

Does it write maintainable code? Does it write extensible code? Does it write secure code? Does it write performant code?

My experience has been it failing most of these. The code might "work", but it's not good for anything more than trivial, well defined functions (that probably appeared in it's training data written by humans). LLMs have a fundamental lack of understanding of what they're doing, and it's obvious when you look at the finer points of the outcomes.

That said, I'm sure you could write detailed enough specs and provide enough examples to resolve these issues, but that's the point of my original comment - if you're just writing specs instead of code you're not gaining anything.

cowlby · 2026-02-22T02:12:02 1771726322

I find “maintainable code” the hardest bias to let go of. 15+ years of coding and design patterns are hard to let go.

But the aha moment for me was what’s maintainable by AI vs by me by hand are on different realms. So maintainable has to evolve from good human design patterns to good AI patterns.

Specs are worth it IMO. Not because if I can spec, I could’ve coded anyway. But because I gain all the insight and capabilities of AI, while minimizing the gotchas and edge failures.

girvo · 2026-02-22T04:22:53 1771734173

> But the aha moment for me was what’s maintainable by AI vs by me by hand are on different realms. So maintainable has to evolve from good human design patterns to good AI patterns.

How do you square that with the idea that all the code still has to be reviewed by humans? Yourself, and your coworkers

cowlby · 2026-02-22T04:40:52 1771735252

I picture like semi conductors; the 5nm process is so absurdly complex that operators can't just peek into the system easily. I imagine I'm just so used to hand crafting code that I can't imagine not being able to peek in.

So maybe it's that we won't be reviewing by hand anymore? I.e. it's LLMs all the way down. Trying to embrace that style of development lately as unnatural as it feels. We're obv not 100% there yet but Claude Opus is a significant step in that direction and they keep getting better and better.

girvo · 2026-02-22T05:57:17 1771739837

Then who is responsible when (not if) that code does horrible things? We have humans to blame right now. I just don’t see it happening personally because liability and responsibility are too important

therealdrag0 · 2026-02-22T07:45:36 1771746336

For some software, sure but not most.

And you don’t blame humans anyways lol. Everywhere I’ve worked has had “blameless” postmortems. You don’t remove human review unless you have reasonable alternatives like high test coverage and other automated reviews.

reg_dunlop · 2026-02-22T03:05:59 1771729559

To answer all of your questions:

yes, if I steer it properly.

It's very good at spotting design patterns, and implementing them. It doesn't always know where or how to implement them, but that's my job.

The specs and syntactic sugar are just nice quality of life benefits.

jmathai · 2026-02-22T02:08:12 1771726092

You’d be building blocks which compound over time. That’s been my experience anyway.

The compounding is much greater than my brain can do on its own.

skydhash · 2026-02-22T02:39:18 1771727958

> Yesterday I had Claude write an audit logging feature to track all changes made to entities in my app. Yeah you get this for free with many frameworks, but my company's custom setup doesn't have it.

But did you truly think about such feature? Like guarantees that it should follow (like how do it should cope with entities migration like adding a new field) or what the cost of maintaining it further down the line. This looks suspiciously like drive-by PR made on open-source projects.

> That would've taken me at least a day, maybe two.

I think those two days would have been filled with research, comparing alternatives, questions like "can we extract this feature from framework X?", discussing ownership and sharing knowledge,.. Jumping on coding was done before LLMs, but it usually hurts the long term viability of the project.

Adding code to a project can be done quite fast (hackatons,...), ensuring quality is what slows things down in any any well functioning team.

streetfighter64 · 2026-02-22T01:44:25 1771724665

I mean, all I can really say is... if writing some logging takes you one or two days, are you sure you _really_ know how to code?

boxedemp · 2026-02-22T02:17:48 1771726668

Ever worked on a distributed system with hundreds of millions of customers and seemingly endless business requirements?

Some things are complex.

therealdrag0 · 2026-02-22T07:50:03 1771746603

Audit logging is different than developer logging… companies will have entire teams dedicated to audit systems.

shepherdjerred · 2026-02-22T01:46:21 1771724781

You're right, you're better than me!

You could've been curious and ask why it would take 1-2 days, and I would've happily told you.

jamesmcq · 2026-02-22T02:02:50 1771725770

I'll bite, because it does seem like something that should be quick in a well-architected codebase. What was the situation? Was there something in this codebase that was especially suited to AI-development? Large amounts of duplication perhaps?

shepherdjerred · 2026-02-22T02:19:10 1771726750

It's not particularly interesting.

I wanted to add audit logging for all endpoints we call, all places we call the DB, etc. across areas I haven't touched before. It would have taken me a while to track down all of the touchpoints.

Granted, I am not 100% certain that Claude didn't miss anything. I feel fairly confident that it is correct given that I had it research upfront, had multiple agents review, and it made the correct changes in the areas that I knew.

Also I'm realizing I didn't mention it included an API + UI for viewing events w/ pretty deltas

fendy3002 · 2026-02-22T06:53:23 1771743203

Well someone who says logging is easy never knows the difficulty of deciding "what" to log. And audit log is different beast altogether than normal logging

fragmede · 2026-02-22T02:15:39 1771726539

We're not as good at coding as you, naturally.

psvv · 2026-02-22T05:22:47 1771737767

I'd find it deeply funny if the optimal vibe coding workflow continues to evolve to include more and more human oversight, and less and less agent autonomy, to the point where eventually someone makes a final breakthrough that they can save time by bypassing the LLM entirely and writing the code themselves. (Finally coming full circle.)

skeledrew · 2026-02-22T02:19:59 1771726799

Researching and planning a project is a generally usefully thing. This is something I've been doing for years, and have always had great results compared to just jumping in and coding. It makes perfect sense that this transfers to LLM use.

roncesvalles · 2026-02-22T03:40:56 1771731656

Well it's less mental load. It's like Tesla's FSD. Am I a better driver than the FSD? For sure. But is it nice to just sit back and let it drive for a bit even if it's suboptimal and gets me there 10% slower, and maybe slightly pisses off the guy behind me? Yes, nice enough to shell out $99/mo. Code implementation takes a toll on you in the same way that driving does.

I think the method in TFA is overall less stressful for the dev. And you can always fix it up manually in the end; AI coding vs manual coding is not either-or.

kburman · 2026-02-22T02:08:44 1771726124

Since Opus 4.5, things have changed quite a lot. I find LLMs very useful for discussing new features or ideas, and Sonnet is great for executing your plan while you grab a coffee.

dmix · 2026-02-22T01:28:17 1771723697

Most of these AI coding articles seem to be about greenfield development.

That said, if you're on a serious team writing professional software there is still tons of value in always telling AI to plan first, unless it's a small quick task. This post just takes it a few steps further and formalizes it.

I find Cursor works much more reliably using plan mode, reviewing/revising output in markdown, then pressing build. Which isn't a ton of overhead but often leads to lots of context switching as it definitely adds more time.

keyle · 2026-02-22T01:42:11 1771724531

I partly agree with you. But once you have a codebase large enough, the changes become longer to even type in, once figured out.

I find the best way to use agents (and I don't use claude) is to hash it out like I'm about to write these changes and I make my own mental notes, and get the agent to execute on it.

Agents don't get tired, they don't start fat fingering stuff at 4pm, the quality doesn't suffer. And they can be parallelised.

Finally, this allows me to stay at a higher level and not get bogged down of "right oh did we do this simple thing again?" which wipes some of the context in my mind and gets tiring through the day.

Always, 100% review every line of code written by an agent though. I do not condone committing code you don't 'own'.

I'll never agree with a job that forces developers to use 'AI', I sometimes like to write everything by hand. But having this tool available is also very powerful.

Quothling · 2026-02-22T06:51:10 1771743070

I think it comes down to "it depends". I work in a NIS2 regulated field and we're quite callenged by the fact that it means we can't give AI's any sort of real access because of the security risk. To be complaint we'd have to have the AI agent ask permission for every single thing it does, before it does it, and foureye review it. Which is obviously never going to happen. We can discuss how bad the NIS2 foureye requirement works in the real world another time, but considering how easy it is to break AI security, it might not be something we can actually ever use. This makes sense on some of the stuff we work on, since it could bring an entire powerplant down. On the flip-side AI risks would be of little concern on a lot of our internal tools, which are basically non-regulated and unimportant enough that they can be down for a while without costing the business anything beyond annoyances.

This is where our challenges are. We've build our own chatbot where you can "build" your own agent within the librechat framework and add a "skill" to it. I say "skill" because it's older than claude skills but does exactly the same. I don't completely buy the authors:

> “deeply”, “in great details”, “intricacies”, “go through everything”

bit, but you can obviously save a lot of time by writing a piece of english which tells it what sort of environment you work in. It'll know that when I write Python I use UV, Ruff and Pyrefly and so on as an example. I personally also have a "skill" setting that tells the AI not to compliment me because I find that ridicilously annoying, and that certainly works. So who knows? Anyway, employees are going to want more. I've been doing some PoC's running open source models in isolation on a raspberry pi (we had spares because we use them in IoT projects) but it's hard to setup an isolation policy which can't be circumvented.

We'll have to figure it out though. For powerplant critical projects we don't want to use AI. But for the web tool that allows a couple of employees to upload three excel files from an external accountant and then generate some sort of report on them? Who cares who writes it or even what sort of quality it's written with? The lifecycle of that tool will probably be something that never changes until the external account does and then the tool dies. Not that it would have necessarily been written in worse quality without AI... I mean... Have you seen some of the stuff we've written in the past 40 years?

jamesmcq · 2026-02-22T01:51:54 1771725114

I want to be clear, I'm not against any use of AI. It's hugely useful to save a couple of minutes of "write this specific function to do this specific thing that I could write and know exactly what it would look like". That's a great use, and I use it all the time! It's better autocomplete. Anything beyond that is pushing it - at the moment! We'll see, but spending all day writing specs and double-checking AI output is not more productive than just writing correct code yourself the first time, even if you're AI-autocompleting some of it.

skeledrew · 2026-02-22T02:47:59 1771728479

For the last few days I've been working on a personal project that's been on ice for at least 6 years. Back when I first thought of the project and started implementing it, it took maybe a couple weeks to eke out some minimally working code.

This new version that I'm doing (from scratch with ChatGPT web) has a far more ambitious scope and is already at the "usable" point. Now I'm primarily solidifying things and increasing test coverage. And I've tested the key parts with IRL scenarios to validate that it's not just passing tests; the thing actually fulfills its intended function so far. Given the increased scope, I'm guessing it'd take me a few months to get to this point on my own, instead of under a week, and the quality wouldn't be where it is. Not saying I haven't had to wrangle with ChatGPT on a few bugs, but after a decent initial planning phase, my prompts now are primarily "Do it"s and "Continue"s. Would've likely already finished it if I wasn't copying things back and forth between browser and editor, and being forced to pause when I hit the message limit.

keyle · 2026-02-22T02:54:31 1771728871

This is a great come-back story. I have had a similar experience with a photoshop demake of mine.

I recommend to try out Opencode with this approach, you might find it less tiring than ChatGPT web (yes it works with your ChatGPT Plus sub).

phantomathkg · 2026-02-22T02:49:56 1771728596

Surely Addy Osmani can code. Even he suggests plan first.

https://news.ycombinator.com/item?id=46489061

stealthyllama · 2026-02-22T04:25:45 1771734345

There is a miscommunication happening, this entire time we all had surprisingly different ideas about what quality of work is acceptable which seems to account for differences of opinion on this stuff.

skydhash · 2026-02-22T03:02:43 1771729363

> planning and checking and prompting and orchestrating is far more work than just writing the code yourself.

This! Once I'm familiar with the codebase (which I strive to do very quickly), for most tickets, I usually have a plan by the time I've read the description. I can have a couple of implementation questions, but I knew where the info is located in the codebase. For things, I only have a vague idea, the whiteboard is where I go.

The nice thing with such a mental plan, you can start with a rougher version (like a drawing sketch). Like if I'm starting a new UI screen, I can put a placeholder text like "Hello, world", then work on navigation. Once that done, I can start to pull data, then I add mapping functions to have a view model,...

Each step is a verifiable milestone. Describing them is more mentally taxing than just writing the code (which is a flow state for me). Why? Because English is not fit to describe how computer works (try describe a finite state machine like navigation flow in natural languages). My mental mental model is already aligned to code, writing the solution in natural language is asking me to be ambiguous and unclear on purpose.

zahlman · 2026-02-22T06:13:53 1771740833

> After Claude writes the plan, I open it in my editor and add inline notes directly into the document. These notes correct assumptions, reject approaches, add constraints, or provide domain knowledge that Claude doesn’t have.

This is the part that seems most novel compared to what I've heard suggested before. And I have to admit I'm a bit skeptical. Would it not be better to modify what Claude has written directly, to make it correct, rather than adding the corrections as separate notes (and expecting future Claude to parse out which parts were past Claude and which parts were the operator, and handle the feedback graciously)?

At least, it seems like the intent is to do all of this in the same session, such that Claude has the context of the entire back-and-forth updating the plan. But that seems a bit unpleasant; I would think the file is there specifically to preserve context between sessions.

fendy3002 · 2026-02-22T06:36:01 1771742161

One reason why I don't do this: even I won't be immune to mistakes. When I fix it with new values or paths, for example, and the one I provided is wrong, it can worsen the future work.

Personally, I like to order claude one more time to update the plan file after I have given annotation, and review it again after. This will ensure (from my understanding) that claude won't treat my annotation as different instructions, thus risking the work being conflicted.

swe_dima · 2026-02-22T06:11:20 1771740680

Since everyone is showing their flow, here's mine:

* create a feature-name.md file in a gitignored folder

* start the file by giving the business context

* describe a high-level implementation and user flows

* describe database structure changes (I find it important not to leave it for interpretation)

* ask Claude to inspect the feature and review if for coherence, while answering its questions I ask to augment feature-name.md file with the answers

* enter Claude's plan mode and provide that feature-name.md file

* at this point it's detailed enough that rarely any corrections from me are needed

umairnadeem123 · 2026-02-22T04:26:28 1771734388

The multi-pass approach works outside of code too. I run a fairly complex automation pipeline (prompt -> script -> images -> audio -> video assembly) and the single biggest quality improvement was splitting generation into discrete planning and execution phases. One-shotting a 10-step pipeline means errors compound. Having the LLM first produce a structured plan, then executing each step against that plan with validation gates between them, cut my failure rate from maybe 40% to under 10%. The planning doc also becomes a reusable artifact you can iterate on without re-running everything.

brandall10 · 2026-02-22T01:54:13 1771725253

I go a bit further than this and have had great success with 3 doc types and 2 skills:

- Specs: these are generally static, but updatable as the project evolves. And they're broken out to an index file that gives a project overview, a high-level arch file, and files for all the main modules. Roughly ~1k lines of spec for 10k lines of code, and try to limit any particular spec file to 300 lines. I'm intimately familiar with every single line in these.

- Plans: these are the output of a planning session with an LLM. They point to the associated specs. These tend to be 100-300 lines and 3 to 5 phases.

- Working memory files: I use both a status.md (3-5 items per phase roughly 30 lines overall), which points to a latest plan, and a project_status (100-200 lines), which tracks the current state of the project and is instructed to compact past efforts to keep it lean)

- A planner skill I use w/ Gemini Pro to generate new plans. It essentially explains the specs/plans dichotomy, the role of the status files, and to review everything in the pertinent areas of code and give me a handful of high-level next set of features to address based on shortfalls in the specs or things noted in the project_status file. Based on what it presents, I select a feature or improvement to generate. Then it proceeds to generate a plan, updates a clean status.md that points to the plan, and adjusts project_status based on the state of the prior completed plan.

- An implementer skill in Codex that goes to town on a plan file. It's fairly simple, it just looks at status.md, which points to the plan, and of course the plan points to the relevant specs so it loads up context pretty efficiently.

I've tried the two main spec generation libraries, which were way overblown, and then I gave superpowers a shot... which was fine, but still too much. The above is all homegrown, and I've had much better success because it keeps the context lean and focused.

And I'm only on the $20 plans for Codex/Gemini vs. spending $100/month on CC for half year prior and move quicker w/ no stall outs due to token consumption, which was regularly happening w/ CC by the 5th day. Codex rarely dips below 70% available context when it puts up a PR after an execution run. Roughly 4/5 PRs are without issue, which is flipped against what I experienced with CC and only using planning mode.

jcurbo · 2026-02-22T03:13:44 1771730024

This is pretty much my approach. I started with some spec files for a project I'm working on right now, based on some academic papers I've written. I ended up going back and forth with Claude, building plans, pushing info back into the specs, expanding that out and I ended up with multiple spec/architecture/module documents. I got to the point where I ended up building my own system (using claude) to capture and generate artifacts, in more of a systems engineering style (e.g. following IEEE standards for conops, requirement documents, software definitions, test plans...). I don't use that for session-level planning; Claude's tools work fine for that. (I like superpowers, so far. It hasn't seemed too much)

I have found it to work very well with Claude by giving it context and guardrails. Basically I just tell it "follow the guidance docs" and it does. Couple that with intense testing and self-feedback mechanisms and you can easily keep Claude on track.

I have had the same experience with Codex and Claude as you in terms of token usage. But I haven't been happy with my Codex usage; Claude just feels like it's doing more of what I want in the way I want.

r1290 · 2026-02-22T01:59:03 1771725543

Looks good. Question - is it always better to use a monorepo in this new AI world? Vs breaking your app into separate repos? At my company we have like 6 repos all separate nextjs apps for the same user base. Trying to consolidate to one as it should make life easier overall.

throwup238 · 2026-02-22T02:15:11 1771726511

It really depends but there’s nothing stopping you from just creating a separate folder with the cloned repositories (or worktrees) that you need and having a root CLAUDE.md file that explains the directory structure and referencing the individual repo CLAUDE.md files.

oa335 · 2026-02-22T02:11:30 1771726290

Just put all the repos in all in one directory yourself. In my experience that works pretty well.

chickensong · 2026-02-22T02:23:13 1771726993

AI is happy to work with any directory you tell it to. Agent files can be applied anywhere.

wokwokwok · 2026-02-22T04:54:06 1771736046

This is the way.

The practice is:

- simple

- effective

- retains control and quality

Certainly the “unsupervised agent” workflows are getting a lot of attention right now, but they require a specific set of circumstances to be effective:

- clear validation loop (eg. Compile the kernel, here is gcc that does so correctly)

- ai enabled tooling (mcp / cli tool that will lint, test and provide feedback immediately)

- oversight to prevent sgents going off the rails (open area of research)

- an unlimited token budget

That means that most people can't use unsupervised agents.

Not that they dont work; Most people have simply not got an environment and task that is appropriate.

By comparison, anyone with cursor or claude can immediately start using this approach, or their own variant on it.

It does not require fancy tooling.

It does not require an arcane agent framework.

It works generally well across models.

This is one of those few genunie pieces of good practical advice for people getting into AI coding.

Simple. Obviously works once you start using it. No external dependencies. BYO tools to help with it, no “buy my AI startup xxx to help”. No “star my github so I can a job at $AI corp too”.

Great stuff.

basch · 2026-02-22T06:01:27 1771740087

Honesty this is just language models in general at the moment, and not just coding.

It’s the same reason adding a thinking step works.

You want to write a paper, you have it form a thesis and structure first. (In this one you might be better off asking for 20 and seeing if any of them are any good.) You want to research something, first you add gathering and filtering steps before synthesis.

Adding smarter words or telling it to be deeper does work by slightly repositioning where your query ends up in space.

Asking for the final product first right off the bat leads to repetitive verbose word salad. It just starts to loop back in on itself. Which is why temperature was a thing in the first place, and leads me to believe they’ve turned the temp down a bit to try and be more accurate. Add some randomness and variability to your prompts to compensate.

wazHFsRy · 2026-02-22T05:57:54 1771739874

Absolutely. And you can also always let the agent look back at the plan to check if it is still on track and aligned.

One step I added, that works great for me, is letting it write (api-level) tests after planning and before implementation. Then I’ll do a deep review and annotation of these tests and tweak them until everything is just right.

epec254 · 2026-02-22T05:11:24 1771737084

Huge +1. This loop consistently delivers great results for my vibe coding.

The “easy” path of “short prompt declaring what I want” works OK for simple tasks but consistently breaks down for medium to high complexity tasks.

apsurd · 2026-02-22T05:40:06 1771738806

Can you help me understand the difference between "short prompt for what I want (next)" vs medium to high complexity tasks?

What i mean is, in practice, how does one even get to a a high complexity task? What does that look like? Because isn't it more common that one sees only so far ahead?

dnautics · 2026-02-22T05:56:51 1771739811

It's more or less what comes out of the box with plan mode, plus a few extra bits?

raptorraver · 2026-02-22T06:15:22 1771740922

I’ve been using this same pattern, except not the research phase. Definetly will try to add it to my process aswell.

Sometimes when doing big task I ask claude to implement each phase seprately and review the code after each step.

lastdong · 2026-02-22T07:52:23 1771746743

Google Anti-Gravity has this process built in. This is essentially a cycle a developer would follow: plan/analyse - document/discuss - break down tasks/implement. We’ve been using requirements and design documents as best practice since leaving our teenage bedroom lab for the professional world. I suppose this could be seen as our coding agents coming of age.

Merad · 2026-02-22T06:13:08 1771740788

I've been working off and on on a vibe coded FP language and transpiler - mostly just to get more experience with Claude Code and see how it handles complex real world projects. I've settled on a very similar flow, though I use three documents: plan, context, task list. Multiple rounds of iteration when planning a feature. After completion, have a clean session do an audit to confirm that everything was implemented per the design. Then I have both Claude and CodeRabbit do code review passes before I finally do manual review. VERY heavy emphasis on tests, the project currently has 2x more test code than application code. So far it works surprisingly well. Example planning docs below -

https://github.com/mbcrawfo/vibefun/tree/main/.claude/archiv...

pgt · 2026-02-22T07:49:31 1771746571

My process is similar, but I recently added a new "critique the plan" feedback loop that is yielding good results. Steps:

1. Spec

2. Plan

3. Read the plan & tell it to fix its bad ideas.

4. (NB) Critique the plan (loop) & write a detailed report

5. Update the plan

6. Review and check the plan

7. Implement plan

Detailed here:

https://x.com/PetrusTheron/status/2016887552163119225

brumar · 2026-02-22T07:53:33 1771746813

Same. In my experience, the first plan always benefits from being challenged once or twice by claude itself.

nerdright · 2026-02-22T06:42:06 1771742526

Haha this is surprisingly and exactly how I use claude as well. Quite fascinating that we independently discovered the same workflow.

I maintain two directories: "docs/proposals" (for the research md files) and "docs/plans" (for the planning md files). For complex research files, I typically break them down into multiple planning md files so claude can implement one at a time.

A small difference in my workflow is that I use subagents during implementation to avoid context from filling up quickly.

brendanmc6 · 2026-02-22T06:54:40 1771743280

Same, I formalized a similar workflow for my team (oriented around feature requirement docs), I am thinking about fully productizing it and am looking to for feedback - https://acai.sh

Even if the product doesn’t resonate I think I’ve stumbled on some ideas you might find useful^

I do think spec-driven development is where this all goes. Still making up my mind though.

puchatek · 2026-02-22T08:06:48 1771747608

Spec-driven looks very much like what the author describes. He may have some tweaks of his own but they could just as well be coded into the artifacts that something like OpenSpec produces.

clouedoc · 2026-02-22T07:03:33 1771743813

This is basically long-lived specs that are used as tests to check that the product still adheres to the original idea that you wanted to implement, right?

This inspired me to finally write good old playwright tests for my website :).

throwaway7783 · 2026-02-22T06:05:32 1771740332

I have to give this a try. My current model for backend is the same as how author does frontend iteration. My friend does the research-plan-edit-implement loop, and there is no real difference between the quality of what I do and what he does. But I do like this just for how it serves as documentation of the thought process across AI/human, and can be added to version control. Instead of humans reviewing PRs, perhaps humans can review the research/plan document.

On the PR review front, I give Claude the ticket number and the branch (or PR) and ask it to review for correctness, bugs and design consistency. The prompt is always roughly the same for every PR. It does a very good job there too.

Modelwise, Opus 4.6 is scary good!

deevus · 2026-02-22T01:28:58 1771723738

This is what I do with the obra/superpowers[0] set of skills.

1. Use brainstorming to come up with the plan using the Socratic method

2. Write a high level design plan to file

3. I review the design plan

4. Write an implementation plan to file. We've already discussed this in detail, so usually it just needs skimming.

5. Use the worktree skill with subagent driven development skill

6. Agent does the work using subagents that for each task:

  a. Implements the task

  b. Spec reviews the completed task

  c. Code reviews the completed task

7. When all tasks complete: create a PR for me to review

8. Go back to the agent with any comments

9. If finished, delete the plan files and merge the PR

[0]: https://github.com/obra/superpowers

moribunda · 2026-02-22T06:42:08 1771742528

The crowd around this pot shows how superficial is knowledge about claude code. It gets releases each day and most of this is already built in the vanilla version. Not to mention subagent working in work trees, memory.md, plan on which you can comment directly from the interface, subagents launched in research phase, but also some basic mcp's like LSP/IDE integration, and context7 to not to be stuck in the knowledge cutoff/past.

When you go to YouTube and search for stuff like "7 levels of claude code" this post would be maybe 3-4.

Oh, one more thing - quality is not consistent, so be ready for 2-3 rounds of "are you happy with the code you wrote" and defining audit skills crafted for your application domain - like for example RODO/Compliance audit etc.

deevus · 2026-02-22T07:12:07 1771744327

I'm using the in-built features as well, but I like the flow that I have with superpowers. You've made a lot of assumptions with your comment that are just not true (at least for me).

I find that brainstorming + (executing plans OR subagent driven development) is way more reliable than the built-in tooling.

ramoz · 2026-02-22T01:32:14 1771723934

If you’ve ever desired the ability for annotating the plan more visually, try fitting Plannotator in this workflow. There is a slash command for use when you use custom workflows outside of normal plan mode.

https://github.com/backnotprop/plannotator

deevus · 2026-02-22T01:34:57 1771724097

I'll give this a try. Thanks for the suggestion.

turingsroot · 2026-02-22T05:01:05 1771736465

I've been teaching AI coding tool workshops for the past year and this planning-first approach is by far the most reliable pattern I've seen across skill levels.

The key insight that most people miss: this isn't a new workflow invented for AI - it's how good senior engineers already work. You read the code deeply, write a design doc, get buy-in, then implement. The AI just makes the implementation phase dramatically faster.

What I've found interesting is that the people who struggle most with AI coding tools are often junior devs who never developed the habit of planning before coding. They jump straight to "build me X" and get frustrated when the output is a mess. Meanwhile, engineers with 10+ years of experience who are used to writing design docs and reviewing code pick it up almost instantly - because the hard part was always the planning, not the typing.

One addition I'd make to this workflow: version your research.md and plan.md files in git alongside your code. They become incredibly valuable documentation for future maintainers (including future-you) trying to understand why certain architectural decisions were made.

dennisjoseph · 2026-02-22T03:49:11 1771732151

The annotation cycle is the key insight for me. Treating the plan as a living doc you iterate on before touching any code makes a huge difference in output quality.

Experimentally, i've been using mfbt.ai [https://mfbt.ai] for roughly the same thing in a team context. it lets you collaboratively nail down the spec with AI before handing off to a coding agent via MCP.

Avoids the "everyone has a slightly different plan.md on their machine" problem. Still early days but it's been a nice fit for this kind of workflow.

minikomi · 2026-02-22T03:54:58 1771732498

I agree, and this is why I tend to use gptel in emacs for planning - the document is the conversation context, and can be edited and annotated as you like.

zitrusfrucht · 2026-02-22T01:13:01 1771722781

I do something very similar, also with Claude and Codex, because the workflow is controlled by me, not by the tool. But instead of plan.md I use a ticket system basically like ticket_<number>_<slug>.md where I let the agent create the ticket from a chat, correct and annotate it afterwards and send it back, sometimes to a new agent instance. This workflow helps me keeping track of what has been done over time in the projects I work on. Also this approach does not need any „real“ ticket system tooling/mcp/skill/whatever since it works purely on text files.

gbnwl · 2026-02-22T01:24:03 1771723443

+1 to creating tickets by simply asking the agent to. It's worked great and larger tasks can be broken down into smaller subtasks that could reasonably be completed in a single context window, so you rarely every have to deal with compaction. Especially in the last few months since Claude's gotten good at dispatching agents to handle tasks if you ask it to, I can plan large changes that span multilpe tickets and tell claude to dispatch agents as needed to handle them (which it will do in parallel if they mostly touch different files), keeping the main chat relatively clean for orchestration and validation work.

ramoz · 2026-02-22T02:43:30 1771728210

semantic plan name is important

paradite · 2026-02-22T05:25:34 1771737934

Lol I wrote about this and been using plan+execute workflow for 8 months.

Sadly my post didn't much attention at the time.

https://thegroundtruth.media/p/my-claude-code-workflow-and-p...

srid · 2026-02-22T01:17:03 1771723023

Regarding inline notes, I use a specific format in the `/plan` command, by using th `ME:` prefix.

https://github.com/srid/AI/blob/master/commands/plan.md#2-pl...

It works very similar to Antigravity's plan document comment-refine cycle.

https://antigravity.google/docs/implementation-plan

mukundesh · 2026-02-22T05:14:57 1771737297

https://github.blog/ai-and-ml/generative-ai/spec-driven-deve...

efnx · 2026-02-22T06:08:19 1771740499

I’ve been using Claude through opencode, and I figured this was just how it does it. I figured everyone else did it this way as well. I guess not!

cadamsdotcom · 2026-02-22T03:47:55 1771732075

The author is quite far on their journey but would benefit from writing simple scripts to enforce invariants in their codebase. Invariant broken? Script exits with a non-zero exit code and some output that tells the agent how to address the problem. Scripts are deterministic, run in milliseconds, and use zero tokens. Put them in husky or pre-commit, install the git hooks, and your agent won’t be able to commit without all your scripts succeeding.

And “Don’t change this function signature” should be enforced not by anticipating that your coding agent “might change this function signature so we better warn it not to” but rather via an end to end test that fails if the function signature is changed (because the other code that needs it not to change now has an error). That takes the author out of the loop and they can not watch for the change in order to issue said correction, and instead sip coffee while the agent observes that it caused a test failure then corrects it without intervention, probably by rolling back the function signature change and changing something else.

armanj · 2026-02-22T04:10:59 1771733459

> “remove this section entirely, we don’t need caching here” — rejecting a proposed approach

I wonder why you don't remove it yourself. Aren't you already editing the plan?

amarant · 2026-02-22T03:53:21 1771732401

Interesting! I feel like I'm learning to code all over again! I've only been using Claude for a little more than a month and until now I've been figuring things out on my own. Building my methodology from scratch. This is much more advanced than what I'm doing. I've been going straight to implementation, but doing one very small and limited feature at a time, describing implementation details (data structures like this, use that API here, import this library etc) verifying it manually, and having Claude fix things I don't like. I had just started getting annoyed that it would make the same (or very similar) mistake over and over again and I would have to fix it every time. This seems like it'll solve that problem I had only just identified! Neat!

cheekyant · 2026-02-22T07:21:13 1771744873

It seems like the annotation of plan files is the key step.

Claude Code now creates persistent markdown plan files in ~/.claude/plans/ and you can open them with Ctrl-G to annotate them in your default editor.

So plan mode is not ephemeral any more.

RHSeeger · 2026-02-22T02:21:20 1771726880

> Most developers type a prompt, sometimes use plan mode, fix the errors, repeat.

> ...

> never let Claude write code until you’ve reviewed and approved a written plan

I certainly always work towards an approved plan before I let it lost on changing the code. I just assumed most people did, honestly. Admittedly, sometimes there's "phases" to the implementation (because some parts can be figured out later and it's more important to get the key parts up and running first), but each phase gets a full, reviewed plan before I tell it to go.

In fact, I just finished writing a command and instruction to tell claude that, when it presents a plan for implementation, offer me another option; to write out the current (important parts of the) context and the full plan to individual (ticket specific) md files. That way, if something goes wrong with the implementation I can tell it to read those files and "start from where they left off" in the planning.

ramoz · 2026-02-22T02:46:09 1771728369

The author seems to think theyve invented a special workflow...

We all tend to regress to average (same thoughts/workflows)...

Have had many users already doing the exact same workflow with: https://github.com/backnotprop/plannotator

CGamesPlay · 2026-02-22T03:00:06 1771729206

4 times in one thread, please stop spamming this link.

kulikalov · 2026-02-22T06:13:40 1771740820

I came to the exact same pattern, with one extra heuristic at the end: spin up a new claude instance after the implementation is complete and ask it to find discrepancies between the plan and the implementation.

strix_varius · 2026-02-22T06:14:22 1771740862

The baffling part of the article is all the assertions about how this is unique, novel, not the typical way people are doing this etc.

There are whole products wrapped around this common workflow already (like Augment Intent).

cowlby · 2026-02-22T01:58:36 1771725516

I recently discovered GitHub speckit which separates planning/execution in stages: specify, plan, tasks, implement. Finding it aligns with the OP with the level of “focus” and “attention” this gets out of Claude Code.

Speckit is worth trying as it automates what is being described here, and with Opus 4.6 it's been a kind of BC/AD moment for me.

wangzhongwang · 2026-02-22T05:38:58 1771738738

Interesting approach. The separation of planning and execution is crucial, but I think there's a missing layer most people overlook: permission boundaries between the two phases.

Right now when Claude Code (or any agent) executes a plan, it typically has the same broad permissions for every step. But ideally, each execution step should only have access to the specific tools and files it needs — least privilege, applied to AI workflows.

I've been experimenting with declarative permission manifests for agent tasks. Instead of giving the agent blanket access, you define upfront what each skill can read, write, and execute. Makes the planning phase more constrained but the execution phase much safer.

Anyone else thinking about this from a security-first angle?

DevEx7 · 2026-02-22T05:01:13 1771736473

I’m a big fan of having the model create a GitHub issue directly (using the GH CLI) with the exact plan it generates, instead of creating a markdown file that will eventually get deleted. It gives me a permanent record and makes it easy to reference and close the issue once the PR is ready.

Frannky · 2026-02-22T03:50:32 1771732232

I tried Opus 4.6 recently and it’s really good. I had ditched Claude a long time ago for Grok + Gemini + OpenCode with Chinese models. I used Grok/Gemini for planning and core files, and OpenCode for setup, running, deploying, and editing.

However, Opus made me rethink my entire workflow. Now, I do it like this:

* PRD (Product Requirements Document)

* main.py + requirements.txt + readme.md (I ask for minimal, functional, modular code that fits the main.py)

* Ask for a step-by-step ordered plan

* Ask to focus on one step at a time

The super powerful thing is that I don’t get stuck on missing accounts, keys, etc. Everything is ordered and runs smoothly. I go rapidly from idea to working product, and it’s incredibly easy to iterate if I figure out new features are required while testing. I also have GLM via OpenCode, but I mainly use it for "dumb" tasks.

Interestingly, for reasoning capabilities regarding standard logic inside the code, I found Gemini 3 Flash to be very good and relatively cheap. I don't use Claude Code for the actual coding because forcing everything via chat into a main.py encourages minimal code that's easy to skim—it gives me a clearer representation of the feature space

growt · 2026-02-22T07:36:53 1771745813

That is just spec driven development without a spec, starting with the plan step instead.

rotbart · 2026-02-22T06:07:18 1771740438

This is a similar workflow to speckit, kiro, gsd, etc.

geoffbp · 2026-02-22T07:11:21 1771744281

It’s worrying to me that nobody really knows how LLMs work. We create prompts with or without certain words and hope it works. That’s my perspective anyway

mannyv · 2026-02-22T07:14:52 1771744492

It's actually no different from how real software is made. Requirements come from the business side, and through an odd game of telephone get down to developers.

The team that has developers closest to the customer usually makes the better product...or has the better product/market fit.

Then it's iteration.

solumunus · 2026-02-22T07:13:50 1771744430

It's the same as dealing with a human. You convey a spec for a problem and the language you use matters. You can convey the problem in (from your perspective) a clear way and you will get mixed results nonetheless. You will have to continue to refine the solution with them.

Genuinely: no one really knows how humans work either.