> We recommend using the word "think" to trigger extended thinking mode, which gives Claude additional computation time to evaluate alternatives more thoroughly. These specific phrases are mapped directly to increasing levels of thinking budget in the system: "think" < "think hard" < "think harder" < "ultrathink." Each level allocates progressively more thinking budget for Claude to use.
I had a poke around and it's not a feature of the Claude model, it's specific to Claude Code. There's a "megathink" option too - it uses code that looks like this:
let B = W.message.content.toLowerCase();
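// Top tier: these phrases map to a 31,999-token thinking budget (the "ultrathink" level)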
if (
B.includes("think harder") ||
B.includes("think intensely") ||
B.includes("think longer") ||
B.includes("think really hard") ||
B.includes("think super hard") ||
B.includes("think very hard") ||
B.includes("ultrathink")
)
return (
l1("tengu_thinking", { tokenCount: 31999, messageId: Z, provider: G }),
31999
);
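// Middle tier: these phrases map to a 1e4 (10,000) token thinking budget (the "megathink" level)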
if (
B.includes("think about it") ||
B.includes("think a lot") ||
B.includes("think deeply") ||
B.includes("think hard") ||
B.includes("think more") ||
B.includes("megathink")
)
return (
l1("tengu_thinking", { tokenCount: 1e4, messageId: Z, provider: G }), 1e4
);
Not gonna lie: the "ultrathink" keyword that Sonnet 3.7 with thinking tokens watches for gives me "doubleplusgood" vibes in a hilarious but horrifying way.
Surprised that "controlling cost" isn't a section in this post. Here's my attempt.
---
If you get the hang of controlling costs, it's much cheaper. If you're exhausting the context window, I would not be surprised if you're seeing high cost.
Be aware of the "cache".
Tell it to read specific files (and only those!). If you don't, it'll read unnecessary files, repeatedly re-read sections of files, or even search through files.
Avoid letting it search - even halt it. Find / rg can produce thousands of tokens of output depending on the search.
Never edit files manually during a session (that'll bust cache). THIS INCLUDES LINT.
The cache also goes away after 5-15 minutes or so (not sure) - so avoid leaving sessions open and coming back later.
Never use /compact (that'll bust cache, if you need to, you're going back and forth too much or using too many files at once).
Don't let files get too big (it's good hygiene too) to keep the context window sizes smaller.
Have a clear goal in mind and keep sessions to as few messages as possible.
Write / generate markdown files with needed documentation using claude.ai, and save those as files in the repo and tell it to read that file as part of a question.
I'm at about ~$0.5-0.75 for most "tasks" I give it. I'm not a super heavy user, but it definitely helps me (it's like having a super focused smart intern that makes dumb mistakes).
If i need to feed it a ton of docs etc. for some task, it'll be more in the few $, rather than < $1. But I really only do this to try some prototype with a library claude doesn't know about (or is outdated).
For hobby stuff, it adds up - totally.
For a company, massively worth it. Insanely cheap productivity boost (if developers are responsible / don't get lazy / don't misuse it).
If I have to be so cautious while using a tool might as well write the code myself lol.
I’ve used Claude Code extensively and it is one of the best AI IDEs. It just gets things done.
The only downside is the cost. I was averaging $35-$40/day. At this cost, I’d rather just use Cursor/Windsurf.
Not having to specify files is a humongous feature for me. Having to remember which file code is in is half the work once you pass a certain codebase size.
That sometimes works, sometimes doesn't, and takes 10x the time. Same with codex. I would have both and switch between them depending on which one you feel will get it right.
Yeah, I tried CC out and quickly noticed it was spending $5+ for simple LLM capable tasks. I rarely break $1-2 a session using aider. Aider feels like more of a precision tool. I like having the ability to manually specify.
I do find Claude Code to be really good at exploration though - like checking out a repository I'm unfamiliar with and then asking questions about it.
After switching to Aider, I realized the other tools have been playing elaborate games to choose cheaper models and to limit files and messages in context, both of which increase their bills.
Aider is a great tool. I do love it. But I find I have to do more with it to get the same output as Claude Code (no matter what LLM I used with Aider). Sure it may end up being cheaper per run, but not when my time is factored in.
The flip side is I find Aider much easier to limit.
Get an OpenRouter account and you can play with almost all providers. I was burning money on Claude, so I tried V3 (I blocked the DeepSeek provider for being flaky; let the laypeople mock them) and the experimental and GA Gemini models.
The cost of the task scales with how long it takes, plus or minus.
Substitute “cost” with “time” in the above post and all of the same tips are still valuable.
I don’t do much agentic LLM coding but the speed (or lack thereof) was one of my least favorite parts. Using any tricks that narrow scope, prevent reprocessing files over and over again, or searching through the codebase are all helpful even if you don’t care about the dollar amount.
Hard agree. Whether it's 50 cents or 10 dollars per session, I'm using it to get work done for the sake of quickly completing work that aims to unblock many orders of magnitude more value. But in so far as cheaper correct sessions correlate with sessions where the problem solving was more efficient anyhow, they're fairly solid tips.
I agree, but optimisation often reveals implementation details that help you understand the limits of the current tech. It might not be worth the time, but part of engineering is optimisation and another part is deep understanding of the tech. It is sometimes worth optimising anyway if you want to take the engineering discipline to the next level within yourself.
I myself hadn't thought about not running linters, but it makes obvious sense now, and it gives me insight into how Claude Code works that I can apply in related engineering work.
Exactly. I've been using the chat gpt desktop app not because of the model quality but because of the UX. It basically seamlessly integrates with my IDEs (intellij and vs code). Mostly I just do stuff like select a few lines, hit option+shift+1, and say something like "fix this". Nice short prompt and I get the answer relatively quickly. Option+shift+1 opens chat gpt with the open file already added to the context. It sees what lines are selected. And it also sees the output of any test runs on the consoles. So just me saying "fix this" now has a rich context that I don't need to micromanage.
Mostly I just use the 4o model instead of the newer better models because it is faster. It's good enough mostly and I prefer getting a good enough answer quickly than the perfect answer after a few minutes. Mostly what I ask is not rocket science so perfect is the enemy of good here. I rarely have to escalate to better models. The reasoning models are annoyingly slow. Especially when they go down the wrong track, which happens a lot.
And my cost is a predictable 20$/month. The downside is that the scope of what I can ask is more limited. I'd like it to be able to "see" my whole code base instead of just 1 file and for me to not have to micro manage what the model looks at. Claude can do that if you don't care about money. But if you do, you are basically micro managing context. That sounds like monkey work that somebody should automate. And it shouldn't require an Einstein sized artificial brain to do that.
There must be people that are experimenting with using locally running more limited AI models to do all the micromanaging that then escalate to remote models as needed. That's more or less what Apple pitched for Apple AI at some point. Sounds like a good path forward. I'd be curious to learn about coding tools that do something like that.
In terms of cost, I don't actually think it's unreasonable to spend a few hundred dollars per month on this stuff. But I question the added value over the 20$ I'm spending. I don't think the improvement is 20x better. more like 1.5x. And I don't like the unpredictability of this and having to think about how expensive a question is going to be.
I think a lot of the short term improvement is going to be a mix of UX and predictable cost. Currently the tools are still very clunky and a bit dumb. The competition is going to be about predictable speed, cost and quality. There's a lot of room for improvement here.
It usually does, just with a time delay and a strict condition that the firm you work at can actually commercialize your productivity. Apply your systems thinking skills to compensation and it will all make sense.
It's interesting that this is a problem for people because I have never spent more than about $0.50 on a task with Claude Code. I have pretty good code hygiene and I tell Claude what to do with clear instructions and guidelines, and Claude does it. I will usually go through a few revisions and then just change anything myself if I find it not quite working. It's exactly like having an eager intern.
I don't think about controlling cost because I price my time at US$40/h and virtually all models are cheaper than that (with the exception of o1 or Gemini 2.5 pro).
If I spend $2 instead of $0.50 on a session but I had to spend 6 minutes thinking about context, I haven't gained any money.
If your expectation is to produce the same amount of output, you could argue when paying for AI tools, you're choosing to spend money to gain free time.
4 hours coding project X or 3 hours and a short hike with your partner / friends etc
If what I'm doing doesn't have a positive expected value, the correct move isn't to use inferior dev tooling to save money, it's to stop working on it entirely.
I assume they use a conversation, so if you compress the prompt immediately you should only break cache once, and still hit cache on subsequent prompts?
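For what it's worth, the "cache" here is presumably Anthropic's prompt caching, which is prefix-based: as long as each new request re-sends a byte-identical prefix (system prompt, files already read, earlier turns), that prefix is billed at the much cheaper cache-read rate, and a cache entry expires after roughly five minutes without being reused. That's why manual edits, lint runs, and /compact bust it - and yes, a /compact should only miss the cache once, after which the new, shorter prefix can be cached again. A minimal sketch of the mechanism against the raw Messages API (the model id and variable names are illustrative, not Claude Code's actual internals):

const bigStableContext = "<CLAUDE.md, file contents already read, tool instructions, ...>";
const conversationSoFar = []; // prior turns, append-only

const response = await fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "x-api-key": process.env.ANTHROPIC_API_KEY,
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
  },
  body: JSON.stringify({
    model: "claude-3-7-sonnet-latest", // placeholder model id
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: bigStableContext,
        // Everything up to and including this block can be served from the prompt cache,
        // but only if it is byte-identical to what the previous request sent.
        cache_control: { type: "ephemeral" },
      },
    ],
    // Editing any earlier turn (or the files baked into the prefix) invalidates the cache.
    messages: conversationSoFar.concat([{ role: "user", content: "Next instruction..." }]),
  }),
});
console.log(await response.json());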
I pretty much one shot a scraper from an old Joomla site with 200+ articles to a new WP site, including all users and assets, and converting all the PDFs to articles. It cost me like $3 in tokens.
I guess the question then is: can't VSCode Copilot do the same for a fixed $20/month? It even has access to all SOTA models like Claude 3.7, Gemini 2.5 Pro and GPT o3
Vscode’s agent mode in copilot (even in the insider’s nightly) is a bit rough in my experience: lots of 500 errors, stalls, and outright failures to follow tasks (as if there’s a mismatch between what the ui says it will include in context vs what gets fed to the LLM).
I would have thought so, but somehow no. I have a cursor subscription with access to all of those models, and I still consistently get better results from claude code.
no it's a few hundred lines of python to parse weird and inconsistent HTML into json files and CSV files, and then a sync script that can call the WP API to create all the authors as needed, update the articles, and migrate the images
> Never edit files manually during a session (that'll bust cache). THIS INCLUDES LINT
Yesterday I gave up and disabled my format-on-save config within VSCode. It was burning way too many tokens with unnecessary file reads after failed diffs. The LLMs still have a decent number of failed diffs, but it helps a lot.
Some tools take more effort to hold properly than others. I'm not saying there's not a lot of room for improvement - or that the UX couldn't hold the user's hand more to force things like this in some "assisted mode" - but at the end of the day, it's a thin, useful wrapper around an LLM, and LLMs require effort to use effectively.
I definitely get value out of it - more than any other tool like it that I've tried.
Think about what you would do in an unfamiliar project with no context and the ticket
"please fix the authorization bug in /api/users/:id".
You'd start by grepping the code base and trying to understand it.
Compare that to, "fix the permission in src/controllers/users.ts in the function `getById`. We need to check the user in the JWT is the same user that is being requested"
On a shorter timeline than you'd think, working with these tools won't look like this at all.
You'll be prompting, evaluating, and iterating on entirely finished pieces of software, and you'll be able to see multiple attempts at each solve at once - none of this deep-in-the-weeds bug-fixing stuff.
We're rapidly approaching a world where a lot of software will be made without an engineering hire at all. Maybe not the hardest, most complex, or most novel software, but a lot of software that previously required a team of 3-15 won't have a single dev.
> So, AIs are overeager junior developers at best, and not the magical programmer replacements they are advertised as.
This may be a quick quip or a rant. But the things we say have a way of reinforcing how we think. So I suggest refining until what we say cuts to the core of the matter. The claim above is a false dichotomy. Let's put aside advertisements and hype. Trying to map between AI capabilities and human ones is complicated. There is high quality writing on this to be found. I recommend reading literature reviews on evals.
Don’t be a dismissive dick; that’s not appropriate for this forum. The above post is clearly trying to engage thoughtfully and offers genuinely good advice.
I’m thinking you might be a kind of person that requires very direct feedback. Your flagged comment was unkind and unhelpful. Your follow-up response seems to suggest that you were justified in being rude?
You also mischaracterize my comment two levels up. It didn’t wave you away by saying “just google it”. It said — perhaps not directly enough — that your comment was off track and gave you some ideas to consider and directions to explore.
> There is high quality writing on this to be found. I recommend reading literature reviews on evals.
This is, quite literally, "just google it".
And yes, I prefer direct feedback, not vague philosophical and pseudo-philosophical statements and vague references. I'm sure there's high quality writing to be found on this, too.
We have very different ideas of what "literal" means. You _interpreted_ what I wrote as "just Google it". I didn't say those words verbatim _nor_ do I mean that. Use a search engine if you want to find some high-quality papers. Or use Google Scholar. Or go straight to Arxiv. Or ask people on a forum.
> not vague philosophical and pseudo-philosophical statements and vague references
If you stop being so uncharitable, more people might be inclined to engage you. Try to interpret what I wrote as constructive criticism.
Shall we get back to the object level? You wrote:
> AIs are overeager junior developers at best
Again, I'm saying this isn't a good framing. I'm asking you to consider you might be wrong. You don't need to hunker down. You don't need to counter-attack. Instead, you could do more reading and research.
> We have very different ideas of what "literal" means. You _interpreted_ what I wrote as "just Google it". I didn't say those words verbatim _nor_ do I mean that. Use a search engine if you want to find some high-quality papers. Or use Google Scholar. Or go straight to Arxiv. Or ask people on a forum.
Aka "I will make some vague references to some literature, go Google it"
> Instead, you could do more reading and research.
Instead of vague "just google it", and vague ad hominems you could actually provide constructive feedback.
The grandparent is talking about how to control cost by focusing the tool. My response was to a comment about how that takes too much thinking.
If you give a junior an overly broad prompt, they are going to have to do a ton of searching and reading to find out what they need to do. If you give them specific instructions, including files, they are more likely to get it right.
I never said they were replacements. At best, they're tools that are incredibly effective when used on the correct type of problem with the right type of prompt.
I have been quite skeptical of using AI tools, and my experiences using them for developing software have been frustrating, but power tools usually come with a learning curve, while a "good product" with a clean, simplified interface often results in reduced capability.
Vim, Emacs and Excel are obvious power tools which may require you to think, but they often produce unrivalled productivity for power users.
So I don't think the verdict that the product has a bad UI is fair. Natural language interfaces are such a step up from old school APIs with countless flags and parameters.
Mh. Like, I'm deeply impressed by what these AI assistants can do by now. But the list in the parent comment there is very similar to my mental check-list for pair-programming / pair-admin'ing with less experienced people.
I guess "context length" in AIs is what I intuitively tracked with people already. It can be a struggle to connect the Zabbix alert, the ticket and the situation on the system already, even if you don't track down all the Zabbix code and scripts. And then we throw in Ansible configuring the thing, and then the business requirements by more- or less-controlled dev-teams. And then you realize dev is controlled by impossible sales-terms.
These are scope -- or I guess context -- expansions that cause people to struggle.
GitHub copilot follows your context perfectly. I don't have to tell it anything about files. I tried this initially and it just screwed up the results.
> GitHub copilot follows your context perfectly. I don't have to tell it anything about files. I tried this initially and it just screwed up the results.
Just to make sure we're on the same page. There are two things in play. First, a language model's ability to know what file you are referring to. Second, an assistant's ability to make sure the right file is in the context window. In your experience, how does Claude Code compare to Copilot w.r.t (1) and (2)?
I've developed a new mental model of the LLM codebase automation solutions. These are effectively identical to outsourcing your product to someone like Infosys. From an information theory perspective, you need to communicate approximately the same amount of things in either case.
Tweaking claude.md files until the desired result is achieved is similar to a back and forth email chain with the contractor. The difference being that the contractor can be held accountable in our human legal system and can be made to follow their "prompt" very strictly. The LLM has its own advantages, but they seem to be a subset since the human contractor can also utilize an LLM.
Those who get a lot of uplift out of the models are almost certainly using them in a cybernetic manner wherein the model is an integral part of an expert's thinking loop regarding the program/problem. Defining a pile of policies and having the LLM apply them to a codebase automatically is a significantly less impactful use of the technology than having a skilled human developer leverage it for immediate questions and code snippets as part of their normal iterative development flow.
If you've got so much code that you need to automate eyeballs over it, you are probably in a death spiral already. The LLM doesn't care about the terrain warnings. It can't "pull up".
We, mere humans, communicate our needs poorly, and undervisualize until we see concrete results. This is the state of us.
Faced with us as a client, the LLM has infinite patience at linear but marginal cost (relative to your thinking/design time cost, and the value of instant iteration as you realize what you meant to picture and say).
With offshoring, telling them they're getting it wrong is not just horrifically slow thanks to comms and comprehension latency, it makes you a problem client, until soon you'll find the do-over cost becomes neither linear nor marginal.
Don't sleep on the power of small fast iterations (not vibes, concrete iterations), with an LLM tool that commits as you go and can roll back both code and mental model when you're down a garden path.
The benefit of doing it like this is that I also get to learn from the LLM. It will surprise me from time to time about things I didn't know and it gives me a chance to learn and get better as well.
So I have been using Cursor a lot more in a vibe code way lately and I have been coming across what a lot of people report: sometimes the model will rewrite perfectly working code that I didn't ask it to touch and break it.
In most cases, it is because I am asking the model to do too much at once. Which is fine, I am learning the right level of abstraction/instruction where the model is effective consistently.
But when I read these best practices, I can't help but think of the cost. The multiple CLAUDE.md files, the files of context, the urls to documentation, the planning steps, the tests. And then the iteration on the code until it passes the test, then fixing up linter errors, then running an adversarial model as a code review, then generating the PR.
It makes me want to find a way to work at Anthropic so I can learn to do all of that without spending $100 per PR. Each of the steps in that last paragraph is an expensive API call for us ISV and each requires experimentation to get the right level of abstraction/instruction.
I want to advocate to Anthropic for a scholarship program for devs (I'd volunteer, lol) where they give credits to Claude in exchange for public usage. This would be structured similar to creator programs for image/audio/video gen-ai companies (e.g. runway, kling, midjourney) where they bring on heavy users that also post to social media (e.g. X, TikTok, Twitch) and they get heavily discounted (or even free) usage in exchange for promoting the product.
Why do you think it's supposed to be cheap? Developers are expensive. Claude doesn't have to be cheap to make software development quicker and cheaper. It just has to be cheaper than you.
There are ways to use LLMs cheaply, but it will always be expensive to get the most out of them. In fact, the top end will only get more and more costly as the lengths of tasks AIs can successfully complete grows.
I am not implying in any sense a value judgement on cost. I'm stating my emotions at the realization of the cost and how that affects my ability to use the available tools in my own education.
It would be no different than me saying "it sucks university is so expensive, I wish I could afford to go to an expensive college but I don't have a scholarship" and someone then answers: why should it be cheap.
So, allow me the space to express my feelings and propose alternatives, of which scholarships are one example and creative programs are another. Another one I didn't mention would be the same route as universities force now: I could take out a loan. And I could consider it an investment loan with the idea it will pay back either in employment prospects or through the development of an application that earns me money. Other alternatives would be finding employment at a company willing to invest that $100/day through me, the limit of that alternative being working at an actual foundational model company for presumably unlimited usage.
And of course, I could focus my personal education on squeezing the most value for the least cost. But I believe the balance point between slightly useful and completely transformative usages levels is probably at a higher cost level than I can reasonably afford as an independent.
There's an ocean of B2B SaaS services that would save customers money compared to building poor imitations in-house. Despite the Joel Test (almost 25 years old! crazy...) asking whether you buy your developers the best tools that money can buy, because they're almost invariably cheaper than developer salaries, the fact remains that most companies treat salaries as a fixed cost and everything else threatens the limited budget they have.
Anybody who has ever tried to sell developer tooling knows, you're competing with free/open-source solutions, and it ain't a fair fight.
> So I have been using Cursor a lot more in a vibe code way lately and I have been coming across what a lot of people report: sometimes the model will rewrite perfectly working code that I didn't ask it to touch and break it.
I don't find this particularly problematic because I can quickly see the unnecessary changes in git and revert them.
Like, I guess it would be nice if I didn't have to do that, but compared to the value I'm getting it's not a big deal.
I agree with this in the general sense but of course I would like to minimize the thrash.
I have become obsessive about doing git commits in the way I used to obsess over Ctrl-S before the days of source control. As soon as I get to a point I am happy with, I get the LLM to do a checkpoint check-in so I can minimize the cost of doing a full directory revert.
But from a time and cost perspective, I could be doing much better. I've internalized the idea that when the LLM goes off the rails it was my fault. I should have prompted it better. So I now ask myself: how do I get better faster? And the answer is I do it as much as I can to learn.
I don't just want to whine about the process. I want to use that frustration to help me improve, while avoiding going bankrupt.
i think this is particularly claude 3.7 behavior - at least in my experience, it's ... eager. overeager. smarter than 3."6" but still, it has little chill. gemini is better; o3 better yet. I'm mostly off claude as a daily driver coding assistant, but it had a really long run - longest so far.
I haven't used Aider yet, but I see it show up on HN frequently recently (the last couple of days specifically).
I am hesitant because I am paying for Cursor now and I get a lot of model usage included within that monthly cost. I'm cheap, perhaps to a fault even when I could afford it, and I hate the idea of spending twice when spending once is usually enough. So while Aider is potentially cheaper than Claude Code, it is still more than what I am already paying.
I would appreciate any comments on people who have made the switch from Cursor to Aider. Are you paying more/less? If you are paying more, do you feel the added value is worth the additional cost? If you are paying less, do you feel you are getting less, the same or even more?
So I feel like a grandpa reading this. I gave Claude Code a solid shot. Had some wins but costs started blowing up. I switched to Gemini AI where I only upload files I want it to work on and make sure to refactor often so modularity remains fairly high. It's an amazing experience. If this is any measure - I've been averaging about 5-6 "small features" per 10k tokens. And I totally suck at fe coding!! The other interesting aspect of doing it this way is being able to break up problems and concerns. For example in this case I only worked on the fe without any backend and fleshed it out before starting on the backend.
A combination that works nicely to solve bugs is: 1) have Gemini analyze the code and the problem, 2) ask it to create a prompt for Claude to fix the problem, 3) give Claude the markdown prompt and the code, 4) give Gemini the output from Claude to review, 5) repeat if necessary
I like Gemini for "architect" roles, it has very good code recall (almost no hallucinations, or none lately), so it can successfully review code edits by Claude. I also find it useful to ground it with Google Search.
Gemini's context is very long, so I can feed it full files. I do the same with Claude, but I may need to start from scratch various times, so Gemini serves as memory (and is also good that Gemini has almost no hallucinations, so it's great as a code reviewer for Claude's edits).
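If you want to script that loop instead of copy-pasting between two chat windows, a rough sketch of the Gemini-plans / Claude-implements / Gemini-reviews cycle described above could look like this. The model ids, prompts, and the "APPROVED" convention are my own illustrative choices, not anything either vendor prescribes:

async function askGemini(prompt) {
  const res = await fetch(
    `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:generateContent?key=${process.env.GEMINI_API_KEY}`,
    {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ contents: [{ role: "user", parts: [{ text: prompt }] }] }),
    },
  );
  return (await res.json()).candidates[0].content.parts[0].text;
}

async function askClaude(prompt) {
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-3-7-sonnet-latest", // placeholder model id
      max_tokens: 4096,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return (await res.json()).content[0].text;
}

// 1) Gemini analyzes and writes a fix prompt, 2) Claude implements,
// 3) Gemini reviews Claude's output, 4) repeat until the review comes back clean.
async function fixBug(code, bugReport, maxRounds = 3) {
  let plan = await askGemini(
    `Analyze this code and bug, then write a precise prompt for another model to fix it.\n\nBug: ${bugReport}\n\n${code}`,
  );
  for (let round = 0; round < maxRounds; round++) {
    const patch = await askClaude(`${plan}\n\n${code}`);
    const review = await askGemini(
      `Review this proposed fix for the bug "${bugReport}". Reply APPROVED or list the problems.\n\n${patch}`,
    );
    if (review.includes("APPROVED")) return patch;
    plan = review; // feed the objections back in as the next round's prompt
  }
  return null; // give up after maxRounds
}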
Yeah right now I just dump entire files at it too. After 100-200k tokens I just restart. Not because the LLM is hallucinating (which it is not) but I like the feeling of treating each restart as a new feature checkpoint.
Claude Code works fairly well, but Anthropic has lost the plot on the state of market competition. OpenAI tried to buy Cursor and now Windsurf because they know they need to win market share, Gemini 2.5 Pro is better at coding than their Sonnet models, has huge context and runs on their TPU stack, but somehow Anthropic is expecting people to pay $200 in API costs per functional PR to vibe code. Ok.
The issue with many of these tips is that they require you to spend way more time in claude code (or codex cli, doesn't matter), feed it more info, generate more outputs --> pay more money to the LLM provider.
I find LLM-based tools helpful and use them quite regularly, but not at 20 bucks+, let alone 100+, per month, which is what claude code would require to be used effectively.
what happened to the "$5 is just a cup o' coffee" argument? Are we heading towards the everything-for-$100 land?
On a serious note, there is no clear evidence that any of the LLM-based code assistants will contribute to saving developer time. Depends on the phase of the project you are in and on a multitude of factors.
I'm a skeptical adopter of new tech. But I cut my teeth on LLMs a couple years ago when I was dropped into a project using an older framework I wasn't familiar with. Even back then, LLMs helped me a ton to get familiar with the project and use best practices when I wasn't sure what those were.
And that was just copy & paste into ChatGPT.
I don't know about assistants or project integration. But, in my experience, LLMs are a great tool to have and worth learning how to use well, for you. And I think that's the key part. Some people like heavily integrated IDEs, some people prefer a more minimal approach with VS Code or Vim.
I think LLMs are going to be similar. Some people are going to want full integration and some are just going to want minimal interface, context, and edits. It's going to be up to the dev to figure out what works best for him or her.
While I agree, I find the early phases to be the least productive use of my time, as it's often a lot of boilerplate and decisions that require thought but turn out to matter very little. Paying $100 to bootstrap to midlife on a new idea seems absurdly cheap given my hourly.
So sad that people are happy to spend $100 per day on a tool like this, yet we're so unlikely (in general) to pay $5 to the author of an article/blog post that possibly saved us the same amount of time.
(I'm not judging a specific person here, this is more of a broad commentary regarding our relationship/sense of responsibility/entitlement/lack of empathy when it comes to supporting other people's work when it helps us)
No, it doesn't. If you are still looking for product market fit, it is just cost.
After 2 years of GPT4 release, we can safely say that LLMs don't make finding PMF that much easier nor improve general quality/UX of products, as we still see a general enshittification trend.
If this spending was really game-changing, ChatGPT frontend/apps wouldn't be so bad after so long.
Finding product market fit is a human directional issue, and LLMs absolutely can help speed up iteration time here. I've built two RoR MVPs for small hobby projects, spending ~$75 in Claude Code to make something in a day that would have previously taken me a month plus. Again, absolutely bizarre that people can't see the value here, even as these tools are still working through their kinks.
The most interesting part of this article for me was:
> Have multiple checkouts of your repo
I don’t know why this never occurred to me, probably because it feels wrong to have multiple checkouts, but it makes sense so that you can keep each AI instance running at full speed. While LLMs are fast, just waiting for an instance of Aider or Claude Code to finish something is one of the annoying parts.
Also, I had never heard of git worktrees, that’s pretty interesting as well and seems like a good way to accomplish effectively having multiple checkouts.
I've never used Claude Code or other CLI-based agents. I use Cursor a lot to pair program, letting the AI do the majority of the work but actively guiding.
How do you keep tabs on multiple agents doing multiple things in a codebase? Is the end deliverable there a bunch of MRs to review later? Or is it a more YOLO approach of trusting the agents to write the code and deploy with no human in the loop?
Multiple terminal sessions. Well written prompts and CLAUDE.md files.
I like to start by describing the problem and having it do research into what it should do, writing that to a markdown file, then getting it to implement the changes. You can keep tabs on a few different tasks at a time, and you don't need to approve YOLO mode for writes, which keeps the cost down and keeps the model from going wild.
What's the Gemini equivalent of Claude Code and OpenAI's Codex? I've found projects like reugn/gemini-cli, but Gemini Code Assist seems limited to VS Code?
There's Aider, Plandex and Goose, all of which let you chose various providers and models. Aider also has a well known benchmark[0] that you can check out to help select models.
I think the differences between Aider / Plandex are more obvious. However I'd love to see a comparison breakdown between Plandex and Goose which seem to occupy a very similar space.
I would also like to know — I think people are using Cursor/Windsurf/Roo(Cline) for IDEs that let you pick the model, but I don't know of a CLI agentic editor that lets you use arbitrary models.
Hey, I'm the creator of Plandex (https://github.com/plandex-ai/plandex), which takes a more agentic approach than aider, and combines models from Anthropic, OpenAI, and Google. You might find it interesting.
Isn't it bad that every model company is making its own version of the IDE-level tool?
Wasn't it clearly bad when Facebook would get real close to buying another company... then decide, naw, we've got developers out the ass, let's just steal the idea and put them out of business?
I mostly work in neovim, but I'll open cursor to write boilerplate code. I'd love to use something cli based like Claude Code or Codex, but neither of them implement semantic indexing (vector embeddings) the way Cursor does. It should be possible to implement an MCP server which does this, but I haven't found a good one.
I use a small plugin I’ve written my self to interact with Claude, Gemini 2.5 pro or GPT. I’ve not really seen the need for semantic searching yet. Instead I’ve given the LLM access to LSP symbol search, grep and the ability to add files to the conversation. It’s been working well for my use cases but I’ve never tried Cursor so I can’t comment on how it compares. I’m sure it’s not as smooth though. I’ve tried some of the more common Neovim plugins and for me it works better, but the preference here is very personal. If you want to try it out it’s here: https://github.com/isaksamsten/sia.nvim
Tool-calling agents with search tools do very well at information retrieval tasks in codebases. They are slower and more expensive than good RAG (if you amortize the RAG index over many operations), but they're incredibly versatile and excel in many cases where RAG would fall down. Why do you think you need semantic indexing?
Unfortunately I can only give an anecdotal answer here, but I get better results from Cursor than the alternatives. The semantic index is the main difference, so I assume that's what's giving it the edge.
Is it a very large codebase? Anything else distinctive about it? Are you often asking high-level/conceptual questions? Those are the questions that would help me understand why you might be seeing better results with RAG.
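To make "tool-calling agent with a search tool" concrete: you hand the model a grep-shaped tool and let it decide what to search for, instead of pre-retrieving chunks from a semantic index. A bare-bones sketch using the Messages API tool-use format - the tool name, schema, and the ripgrep shell-out are my own illustrative choices:

import { execSync } from "node:child_process";

const searchTool = {
  name: "search_repo",
  description: "Run a ripgrep regex search over the repository and return matching lines with file paths",
  input_schema: {
    type: "object",
    properties: { pattern: { type: "string", description: "Regex to search for" } },
    required: ["pattern"],
  },
};

async function callClaude(messages) {
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-3-7-sonnet-latest", // placeholder model id
      max_tokens: 2048,
      tools: [searchTool],
      messages,
    }),
  });
  return res.json();
}

// Agent loop: keep going while the model asks to search, then return its final text answer.
async function answer(question) {
  const messages = [{ role: "user", content: question }];
  for (let turn = 0; turn < 10; turn++) {
    const reply = await callClaude(messages);
    const toolUse = reply.content.find((block) => block.type === "tool_use");
    if (!toolUse) return reply.content.find((block) => block.type === "text")?.text;
    let hits;
    try {
      // Run the search locally; don't pass unsanitized input to a shell outside a sandbox.
      hits = execSync(`rg -n ${JSON.stringify(toolUse.input.pattern)} .`, { encoding: "utf8" });
    } catch {
      hits = "(no matches)"; // rg exits non-zero when nothing matches
    }
    messages.push({ role: "assistant", content: reply.content });
    messages.push({
      role: "user",
      content: [
        // Cap the result so a broad search doesn't blow up the context window.
        { type: "tool_result", tool_use_id: toolUse.id, content: hits.slice(0, 10000) },
      ],
    });
  }
}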
I use Claude Code. I read the discussion here, and given the criticism, proceeded to try some of the other solutions that people recommended.
After spending a couple of hours trying to get aider and plandex to run (and then with Google Gemini 2.5 pro), my conclusion is that these tools have a long way to go until they are usable. The breakage is all over the place. Sure, there is promise, but today I simply can't get them to work reasonably. And my time is expensive.
Claude Code just works. I run it (even in a slightly unsupported way, in a Docker container on my mac) and it works. It does stuff.
PS: what is it with all "modern" tools asking you to "curl somewhere.com/somescript.sh | bash". Seriously? Ship it in a docker container if you can't manage your dependencies.
I recently wrote a big blog post on my experience spending about $200 with Claude Code to "vibecode" some major feature enhancements for my image gallery site mood.site
I'm wondering how much of the techniques described in this blog post can be used in an IDE like Windsurf or Cursor with Claude Sonnet?
My 2 cents on value for money and effectiveness of Claude vs Gemini for coding:
I've been using Windsurf, VS Code and the new Firebase Studio. The Windsurf subscription allowance for $15 per month seems adequate for reasonable every day use. I find Claude Sonnet 3.7 performs better for me than Gemini 2.5 pro experimental.
I still like VS Code and its way of doing things, you can do a lot with the standard free plan.
With Firebase Studio, my take is that it should be good for building and deploying simple things that don't require much developer handholding.
Yep I learned this the hard way after racking up big bills just using Sonnet 3.7 in my IDE. Gemini is just as good (and not nearly as willing to agree with every dumb thing I say) and it’s way cheaper.
Your gemini pricing is for flash, not pro. Also, claude uses prompt caching and gemini currently does not. The pricing isn't super straightforward because of that.
I don't see how this is a best practice then. It seems like they are saying "Spend money on something easy to do, but can be catastrophic if the AI screws it up."
I have not yet used Claude Code personally, but I believe that Cursor and Windsurf both optimize token usage by limiting how much of your code each prompt analyzes.
With Claude Code, all bets are off there. You get a better understanding of your code in each prompt, and the bill can rack up, what, 50x faster?
If I got really stuck on a problem involving many lines of code, I could see myself spinning up Claude Code for that one issue, then quickly going back to Windsurf.
The only problem is that this loss is permanent! As far as I can tell, there's no way to go back to the old conversation after a `/clear`.
I had one session last week where Claude Code seemed to have become amazingly capable and was implementing entire new features and fixing bugs in one-shot, and then I ran `/clear` (by accident no less) and it suddenly became very dumb.
You can ask it to store its current context to a file, review the file, ask it to emphasize or de-emphasize things based on your review, and then use `/clear`.
Then, you can edit the file at your leisure if you want to.
And when you want to load that context back in, ask it to read the file.
Works better than `/compact`, and is a lot cheaper.
Edit: It so happens I had a Claude Code session open in my Terminal, so I asked it:
Save your current context to a file.
Claude produced a 91 line md file... surely that's not the whole of its context? This was a reasonably lengthy conversation in which the AI implemented a new feature.
Apologies for the late reply. My kids demanded my attention yesterday.
It doesn't seem to have included any points on style or workflow in the context. Most of my context documents end up including the following information:
1. I want the agent to treat git commits as checkpoints so that we can revert really silly changes it makes.
2. I want it to keep on running build/tests on the code to be sure it isn't just going completely off the rails.
3. I want it to refrain from adding low signal comments to the code. And not use emojis.
4. I want it to be honest in its dealings with me.
It goes on a bit from there. I suspect the reason that the models end up including that information in the context documents they dump in our sessions is that I give them such strong (and strongly worded) feedback on these topics.
As an alternative, I wonder what would happen if you just told it what was missing from the context and asked it to re-dump the context to file.
But none of this is really Claude Code's internal context, right? It's a summary. I could see using it as an alternative to /compact but not to undo a /clear.
Whatever the internal state of Claude Code is, it's lost as soon as you /clear or close the Terminal window. You can't even experiment with a different prompt and then--if you don't like the prompt--go back to the original conversation, because pressing esc to branch the conversation loses the original branch.
I'm excited for the improvements they've had recently but I have better luck with Cline in regular vs code, as well as cursor.
I've tried Claude code this week and I really didn't like it - Claude did an okay job but was insistent on deleting some shit and hard coding a check instead of an actual conditional. It got the feature done in about $3, but I didn't really like the user experience and it didn't feel any better than using 3.7 in cursor.
If anyone from Anthropic is reading this, your billing for Claude Code is hostile to your users.
Why doesn’t Claude Code usage count against the same plan that usage of Claude.ai and Claude Desktop are billed against?
I upgraded to the $200/month plan because I really like Claude Code but then was so annoyed to find that this upgrade didn’t even apply to my usage of Claude Code.
This would put anthropic in the business of minimizing the context to increase profits, same as Cursor and others who cheap out on context and try to RAG etc. Which would quickly make it worse, so I hope they stay on api pricing
Some base usage included in the plan might be a good balance
I’ve been using codemcp (https://github.com/ezyang/codemcp) to get “most” of the functionality of Claude code (I believe it uses prompts extracted from Claude Code), but using my existing pro plan.
It’s less autonomous, since it’s based on the Claude chat interface, and you need to write “continue” every so often, but it’s nice to save the $$
Claude.ai/Desktop is priced based on average user usage. If you have 1 power user sending 1000 requests per day, and 99 sending 5, many even none, you can afford having a single $10/month plan for everyone to keep things simple.
But every Claude Code user is a 1000 requests per day user, so the economics don't work anymore.
I would accept a higher-priced plan (which covered both my use of Claude.ai/Claude Desktop AND my use of Claude Code).
Anthropic make it seem like Claude Code is a product categorized like Claude Desktop (usage of which gets billed against your Claude.ai plan). This is how it signs off all its commits:
Generated with [Claude Code](https://claude.ai/code)
At the very least, this is misleading. It misled me.
Once I had purchased the $200/month plan, I did some reading and quickly realized that I had been too quick to jump to conclusions. It still left me feeling like they had pulled a fast one on me.
Maybe you can cancel your subscription or charge back?
I think it's just oversight on their part. They have nothing to gain by making people believe they would get Claude Code access through their regular plans, only bad word of mouth.
Well, take that into consideration then. Just make it an option. Instead of getting 1000 requests per day with code, you get 100 on the $10/month plan, and then let users decide whether they want to migrate to a higher tier or continue using the API model.
I am not saying Claude should stop making money, I'm just advocating for giving users the value of getting some Code coverage when you migrate from the basic plan to the pro or max.
I totally agree with this, I would rather have some kind of prediction than using the Claude Code roulette. I would definitely upgrade my plan if I got Claude Code usage included.
I don't know what you guys are on about, but I have been using the free GitHub Copilot in VS Code chats to absolutely crank out new UI features in Vue. All that stuff that makes you groan at the thought of it: more divs, bindings, form validation, a whole new widget...churned out in 30 seconds. Try it live. Works? Keep.
I'm surprised at the complexity and correctness of what it infers from very simple, almost inadequate, prompts.
Claude Pro and other website/desktop subscription plans are subject to usage limits that would make it very difficult to use for Claude Code.
Claude Code uses the API interface and API pricing, and writes and edits code directly on your machine, this is a level past simply interacting with a separate chat bot. It seems a little disingenuous to say it's "hostile" to users, when the reality is yeah, you do pay a bit more for more reliable usage tier, for a task that requires it. It also shows you exactly how much it's spent at any point.
No, that's the whole point: predictability. It's definitely a trade off, but if we could save the work as is we could have the option to continue the iteration elsewhere, or even better, from that point on offer the option to fallback to the current API model.
A nice addition would be having something like /cost but to check where you are in regards to limits.
The writing of edits and code directly on my machine is something that happens on the client side. I don't see why that usage would be subject to anything but one-time billing or how it puts any strain on Anthropic's infrastructure.
$200/month isn't that much. Folks I'm hanging around with are spending $100 USD to $500 USD daily as the new norm as a cost of doing business and remaining competitive. That might seem expensive, but it's cheap... https://ghuntley.com/redlining
$100/day seems reasonable as an upper-percentile spend per programmer. $500/day sounds insane.
A 2.5 hour session with Claude Code costs me somewhere between $15 and $20. Taking $20/2.5 hours as the estimate, $100 would buy me 12.5 hours of programming.
Asking very specific questions to Sonnet 3.7 costs a couple of tenths of a cent every time, and even if you're doing that all day it will never amount to more than maybe a dollar at the end of the day.
On average, one line of, say, JavaScript represents around 7 tokens, which means there are around 140k lines of JS per million tokens.
On Openrouter, Sonnet 3.7 costs are currently:
- $3 / one million input tokens => $100 = 33.3 million input tokens = 420k lines of JS code
- $15 / one million output tokens => $100 = 3.6 million output tokens = 4.6 million lines of JS code
For one developer? In one day? It seems that one can only reach such amounts if the whole codebase is sent again as context with each and every interaction (maybe even with every keystroke for type completion?) -- and that seems incredibly wasteful?
I can't edit the above comment, but there's obviously an error in the math! ;-) Doesn't change the point I was trying to make, but putting this here for the record.
33.3 million input tokens / 7 tokens per loc = 4.8 million locs
3.6 million output tokens / 7 tokens per loc = 515k locs
That's how it works, everything is recomputed again every additional prompt. But it can cache the state of things and restore for a lower fee, and reingesting what was formerly output is cheaper than making new output (serial bottleneck) so sometimes there is a discount there.
I'm waiting for the day this AI bubble bursts since as far as we can tell almost all these AI "providers" are operating at a loss. I wonder if this billing model actually makes profit or if it's still just burning cash in hopes of AGI being around the corner. We have yet to see a product that is useful and affordable enough to justify the cost.
It sounds insane until you drive full agentic loops/evals. I'm currently making a self-compiling compiler; no doubt you'll hear/see about it soon. The other night, I fell asleep and woke up with interface dynamic dispatch using vtables with runtime type information and generic interface support implemented...
Well, you can’t just vibe code something useful into existence despite all the marketing. You have to be very intentional about which libraries it can use, code style etc. Make sure it has the proper specifications and context. And review the code, of course.
Consider L5 at Google: outgoings of $377,797 USD per year just on salary/stock, before fixed overheads such as insurance, leave, issues like ramp-up time and cost of their manager. In the hands of a Staff+ engineer, these tools enable replication of Staff+ engineers and don't sleep. My 2c: the funding for the new norm will come from either compressing the manager layer or engineering layer or both.
These tools and foundational models get better every day, and right now, they enable Staff+ engineers and businesses to have less need for juniors. I suspect there will be [short-to-medium-term] compression. See extended thoughts at https://ghuntley.com/screwed
They do, but I’ve seen a huge slowdown in “getting better” in the last year. I wonder if it’s my perception, or reality. Each model does better on benchmarks but I’m still experiencing at least a 50% failure rate on _basic_ task completion, and that number hasn’t moved higher in many months.
I wonder what will happen first - will companies move to LLMs, or to programmers from abroad (because ultimately, it will be cheaper than using LLMs - you've said ~$500 per day, in Poland ~$1500 will be a good monthly wage - and that still will make us expensive! How about moving to India, then? Nigeria? LATAM countries?)
The minimum wage in Poland is around USD 1240/month. The median wage in Poland is approximately USD 1648/month. Tech salaries are considerably higher than the median.
Idk, maybe for an intern software developer it's a good salary...
The minimum wage is ~$930 after taxes, though; I rarely see people talk here about salary pre-tax, tbh.
~$1200 is what I'd get paid here after a few years of experience; I have never seen an internship offer in my city that paid more than minimum wage (most commonly, it's unpaid).
The industry has tried that, and the problems are well known (timezones, unpredictable outcomes in terms of quality and delivery dates)...
Delivery via LLMs is predictable, fast, and any concerns about outcome [quality] can be programmed away to reject bad outcomes. This form of programming the LLMs has a one-time cost...
Oh but they absolutely do. Have you not used any of this llm tooling? It’s insanely good once you learn how to employ it. I no longer need a front end team, for example. It's that good at TypeScript and React. And the design is even better.
Do you have a link to some of this output? A repo on Github of something you’ve done for fun?
I get a lot of value out of LLMs but when I see people make claims like this I know they aren’t “in the trenches” of software development, or care so little about quality that I can’t relate to their experience.
Usually they’re investors in some bullshit agentic coding tool though.
I will shortly; am building a serious self-compiling compiler rn out of a brand-new esoteric language. Meaning the LLM is able to program itself without training data about the programming language...
Honestly, I don't know what to make of it. Stage 2 is almost complete, and I'm (right now) conducting per-language benchmarks to compare it to the Titans.
Using the proper techniques, Sonnet 3.7 can generate code in the custom lexical/stdlib. So, in my eyes, the path to Stage 3 is unlocked, but it will chew lots and lots of tokens.
Well, virtually every production-grade compiler is self-compiling. Since you bring it up explicitly, I'm wondering what implications of being self-compiling you have in mind?
> Meaning the LLM is able to program itself without training data about the programming language...
Could you clarify this sentence a bit? Does it mean the LLM will code in this new language without training on it beforehand? Or is it going to enable the LLM to program itself to gain some new capabilities?
Frankly, with the advent of coding agents, building a new compiler sounds about as relevant as introducing a new flavor of assembly language; and a new assembly may at least be justified by a new CPU architecture...
The "ultrathink" thing is pretty funny:
> We recommend using the word "think" to trigger extended thinking mode, which gives Claude additional computation time to evaluate alternatives more thoroughly. These specific phrases are mapped directly to increasing levels of thinking budget in the system: "think" < "think hard" < "think harder" < "ultrathink." Each level allocates progressively more thinking budget for Claude to use.
I had a poke around and it's not a feature of the Claude model, it's specific to Claude Code. There's a "megathink" option too - it uses code that looks like this:
Notes on how I found that here: https://simonwillison.net/2025/Apr/19/claude-code-best-pract...Not gonna lie: the "ultrathink" keyword that Sonnet 3.7 with thinking tokens watches for gives me "doubleplusgood" vibes in a hilarious but horrifying way.
A little bit of the old ultrathink with the boys
Shot to everyone around a table, thinking furiously over their glasses of milk
At this point should we get our first knob/slider on a language model... THINK
...as if we're operating this machine like an analog synth
If you use any of the more direct API sandbox/studio UIs, there are already various sliders, temperature (essentially randomness vs. predictability) being the most common.
The consumer-facing chatbot interfaces just hide all that because they're aiming for a non-technical audience.
I use a cheap MIDI controller in this manner - there is even native browser support. Great to get immediate feedback on parameter tweaks
Maybe a Turbo Think button that toggles between Ultrathink and Megathink.
There are already many such adjustable parameters such as temperature and top_k
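Concretely, those knobs are just request parameters when you call the API directly; the chat UIs simply don't surface them. A minimal sketch (model id and values are only examples):

const res = await fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "x-api-key": process.env.ANTHROPIC_API_KEY,
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
  },
  body: JSON.stringify({
    model: "claude-3-7-sonnet-latest", // placeholder model id
    max_tokens: 512,
    temperature: 0.2, // the randomness-vs-predictability slider (0 to 1)
    // top_k and top_p are similar sampling knobs, also plain request fields
    messages: [{ role: "user", content: "Explain what a prompt cache is in one paragraph." }],
  }),
});
console.log((await res.json()).content[0].text);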
That's awesome, and almost certainly an Unreal Tournament reference (when you chain enough kills in short time it moves through a progression that includes "megakill" and "ultrakill").
If they did, they left out the best one: "m-m-m-m-monsterkill"
Surely Anthropic could do a better job implementing dynamic thinking token budgets.
Ultrakill is from Quake :)
It is not. Quake had "Excellent" for two kills in short succession, but nothing else if you chained kills after that.
In aider, instead of “ultrathink” you would say:
Or, shorthand:
Slightly shameless, but easier than typing a longer reply:
https://www.paritybits.me/think-toggles-are-dumb/
https://nilock.github.io/autothink/
LLMs with broad contextual capabilities shouldn't need to be guided in this manor. Claude can tell a trivial task from a complex one just as easily as I can, and should self-adjust, up to thresholds of compute spending, etc.
>LLMs with broad contextual capabilities shouldn't need to be guided in this manor.
I mean finding your way around a manor can be hard, it's easier in an apartment.
Waiting until I can tell it to use "galaxy brain".
What I don't like about Claude Code is that they don't give command line flags for this stuff. Flags would be better documented, and people wouldn't have to discover all this the hard way.
Similarly, I do miss an --add command line flag to manually specify the context (files) during the session. Right now I pretty much end up copy-pasting the relative paths from VSCode and supplying them to Claude. Aider has much better semantics for this stuff.
Maybe I’m not getting this, but you can tab to autocomplete file paths.
You can use English or --add if you want to tell Claude to reference them.
Weird code to have in a modern AI system!
Also, 14 string scans seem a little inefficient!
14 checks through a string is entirely negligible relative to the amount of compute happening. Like a drop of water in the ocean.
Everybody says this all the time. But it compounds. And then our computers struggle with what should be basic websites.
“think hard with a vengeance”
Surprised that "controlling cost" isn't a section in this post. Here's my attempt.
---
If you get the hang of controlling costs, it's much cheaper. If you're exhausting the context window, I would not be surprised if you're seeing high costs.
Be aware of the "cache".
Tell it to read specific files (and only those!), if you don't, it'll read unnecessary files, or repeatedly read sections of files or even search through files.
Avoid letting it search - even halt it. Find / rg can produce thousands of tokens of output depending on the search.
Never edit files manually during a session (that'll bust cache). THIS INCLUDES LINT.
The cache also goes away after 5-15 minutes or so (not sure) - so avoid leaving sessions open and coming back later.
Never use /compact (that'll bust cache, if you need to, you're going back and forth too much or using too many files at once).
Don't let files get too big (it's good hygiene too) to keep the context window sizes smaller.
Have a clear goal in mind and keep sessions to as few messages as possible.
Write / generate markdown files with needed documentation using claude.ai, and save those as files in the repo and tell it to read that file as part of a question.
I'm at about ~$0.50-0.75 for most "tasks" I give it. I'm not a super heavy user, but it definitely helps me (it's like having a super focused smart intern that makes dumb mistakes).
If i need to feed it a ton of docs etc. for some task, it'll be more in the few $, rather than < $1. But I really only do this to try some prototype with a library claude doesn't know about (or is outdated). For hobby stuff, it adds up - totally.
For a company, massively worth it. Insanely cheap productivity boost (if developers are responsible / don't get lazy / don't misuse it).
If I have to be so cautious while using a tool, I might as well write the code myself lol. I’ve used Claude Code extensively and it is one of the best AI IDEs. It just gets things done. The only downside is the cost. I was averaging $35-$40/day. At this cost, I’d rather just use Cursor/Windsurf.
Oh wow. Reading your comment guarantees I'll never use Claude Code.
I use Aider. It's awesome. You explicitly specify the files. You don't have to do work to limit context.
Not having to specify files is a humongous feature for me. Having to remember which file code is in is half the work once you pass a certain codebase size.
Use /context <prompt> to have aider automatically add the files based on the prompt. It's been working well for me.
That sometimes works, sometimes doesn't, and takes 10x the time. Same with Codex. I would have both and switch between them depending on which you feel will get it right.
Yeah, I tried CC out and quickly noticed it was spending $5+ for simple LLM capable tasks. I rarely break $1-2 a session using aider. Aider feels like more of a precision tool. I like having the ability to manually specify.
I do find Claude Code to be really good at exploration though - like checking out a repository I'm unfamiliar with and then asking questions about it.
After switching to Aider, I realized the other tools have been playing elaborate games to choose cheaper models and to limit files and messages in context, both of which increase their bills.
Aider is a great tool. I do love it. But I find I have to do more with it to get the same output as Claude Code (no matter what LLM I used with Aider). Sure it may end up being cheaper per run, but not when my time is factored in. The flip side is I find Aider much easier to limit.
What are those extra things you have to do more of? I only have experience with Aider so I am curious what I am missing here.
With Claude Code you can at least type "/cost" at any point to see how much it's spent, and it will show you when you end a session (with Ctrl+C) too.
The output of /cost looks like this:
Aider shows how much you've spent after each command :-). It shows the cost of the command as well as the session.
>I use Aider. It's awesome.
What do you use for the model? Claude? Gemini? o3?
Currently using Sonnet 3.7, but mostly because I've been too lazy to set up an account with Google.
Get an OpenRouter account and you can play with almost all providers. I was burning money on Claude, then tried V3 (I blocked the DeepSeek provider for being flaky; let the laypeople mock them) and the experimental and GA Gemini models.
Gemini 2.5 pro is my choice
The productivity boost can be so massive that this amount of fiddling to control costs is counterproductive.
Developers tend to seriously underestimate the opportunity cost of their own time.
Hint - it’s many multiples of your total compensation broken down to 40 hour work weeks.
The cost of the task scales with how long it takes, plus or minus.
Substitute “cost” with “time” in the above post and all of the same tips are still valuable.
I don’t do much agentic LLM coding but the speed (or lack thereof) was one of my least favorite parts. Using any tricks that narrow scope, prevent reprocessing files over and over again, or searching through the codebase are all helpful even if you don’t care about the dollar amount.
Hard agree. Whether it's 50 cents or 10 dollars per session, I'm using it to get work done for the sake of quickly completing work that aims to unblock many orders of magnitude more value. But in so far as cheaper correct sessions correlate with sessions where the problem solving was more efficient anyhow, they're fairly solid tips.
I agree, but optimisation often reveals implementation details that help you understand the limits of the current tech. It might not be worth the time, but part of engineering is optimisation and another part is deep understanding of the tech. It is sometimes worth optimising anyway if you want to take the engineering discipline to the next level within yourself.
I myself didn’t think about not running linters, but it makes obvious sense now and gives me insight into how Claude Code works, which I can apply in related engineering work.
Exactly. I've been using the ChatGPT desktop app not because of the model quality but because of the UX. It basically seamlessly integrates with my IDEs (IntelliJ and VS Code). Mostly I just do stuff like select a few lines, hit option+shift+1, and say something like "fix this". Nice short prompt and I get the answer relatively quickly. Option+shift+1 opens ChatGPT with the open file already added to the context. It sees what lines are selected. And it also sees the output of any test runs in the consoles. So just me saying "fix this" now has a rich context that I don't need to micromanage.
Mostly I just use the 4o model instead of the newer better models because it is faster. It's good enough mostly and I prefer getting a good enough answer quickly than the perfect answer after a few minutes. Mostly what I ask is not rocket science so perfect is the enemy of good here. I rarely have to escalate to better models. The reasoning models are annoyingly slow. Especially when they go down the wrong track, which happens a lot.
And my cost is a predictable $20/month. The downside is that the scope of what I can ask is more limited. I'd like it to be able to "see" my whole code base instead of just one file, and for me to not have to micromanage what the model looks at. Claude can do that if you don't care about money. But if you do, you are basically micromanaging context. That sounds like monkey work that somebody should automate. And it shouldn't require an Einstein-sized artificial brain to do that.
There must be people that are experimenting with using locally running more limited AI models to do all the micromanaging that then escalate to remote models as needed. That's more or less what Apple pitched for Apple AI at some point. Sounds like a good path forward. I'd be curious to learn about coding tools that do something like that.
In terms of cost, I don't actually think it's unreasonable to spend a few hundred dollars per month on this stuff. But I question the added value over the $20 I'm spending. I don't think the improvement is 20x better; more like 1.5x. And I don't like the unpredictability of this and having to think about how expensive a question is going to be.
I think a lot of the short term improvement is going to be a mix of UX and predictable cost. Currently the tools are still very clunky and a bit dumb. The competition is going to be about predictable speed, cost and quality. There's a lot of room for improvement here.
If this is true, why isn't our compensation scaling with the increases in productivity?
It usually does, just with a time delay and a strict condition that the firm you work at can actually commercialize your productivity. Apply your systems thinking skills to compensation and it will all make sense.
It's interesting that this is a problem for people because I have never spent more than about $0.50 on a task with Claude Code. I have pretty good code hygiene and I tell Claude what to do with clear instructions and guidelines, and Claude does it. I will usually go through a few revisions and then just change anything myself if I find it not quite working. It's exactly like having an eager intern.
I don't think about controlling cost because I price my time at US$40/h and virtually all models are cheaper than that (with the exception of o1 or Gemini 2.5 pro).
If I spend $2 instead of $0.50 on a session but I had to spend 6 minutes thinking about context, I haven't gained any money.
Important to remind people this is only true if you have a profitable product, otherwise you’re spending money you haven’t earned.
If your expectation is to produce the same amount of output, you could argue when paying for AI tools, you're choosing to spend money to gain free time.
4 hours coding project X or 3 hours and a short hike with your partner / friends etc
If what I'm doing doesn't have a positive expected value, the correct move isn't to use inferior dev tooling to save money, it's to stop working on it entirely.
There might be value but you might not receive any of it. Most salaried employees won't see returns.
Come on, every hobby has negative expected value. You're not doing it for the money but it still makes sense to save money.
If you do it a bit, it just becomes habit / no extra time or cognitive load.
Correlation or causation aside, the same people I see complain about cost, complain about quality.
It might indicate more tightly controlled sessions may also produce better results.
Or maybe it's just people that tend to complain about one thing, complain about another.
I assume they use a conversation, so if you compress the prompt immediately you should only break cache once, and still hit cache on subsequent prompts?
So instead of: Write, Hit, Hit, Hit
it's: Write, Write, Hit, Hit, Hit
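For what it's worth, here's roughly how that write/hit pattern shows up at the API level. A minimal sketch with the Anthropic Python SDK (the model alias and the file read are placeholders; Claude Code manages all of this internally, so this is only to illustrate why keeping the prefix byte-identical matters):
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

stable_prefix = open("CLAUDE.md").read()  # anything that must not change between turns

resp = client.messages.create(
    model="claude-3-7-sonnet-latest",  # placeholder alias
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": stable_prefix,
            # Marks everything up to here as a cacheable prefix: the first call
            # pays the cache write, later calls with the same prefix pay the much
            # cheaper cache read. Editing the file in between busts the cache.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Where is request auth handled?"}],
)

# Shows whether this call wrote to or read from the cache.
print(resp.usage.cache_creation_input_tokens, resp.usage.cache_read_input_tokens)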
My attempt is: do not use Claude Code at all, it is a terrible tool. It is bad at almost everything, starting with making simple edits to files.
And most of all, Claude Code is overeager to start messing with your code and run up unnecessary $$ instead of making a sensible plan.
This isn't a problem with Claude Sonnet - it is a fundamental problem with Claude Code.
I pretty much one shot a scraper from an old Joomla site with 200+ articles to a new WP site, including all users and assets, and converting all the PDFs to articles. It cost me like $3 in tokens.
I guess the question then is: can't VS Code Copilot do the same for a fixed $20/month? It even has access to all SOTA models like Claude 3.7, Gemini 2.5 Pro and GPT o3
Vscode’s agent mode in copilot (even in the insider’s nightly) is a bit rough in my experience: lots of 500 errors, stalls, and outright failures to follow tasks (as if there’s a mismatch between what the ui says it will include in context vs what gets fed to the LLM).
I would have thought so, but somehow no. I have a cursor subscription with access to all of those models, and I still consistently get better results from claude code.
I haven't tried copilot. Mostly because I don't use VSCode, I use jetbrains ides. How do they provide Claude 3.7 for $20/mo with unlimited usage?
Copilot has a pretty good plugin for JetBrains IDEs!
Though their own AI Assistant and Junie might be equally good choices there too.
By providing a UI bad enough that you don't use it so much.
was it a wget call feeding into html2pdf?
no it's a few hundred lines of python to parse weird and inconsistent HTML into json files and CSV files, and then a sync script that can call the WP API to create all the authors as needed, update the articles, and migrate the images
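For a flavor of what that sync half can look like, here is a rough sketch against the standard WordPress REST API; the site URL, application password, and articles.json layout are invented for illustration and are not the poster's actual script:
import json
import requests

WP = "https://example.com/wp-json/wp/v2"   # hypothetical target site
AUTH = ("admin", "abcd efgh ijkl mnop")    # WordPress application password (made up)

def ensure_author(name, email):
    # Return the WP user id for an author, creating the user if needed.
    found = requests.get(f"{WP}/users", params={"search": email, "context": "edit"}, auth=AUTH).json()
    if found:
        return found[0]["id"]
    r = requests.post(
        f"{WP}/users",
        json={"username": email, "email": email, "name": name, "password": "change-me"},
        auth=AUTH,
    )
    r.raise_for_status()
    return r.json()["id"]

def push_article(article):
    # Create one post from an article parsed out of the old Joomla HTML.
    payload = {
        "title": article["title"],
        "content": article["html"],
        "status": "publish",
        "author": ensure_author(article["author_name"], article["author_email"]),
    }
    requests.post(f"{WP}/posts", json=payload, auth=AUTH).raise_for_status()

if __name__ == "__main__":
    for a in json.load(open("articles.json")):  # output of the scraping step
        push_article(a)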
Plumbing to pipe shit from one sewer to another.
Yep, don't wanna spend more of my life doing that than I have to!
Never edit files manually during a session (that'll bust cache). THIS INCLUDES LINT
Yesterday I gave up and disabled my format-on-save config within VSCode. It was burning way too many tokens with unnecessary file reads after failed diffs. The LLMs still have a decent number of failed diffs, but it helps a lot.
If I have to spend this much time thinking about any of this, congratulations, you’ve designed a product with a terrible UI.
Some tools take more effort to hold properly than others. I'm not saying there's not a lot of room for improvement - or that the ux couldn't hold the users hand more to force things like this in some "assisted mode" but at the end of the day, it's a thin, useful wrapper around an llm, and llms require effort to use effectively.
I definitely get value out of it- more than any other tool like it that I've tried.
Think about what you would do in an unfamiliar project with no context and the ticket
"please fix the authorization bug in /api/users/:id".
You'd start by grepping the code base and trying to understand it.
Compare that to, "fix the permission in src/controllers/users.ts in the function `getById`. We need to check the user in the JWT is the same user that is being requested"
So, AIs are overeager junior developers at best, and not the magical programmer replacements they are advertised as.
Let's split the difference and call them "magical overeager junior developer replacements".
On a shorter timeline than you'd think none of working with these tools will look like this.
You'll be prompting and evaluating and iterating entirely finished pieces of software and be able to see multiple attempts at each solve at once, none of this deep in the weeds fixing a bug stuff.
We're rapidly approaching a world where a lot of software will be being made without an engineer hire at all, maybe not the hardest most complex or novel software but a lot of software that previously required a team of 3-15 wont have a single dev.
My current estimate is mid 2026
my current estimate is 2030. because we can barely get a JS/TS application to compile after a year of dependency updates.
our current popular stack is quicksand.
unless we're talking about .net core, java, Django and more of these stable platforms.
> So, AIs are overeager junior developers at best, and not the magical programmer replacements they are advertised as.
This may be a quick quip or a rant. But the things we say have a way of reinforcing how we think. So I suggest refining until what we say cuts to the core of the matter. The claim above is a false dichotomy. Let's put aside advertisements and hype. Trying to map between AI capabilities and human ones is complicated. There is high quality writing on this to be found. I recommend reading literature reviews on evals.
[flagged]
Don’t be a dismissive dick; that’s not appropriate for this forum. The above post is clearly trying to engage thoughtfully and offers genuinely good advice.
The above post produces some vague philosophical statements, and equally vague "just google it" claims.
I’m thinking you might be a kind of person that requires very direct feedback. Your flagged comment was unkind and unhelpful. Your follow-up response seems to suggest that you were justified in being rude?
You also mischaracterize my comment two levels up. It didn’t wave you away by saying “just google it”. It said — perhaps not directly enough — that your comment was off track and gave you some ideas to consider and directions to explore.
> There is high quality writing on this to be found. I recommend reading literature reviews on evals.
This is, quite literally, "just google it".
And yes, I prefer direct feedback, not vague philosophical and pseudo-philosophical statements and vague references. I'm sure there's high quality writing to be found on this, too.
We have very different ideas of what "literal" means. You _interpreted_ what I wrote as "just Google it". I didn't say those words verbatim _nor_ do I mean that. Use a search engine if you want to find some high-quality papers. Or use Google Scholar. Or go straight to Arxiv. Or ask people on a forum.
> not vague philosophical and pseudo-philosophical statements and vague references
If you stop being so uncharitable, more people might be inclined to engage you. Try to interpret what I wrote as constructive criticism.
Shall we get back to the object level? You wrote:
> AIs are overeager junior developers at best
Again, I'm saying this isn't a good framing. I'm asking you to consider you might be wrong. You don't need to hunker down. You don't need to counter-attack. Instead, you could do more reading and research.
> We have very different ideas of what "literal" means. You _interpreted_ what I wrote as "just Google it". I didn't say those words verbatim _nor_ do I mean that. Use a search engine if you want to find some high-quality papers. Or use Google Scholar. Or go straight to Arxiv. Or ask people on a forum.
Aka "I will make some vague references to some literature, go Google it"
> Instead, you could do more reading and research.
Instead of vague "just google it", and vague ad hominems you could actually provide constructive feedback.
The grandparent is talking about how to control cost by focusing the tool. My response was to a comment about how that takes too much thinking.
If you give a junior an overly broad prompt, they are going to have to do a ton of searching and reading to find out what they need to do. If you give them specific instructions, including files, they are more likely to get it right.
I never said they were replacements. At best, they're tools that are incredibly effective when used on the correct type of problem with the right type of prompt.
> If you give a junior an overly broad prompt, they are going to have to do a ton of
> they're tools that are incredibly effective when used on the correct type of problem with the right type of prompt.
So, a junior developer who has to be told exactly what to do.
As for the "correct type of problem with the right type of prompt", what exactly are those?
As of April 2025. The pace is so fast that it will overtake seniors within years maybe months.
That's been said since at least 2021 (the release date for GitHub Copilot). I think you're overestimating the pace.
overtake ceo by 2026
I have been quite skeptical of using AI tools and my experiences using them have been frustrating for developing software, but power tools usually come with a learning curve, while a "good product" with a clean, simplified interface often results in reduced capability.
VIM, Emacs and Excel are obvious power tools which may require you to think but often produce unrivalled productivity for power users
So I don't think the verdict that the product has a bad UI is fair. Natural language interfaces are such a step up from old-school APIs with countless flags and parameters.
Mh. Like, I'm deeply impressed what these AI assistants can do by now. But, the list in the parent comment there is very similar to my mental check-list of pair-programming / pair-admin'ing with less experienced people.
I guess "context length" in AIs is what I intuitively tracked with people already. It can be a struggle to connect the Zabbix alert, the ticket and the situation on the system already, even if you don't track down all the zabbix code and scripts. And then we throw in Ansible configuring the thing, and then the business requriements by more, or less controlled dev-teams. And then you realize dev is controlled by impossible sales-terms.
These are scope -- or I guess context -- expansions that cause people to struggle.
It's fundamentally hard. If you have an easy solution, you can go make a easy few billion dollars.
GitHub copilot follows your context perfectly. I don't have to tell it anything about files. I tried this initially and it just screwed up the results.
> GitHub copilot follows your context perfectly. I don't have to tell it anything about files. I tried this initially and it just screwed up the results.
Just to make sure we're on the same page. There are two things in play. First, a language model's ability to know what file you are referring to. Second, an assistant's ability to make sure the right file is in the context window. In your experience, how does Claude Code compare to Copilot w.r.t (1) and (2)?
I've developed a new mental model of the LLM codebase automation solutions. These are effectively identical to outsourcing your product to someone like Infosys. From an information theory perspective, you need to communicate approximately the same amount of things in either case.
Tweaking claude.md files until the desired result is achieved is similar to a back and forth email chain with the contractor. The difference being that the contractor can be held accountable in our human legal system and can be made to follow their "prompt" very strictly. The LLM has its own advantages, but they seem to be a subset since the human contractor can also utilize an LLM.
Those who get a lot of uplift out of the models are almost certainly using them in a cybernetic manner wherein the model is an integral part of an expert's thinking loop regarding the program/problem. Defining a pile of policies and having the LLM apply them to a codebase automatically is a significantly less impactful use of the technology than having a skilled human developer leverage it for immediate questions and code snippets as part of their normal iterative development flow.
If you've got so much code that you need to automate eyeballs over it, you are probably in a death spiral already. The LLM doesn't care about the terrain warnings. It can't "pull up".
We, mere humans, communicate our needs poorly, and undervisualize until we see concrete results. This is the state of us.
Faced with us as a client, the LLM has infinite patience at linear but marginal cost (relative to your thinking/design time cost, and the value of instant iteration as you realize what you meant to picture and say).
With offshoring, telling them they're getting it wrong is not just horrifically slow thanks to comms and comprehension latency, it makes you a problem client, until soon you'll find the do-over cost becomes neither linear nor marginal.
Don't sleep on the power of small fast iterations (not vibes, concrete iterations), with an LLM tool that commits as you go and can roll back both code and mental model when you're down a garden path.
Intriguing perspective! Could you elaborate on this with another paragraph or two?
> We humans undervisualize until we see concrete results.
> > We humans undervisualize until we see concrete results.
> Could you elaborate on this with another paragraph or two?
Volunteer as a client-facing PdM at a digital agency for a week*, you'll be able to elaborate with a book.
* Well, long enough to try to iterate a client instruction based deliverable.
This matches well with my experience so far. It’s why the chat interface has remained my preference over autocomplete in an IDE.
The benefit of doing it like this is that I also get to learn from the LLM. It will surprise me from time to time about things I didn't know and it gives me a chance to learn and get better as well.
> These are effectively identical to outsourcing your product to someone like Infosys.
But in my experience, the user has to be better than an Infosys employee to know how to convey the task to the LLM and then verify iteratively.
So it's more like an experienced engineer outsourcing work to a service-company engineer.
That’s exactly what they were saying.
So I have been using Cursor a lot more in a vibe code way lately and I have been coming across what a lot of people report: sometimes the model will rewrite perfectly working code that I didn't ask it to touch and break it.
In most cases, it is because I am asking the model to do too much at once. Which is fine, I am learning the right level of abstraction/instruction where the model is effective consistently.
But when I read these best practices, I can't help but think of the cost. The multiple CLAUDE.md files, the files of context, the urls to documentation, the planning steps, the tests. And then the iteration on the code until it passes the test, then fixing up linter errors, then running an adversarial model as a code review, then generating the PR.
It makes me want to find a way to work at Anthropic so I can learn to do all of that without spending $100 per PR. Each of the steps in that last paragraph is an expensive API call for us ISV and each requires experimentation to get the right level of abstraction/instruction.
I want to advocate to Anthropic for a scholarship program for devs (I'd volunteer, lol) where they give credits to Claude in exchange for public usage. This would be structured similar to creator programs for image/audio/video gen-ai companies (e.g. runway, kling, midjourney) where they bring on heavy users that also post to social media (e.g. X, TikTok, Twitch) and they get heavily discounted (or even free) usage in exchange for promoting the product.
Why do you think it's supposed to be cheap? Developers are expensive. Claude doesn't have to be cheap to make software development quicker and cheaper. It just has to be cheaper than you.
There are ways to use LLMs cheaply, but it will always be expensive to get the most out of them. In fact, the top end will only get more and more costly as the lengths of tasks AIs can successfully complete grows.
I am not implying in any sense a value judgement on cost. I'm stating my emotions at the realization of the cost and how that affects my ability to use the available tools in my own education.
It would be no different than me saying "it sucks university is so expensive, I wish I could afford to go to an expensive college but I don't have a scholarship" and someone then answers: why should it be cheap.
So, allow me the space to express my feelings and propose alternatives, of which scholarships are one example and creative programs are another. Another one I didn't mention would be the same route as universities force now: I could take out a loan. And I could consider it an investment loan with the idea it will pay back either in employment prospects or through the development of an application that earns me money. Other alternatives would be finding employment at a company willing to invest that $100/day through me, the limit of that alternative being working at an actual foundational model company for presumably unlimited usage.
And of course, I could focus my personal education on squeezing the most value for the least cost. But I believe the balance point between slightly useful and completely transformative usages levels is probably at a higher cost level than I can reasonably afford as an independent.
> It just has to be cheaper than you
There's an ocean of B2B SaaS services that would save customers money compared to building poor imitations in-house. Despite the Joel Test (almost 25 years old! crazy...) asking whether you buy your developers the best tools that money can buy, because they're almost invariably cheaper than developer salaries, the fact remains that most companies treat salaries as a fixed cost and everything else threatens the limited budget they have.
Anybody who has ever tried to sell developer tooling knows, you're competing with free/open-source solutions, and it aint a fair fight.
> It just has to be cheaper than you.
Not when you need an SWE in order for it to work successfully.
general public, ceo, vc consensus is that - if it can understand english, anyone can do it. crazy
> So I have been using Cursor a lot more in a vibe code way lately and I have been coming across what a lot of people report: sometimes the model will rewrite perfectly working code that I didn't ask it to touch and break it.
I don't find this particularly problematic because I can quickly see the unnecessary changes in git and revert them.
Like, I guess it would be nice if I didn't have to do that, but compared to the value I'm getting it's not a big deal.
I agree with this in the general sense but of course I would like to minimize the thrash.
I have become obsessive about doing git commits in the way I used to obsess over Ctrl-S before the days of source control. As soon as I get to a point I am happy, I get the LLM to do a check-point check in so I can minimize the cost of doing a full directory revert.
But from a time and cost perspective, I could be doing much better. I've internalized the idea that when the LLM goes off the rails it was my fault. I should have prompted it better. So I am now consider: how do I get better faster? And the answer is I do it as much as I can to learn.
I don't just want to whine about the process. I want to use that frustration to help me improve, while avoiding going bankrupt.
i think this is particularly claude 3.7 behavior - at least in my experience, it's ... eager. overeager. smarter than 3."6" but still, it has little chill. gemini is better; o3 better yet. I'm mostly off claude as a daily driver coding assistant, but it had a really long run - longest so far.
I get the same with gemini, though. o3 is kind of the opposite, under-eager. I cannot really decide on my favorite. So I switch back and forth :)
That's why I like Aider.
You can protect your files in a non-AI way: by simply not giving write access to Aider.
Also, apparently Aider is a bit more economic with tokens than other tools.
I haven't used Aider yet, but I see it show up on HN frequently recently (the last couple of days specifically).
I am hesitant because I am paying for Cursor now and I get a lot of model usage included within that monthly cost. I'm cheap, perhaps to a fault even when I could afford it, and I hate the idea of spending twice when spending once is usually enough. So while Aider is potentially cheaper than Claude Code, it is still more than what I am already paying.
I would appreciate any comments on people who have made the switch from Cursor to Aider. Are you paying more/less? If you are paying more, do you feel the added value is worth the additional cost? If you are paying less, do you feel you are getting less, the same or even more?
With Aider you pay API fees only. You can get simple tasks done for a few dollars. I suggest budgeting $20 or so and giving it a go.
As an Aider user who has never tried Cursor, I’d also be interested in hearing from any Aider users who are using Cursor and how it compares.
So I feel like a grandpa reading this. I gave Claude Code a solid shot. Had some wins but costs started blowing up. I switched to Gemini AI where I only upload files I want it to work on and make sure to refactor often so modularity remains fairly high. It's an amazing experience. If this is any measure - I've been averaging about 5-6 "small features" per 10k tokens. And I totally suck at fe coding!! The other interesting aspect of doing it this way is being able to break up problems and concerns. For example, in this case I only worked on fe without any backend and fleshed it out before starting on a backend.
A combination that works nicely to solve bugs is: 1) have Gemini analyze the code and the problem, 2) ask it to create a prompt for Claude to fix the problem, 3) give Claude the markdown prompt and the code, 4) give Gemini the output from Claude to review, 5) repeat if necessary
If you like this plan, you can do this from the command line:
`aider --model gemini --architect --editor-model claude-3.7` and aider will take care of all the fiddly bits including git commits for you.
right now `aider --model o3 --architect` has the highest rating on the Aider leaderboards, but it costs wayyy more than just --model gemini.
I like Gemini for "architect" roles, it has very good code recall (almost no hallucinations, or none lately), so it can successfully review code edits by Claude. I also find it useful to ground it with Google Search.
Damn that's interesting. How much of the code do you provide? I'm guessing when modularity is high you can give specific files.
Gemini's context is very long, so I can feed it full files. I do the same with Claude, but I may need to start from scratch various times, so Gemini serves as memory (and is also good that Gemini has almost no hallucinations, so it's great as a code reviewer for Claude's edits).
Yeah right now I just dump entire files at it too. After 100-200k tokens I just restart. Not because the LLM is hallucinating (which it is not) but I like the feeling of treating each restart as a new feature checkpoint.
by fe the poster means FE (front-end)
Sorry yes. I should have clarified that.
Or uppercase would have cleared it up.
Claude Code works fairly well, but Anthropic has lost the plot on the state of market competition. OpenAI tried to buy Cursor and now Windsurf because they know they need to win market share, Gemini 2.5 Pro is better at coding than their Sonnet models, has huge context and runs on their TPU stack, but somehow Anthropic is expecting people to pay $200 in API costs per functional PR to vibe code. Ok.
> but somehow Anthropic is expecting people to pay $200 in API costs per functional PR to vibe code. Ok.
Reading the thread, somehow people are paying. It is mind-blowing how, instead of getting cheaper, development just got more expensive for businesses.
$200 per PR is significantly cheaper development than businesses are paying.
In terms of short-term outlay, perhaps. But don't forget to factor in the long-term benefits of having a human team involved.
3.5 was amazing for code, and topped benchmarks for months. It'll take a while for other models to take over that mental space.
The issue with many of these tips is that they require you to spend way more time in Claude Code (or Codex CLI, doesn't matter), feed it more info, and generate more outputs --> pay more money to the LLM provider.
I find LLM-based tools helpful, and use them quite regularly, but not $20+/month helpful, let alone the $100+/month that Claude Code would require to be used effectively.
Interesting, I have $100 days with Claude Code. Beyond effective.
> let alone 100+ per month that claude code would require
I find this argument very bizarre. $100 is the pay for 1-2 hours of developer time. Doesn't it save at least that much time in a whole month?
what happened to the "$5 is just a cup o' coffee" argument? Are we heading towards the everything-for-$100 land?
On a serious note, there is no clear evidence that any of the LLM-based code assistants will contribute to saving developer time. Depends on the phase of the project you are in and on a multitude of factors.
I'm a skeptical adopter of new tech. But I cut my teeth on LLMs a couple years ago when I was dropped into a project using an older framework I wasn't familiar with. Even back then, LLMs helped me a ton to get familiar with the project and use best practices when I wasn't sure what those were.
And that was just copy & paste into ChatGPT.
I don't know about assistants or project integration. But, in my experience, LLMS are a great tool to have and worth learning how to use well, for you. And I think that's the key part. Some people like heavily integrated IDEs, some people prefer a more minimal approach with VS Code or Vim.
I think LLMs are going to be similar. Some people are going to want full integration and some are just going to want minimal interface, context, and edits. It's going to be up to the dev to figure out what works best for him or her.
While I agree, I find the early phases to be the least productive use of my time as it’s often a lot of boilerplate and decisions that require thought but turn to matter very little. Paying $100 to bootstrap to midlife on a new idea seems absurdly cheap given my hourly.
So sad that people are happy to spend $100 a day on a tool like this, and we're so unlikely (in general) to pay $5 to the author of an article/blog post that possibly saved you the same amount of time.
(I'm not judging a specific person here, this is more of a broad commentary regarding our relationship/sense of responsibility/entitlement/lack of empathy when it comes to supporting other people's work when it helps us)
No, it doesn't. If you are still looking for product market fit, it is just cost.
After 2 years of GPT4 release, we can safely say that LLMs don't make finding PMF that much easier nor improve general quality/UX of products, as we still see a general enshittification trend.
If this spending was really game-changing, ChatGPT frontend/apps wouldn't be so bad after so long.
Finding product market fit is a human directional issue, and LLMs absolutely can help speed up iteration time here. I’ve built two RoR MVPs for small hobby projects spending ~$75 in Claude Code to make something in a day that would have previously taken me a month plus. Again, absolutely bizarre that people can’t see the value here, even as these tools are still working through their kinks.
And how much did these two MVPs make in sales?
If they just helped you to ship something valueless, you paid $75 for entertainment, like betting.
You can now do 30 MVPs in a month instead of just one.
Reminds me of https://www.reddit.com/r/comics/comments/d1sm26/behold_the_u...
Enshittification is the result of shitty incentives in the market not because coding is hard
Just a few days ago Cursor saved a lot of developer time by encouraging all the customers to quit using a product.
https://news.ycombinator.com/item?id=43683012
Developer time "saved" indeed ;-)
The most interesting part of this article for me was:
> Have multiple checkouts of your repo
I don’t know why this never occurred to me, probably because it feels wrong to have multiple checkouts, but it makes sense so that you can keep each AI instance running at full speed. While LLMs are fast, this is one of the annoying parts of just waiting for an instance of Aider or Claude Code to finish something.
Also, I had never heard of git worktrees, that’s pretty interesting as well and seems like a good way to accomplish effectively having multiple checkouts.
You might want to consider Claude Squad, https://github.com/smtg-ai/claude-squad which manages all the worktrees for you.
Disclaimer, I haven’t tried it personally - if you do, let us know how you go!
I've never used Claude Code or other CLI-based agents. I use Cursor a lot to pair program, letting the AI do the majority of the work but actively guiding.
How do you keep tabs on multiple agents doing multiple things in a codebase? Is the end deliverable there a bunch of MRs to review later? Or is it a more YOLO approach of trusting the agents to write the code and deploy with no human in the loop?
Multiple terminal sessions. Well written prompts and CLAUDE.md files.
I like to start by describing the problem and having it do research into what it should do, writing to a markdown file, then get it to implement the changes. You can keep tabs on a few different tasks at a time, and you don't need to approve YOLO mode for writes, which keeps the cost down and the model from going wild.
In the same way how you manage a group of brilliant interns.
Really? My LLMs seem entirely uninterested in free snacks and unlimited vacation.
What's the Gemini equivalent of Claude Code and OpenAI's Codex? I've found projects like reugn/gemini-cli, but Gemini Code Assist seems limited to VS Code?
There's Aider, Plandex and Goose, all of which let you chose various providers and models. Aider also has a well known benchmark[0] that you can check out to help select models.
- Aider - https://aider.chat/ | https://github.com/Aider-AI/aider
- Plandex - https://plandex.ai/ | https://github.com/plandex-ai/plandex
- Goose - https://block.github.io/goose/ | https://github.com/block/goose
[0] https://aider.chat/docs/leaderboards/
I've only used aider (which I like quite a bit more than cursor) but I'm curious how it compares to plandex and goose.
Hi, creator of Plandex here. In case it's helpful, I posted a comment listing some of the main differences with aider here: https://news.ycombinator.com/item?id=43728977
I think the differences between Aider / Plandex are more obvious. However I'd love to see a comparison breakdown between Plandex and Goose which seem to occupy a very similar space.
I would also like to know — I think people are using Cursor/Windsurf/Roo(Cline) for IDEs that let you pick the model, but I don't know of a CLI agentic editor that lets you use arbitrary models.
https://aider.chat/
Thanks! Any others, or any thoughts you can share on it?
Hey, I'm the creator of Plandex (https://github.com/plandex-ai/plandex), which takes a more agentic approach than aider, and combines models from Anthropic, OpenAI, and Google. You might find it interesting.
I did a Show HN for it a few days ago: https://news.ycombinator.com/item?id=43710576
Junie from JetBrains was recently released. Not sure what LLM it uses.
Claude
Isn't this bad that every model company is making their own version of the IDE level tool?
Wasn't it clearly bad when facebook would get real close to buying another company... then decide naw, we got developers out the ass lets just steal the idea and put them out of business
I mostly work in neovim, but I'll open cursor to write boilerplate code. I'd love to use something cli based like Claude Code or Codex, but neither of them implement semantic indexing (vector embeddings) the way Cursor does. It should be possible to implement an MCP server which does this, but I haven't found a good one.
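A rough sketch of what such an MCP server could look like, using the official mcp Python SDK (FastMCP). The bag-of-words "embedding" is a toy stand-in so the example stays self-contained; a real version would call an actual embedding model and chunk/cache the index:
import math
import os
import re
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("codebase-semantic-search")

def embed(text):
    # Toy stand-in for a real embedding model: a bag-of-words vector.
    vec = {}
    for w in re.findall(r"[a-zA-Z_]+", text.lower()):
        vec[w] = vec.get(w, 0.0) + 1.0
    return vec

def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Index source files once at startup (a real server would chunk, embed and cache).
INDEX = {}
for root, _, files in os.walk("."):
    for name in files:
        if name.endswith((".py", ".ts", ".js", ".go")):
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                INDEX[path] = embed(f.read())

@mcp.tool()
def semantic_search(query: str, top_k: int = 5) -> list[str]:
    """Return the paths of the files most relevant to the query."""
    q = embed(query)
    ranked = sorted(INDEX.items(), key=lambda kv: -cosine(q, kv[1]))
    return [path for path, _ in ranked[:top_k]]

if __name__ == "__main__":
    mcp.run()  # stdio transport; register this script in your client's MCP config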
I use a small plugin I’ve written myself to interact with Claude, Gemini 2.5 Pro or GPT. I’ve not really seen the need for semantic searching yet. Instead I’ve given the LLM access to LSP symbol search, grep and the ability to add files to the conversation. It’s been working well for my use cases but I’ve never tried Cursor so I can’t comment on how it compares. I’m sure it’s not as smooth though. I’ve tried some of the more common Neovim plugins and for me it works better, but the preference here is very personal. If you want to try it out it’s here: https://github.com/isaksamsten/sia.nvim
Tool-calling agents with search tools do very well at information retrieval tasks in codebases. They are slower and more expensive than good RAG (if you amortize the RAG index over many operations), but they're incredibly versatile and excel in many cases where RAG would fall down. Why do you think you need semantic indexing?
> Why do you think you need semantic indexing?
Unfortunately I can only give an anecdotal answer here, but I get better results from Cursor than the alternatives. The semantic index is the main difference, so I assume that's what's giving it the edge.
Is it a very large codebase? Anything else distinctive about it? Are you often asking high-level/conceptual questions? Those are the questions that would help me understand why you might be seeing better results with RAG.
I'll ask something like "where does X happen?" But "X" isn't mentioned anywhere in the code because the code is a complete nightmare.
Good point. I largely work in Zed -- looks like it had semantic search for a while but is working on a redesign https://github.com/zed-industries/zed/issues/9564
I use Claude Code. I read the discussion here, and given the criticism, proceeded to try some of the other solutions that people recommended.
After spending a couple of hours trying to get aider and plandex to run (and then with Google Gemini 2.5 pro), my conclusion is that these tools have a long way to go until they are usable. The breakage is all over the place. Sure, there is promise, but today I simply can't get them to work reasonably. And my time is expensive.
Claude Code just works. I run it (even in a slightly unsupported way, in a Docker container on my mac) and it works. It does stuff.
PS: what is it with all "modern" tools asking you to "curl somewhere.com/somescript.sh | bash". Seriously? Ship it in a docker container if you can't manage your dependencies.
I recently wrote a big blog post on my experience spending about $200 with Claude Code to "vibecode" some major feature enhancements for my image gallery site mood.site
https://kylekukshtel.com/vibecoding-claude-code-cline-sonnet...
Would definitely recommend reading it for some insight into hands-on experience with the tool.
I'm wondering how much of the techniques described in this blog post can be used in an IDE like Windsurf or Cursor with Claude Sonnet?
My 2 cents on value for money and effectiveness of Claude vs Gemini for coding:
I've been using Windsurf, VS Code and the new Firebase Studio. The Windsurf subscription allowance for $15 per month seems adequate for reasonable every day use. I find Claude Sonnet 3.7 performs better for me than Gemini 2.5 pro experimental.
I still like VS Code and its way of doing things, you can do a lot with the standard free plan.
With Firebase Studio, my take is that it should be good for building and deploying simple things that don't require much developer handholding.
well, the best practice is to use gemini 2.5 pro instead :)
Yep I learned this the hard way after racking up big bills just using Sonnet 3.7 in my IDE. Gemini is just as good (and not nearly as willing to agree with every dumb thing I say) and it’s way cheaper.
> Gemini is ... way cheaper.
Yep. Here are the API pricing numbers for Gemini vs Claude. All per 1M tokens.
1. Gemini 2.5: in: $0.15; out: $0.60 non-thinking or $3.50 thinking
2. Claude 3.7: in: $3.00; out: $15
[1] https://ai.google.dev/gemini-api/docs/pricing [2] https://www.anthropic.com/pricing#api
Your gemini pricing is for flash, not pro. Also, claude uses prompt caching and gemini currently does not. The pricing isn't super straightforward because of that.
I love Claude Code. It just gets the job done where Cursor (even with Claude Sonnet 3.7) will get lost in changing files without results.
Did anyone have equal results with the "unofficial" fork "Anon Kode"? Or with Roo Code with Gemini Pro 2.5?
I’m too scared of the cost to use this.
You can set spend limits https://docs.anthropic.com/en/api/rate-limits
>Use Claude to interact with git
Are they saying Claude needs to do the git interaction in order to work and/or will generate better code if it does?
It doesn’t need to. It’s optional.
I don't see how this is a best practice then. It seems like they are saying "Spend money on something easy to do, but can be catastrophic if the AI screws it up."
Why do people use Claude Code over e.g. Cursor or Windsurf?
I have not yet used Claude Code personally, but I believe that Cursor and Windsurf both optimize token usage by limiting how much of your code each prompt analyzes.
With Claude Code, all bets are off there. You get a better understanding of your code in each prompt, and the bill can rack up, what, 50x faster?
If I got really stuck on a problem involving many lines of code, I could see myself spinning up Claude Code for that one issue, then quickly going back to Windsurf.
This is a pretty desperate post imho.
> Use /clear to keep context focused
The only problem is that this loss is permanent! As far as I can tell, there's no way to go back to the old conversation after a `/clear`.
I had one session last week where Claude Code seemed to have become amazingly capable and was implementing entire new features and fixing bugs in one-shot, and then I ran `/clear` (by accident no less) and it suddenly became very dumb.
You can ask it to store its current context to a file, review the file, ask it to emphasize or de-emphasize things based on your review, and then use `/clear`.
Then, you can edit the file at your leisure if you want to.
And when you want to load that context back in, ask it to read the file.
Works better than `/compact`, and is a lot cheaper.
Neat, thanks, I had no idea!
Edit: It so happens I had a Claude Code session open in my Terminal, so I asked it:
Claude produced a 91 line md file... surely that's not the whole of its context? This was a reasonably lengthy conversation in which the AI implemented a new feature.
What is in the file?
An overview of the project and the features implemented.
Edit: Here's the actual file if you want to see it. https://gist.github.com/Wowfunhappy/e7e178136c47c2589cfa7e5a...
Apologies for the late reply. My kids demanded my attention yesterday.
It doesn't seem to have included any points on style or workflow in the context. Most of my context documents end up including the following information:
1. I want the agent to treat git commits as checkpoints so that we can revert really silly changes it makes.
2. I want it to keep on running build/tests on the code to be sure it isn't just going completely off the rails.
3. I want it to refrain from adding low signal comments to the code. And not use emojis.
4. I want it to be honest in its dealings with me.
It goes on a bit from there. I suspect the reason that the models end up including that information in the context documents they dump in our sessions is that I give them such strong (and strongly worded) feedback on these topics.
As an alternative, I wonder what would happen if you just told it what was missing from the context and asked it to re-dump the context to file.
But none of this is really Claude Code's internal context, right? It's a summary. I could see using it as an alternative to /compact but not to undo a /clear.
Whatever the internal state is of Claude Code, it's lost as soon as you /clear or close the Terminal window. You can't even experiment with a different prompt and then--if you don't like the prompt--go back to the original conversation, because pressing esc to branch the conversation loses the original branch.
Yes, this is true. It's a summary, and cannot really undo a /clear. It is just a directed, cheaper /compact.
Compared to my experience with the free GitHub Copilot in VS Code it sounds like you guys are in a horse and buggy.
I'm excited for the improvements they've had recently but I have better luck with Cline in regular vs code, as well as cursor.
I've tried Claude code this week and I really didn't like it - Claude did an okay job but was insistent on deleting some shit and hard coding a check instead of an actual conditional. It got the feature done in about $3, but I didn't really like the user experience and it didn't feel any better than using 3.7 in cursor.
They've worked to improve this with "memories" (hash symbol to "permanently" record something - you can edit later if you want).
And there's CLAUDE.md. It's like cursorrules. You can also have it modify its own CLAUDE.md.
This is so helpful!
If anyone from Anthropic is reading this, your billing for Claude Code is hostile to your users.
Why doesn’t Claude Code usage count against the same plan that usage of Claude.ai and Claude Desktop are billed against?
I upgraded to the $200/month plan because I really like Claude Code but then was so annoyed to find that this upgrade didn’t even apply to my usage of Claude Code.
So now I’m not using Claude Code so much.
This would put anthropic in the business of minimizing the context to increase profits, same as Cursor and others who cheap out on context and try to RAG etc. Which would quickly make it worse, so I hope they stay on api pricing
Some base usage included in the plan might be a good balance
You know, I wouldn't mind if they just applied the API pricing after Claude Code ran through the plan limits.
It would definitely get me to use it more.
But the Claude Pro plan is almost certainly priced under the assumption that some users will use it below the usage limit.
If everyone used the plan to the limit, the plan would cost the same as the API with usage equal to the limit.
Claude Code and Claude.ai are separate products.
I’ve been using codemcp (https://github.com/ezyang/codemcp) to get “most” of the functionality of Claude code (I believe it uses prompts extracted from Claude Code), but using my existing pro plan.
It’s less autonomous, since it’s based on the Claude chat interface, and you need to write “continue” every so often, but it’s nice to save the $$
Thanks, makes sense that an MCP server that edits files is a workaround to the problem.
Just tried it and it's indeed very good, thanks for mentioning it! :-)
Claude.ai/Desktop is priced based on average user usage. If you have 1 power user sending 1000 requests per day, and 99 sending 5, many even none, you can afford having a single $10/month plan for everyone to keep things simple.
But every Claude Code user is a 1000 requests per day user, so the economics don't work anymore.
I would accept a higher-priced plan (which covered both my use of Claude.ai/Claude Desktop AND my use of Claude Code).
Anthropic make it seem like Claude Code is a product categorized like Claude Desktop (usage of which gets billed against your Claude.ai plan). This is how it signs off all its commits:
At the very least, this is misleading. It misled me.
Once I had purchased the $200/month plan, I did some reading and quickly realized that I had been too quick to jump to conclusions. It still left me feeling like they had pulled a fast one on me.
Maybe you can cancel your subscription or charge back?
I think it's just oversight on their part. They have nothing to gain by making people believe they would get Claude Code access through their regular plans, only bad word of mouth.
To be fair to them, they make it pretty easy to manage the subscription, downgrade it, etc.
This is definitely not malicious on their part. Just bears pointing out.
Well, take that into consideration then. Just make it an option. Instead of getting 1000 requests per day with code, you get 100 on the $10/month plan, and then let users decide whether they want to migrate to a higher tier or continue using the API model.
I am not saying Claude should stop making money, I'm just advocating for giving users the value of getting some Code coverage when you migrate from the basic plan to the pro or max.
Does that make sense?
Their API billing in general is hostile to users. I switched completely to Gemini for this reason and haven’t looked back.
I totally agree with this, I would rather have some kind of prediction than using the Claude Code roulette. I would definitely upgrade my plan if I got Claude Code usage included.
I don't know what you guys are on about, but I have been using the free GitHub Copilot in VS Code chats to absolutely crank out new UI features in Vue. All that stuff that makes you groan at the thought of it: more divs, bindings, form validation, a whole new widget...churned out in 30 seconds. Try it live. Works? Keep.
I'm surprised at the complexity and correctness at which it infers from very simple, almost inadequate, prompts.
Claude Pro and other website/desktop subscription plans are subject to usage limits that would make it very difficult to use for Claude Code.
Claude Code uses the API interface and API pricing, and writes and edits code directly on your machine; this is a level past simply interacting with a separate chat bot. It seems a little disingenuous to say it's "hostile" to users, when the reality is yeah, you do pay a bit more for a more reliable usage tier, for a task that requires it. It also shows you exactly how much it's spent at any point.
> ... usage limits that would make it very difficult to use for Claude Code.
Genuinely interested: how's so?
Well, I think it'd be pretty irritating to see the message "3 messages remaining until 6PM" while you are in the middle of a complex coding task.
Conversely I have to manually do this and monitor the billing instead.
No, that's the whole point: predictability. It's definitely a trade off, but if we could save the work as is we could have the option to continue the iteration elsewhere, or even better, from that point on offer the option to fallback to the current API model.
A nice addition would be something like /cost, but for checking where you stand relative to the usage limits.
The writing of edits and code directly on my machine is something that happens on the client side. I don't see why that usage would be subject to anything but one-time billing or how it puts any strain on Anthropic's infrastructure.
Yeah, tried it for a couple of minutes, $0.31, quickly stopped and moved away.
$200/month isn't that much. Folks I'm hanging around with are spending $100 USD to $500 USD daily as the new norm, as a cost of doing business and remaining competitive. That might seem expensive, but it's cheap... https://ghuntley.com/redlining
When should we expect to see the amazing products these super-competitive businesses are developing?
$100/day seems reasonable as an upper-percentile spend per programmer. $500/day sounds insane.
A 2.5 hour session with Claude Code costs me somewhere between $15 and $20. Taking $20/2.5 hours as the estimate, $100 would buy me 12.5 hours of programming.
Asking very specific questions to Sonnet 3.7 costs a couple of tenths of a cent every time, and even if you're doing that all day it will never amount to more than maybe a dollar at the end of the day.
On average, one line of, say, JavaScript represents around 7 tokens, which means there are around 140k lines of JS per million tokens.
On Openrouter, Sonnet 3.7 costs are currently:
- $3 / one million input tokens => $100 = 33.3 million input tokens = 420k lines of JS code
- $15 / one million output tokens => $100 = 3.6 million output tokens = 4.6 million lines of JS code
For one developer? In one day? It seems that one can only reach such amounts if the whole codebase is sent again as context with each and every interaction (maybe even with every keystroke for type completion?) -- and that seems incredibly wasteful?
I can't edit the above comment, but there's obviously an error in the math! ;-) Doesn't change the point I was trying to make, but putting this here for the record.
33.3 million input tokens / 7 tokens per loc = 4.8 million locs
6.7 million output tokens ($100 at $15 per million) / 7 tokens per loc ≈ 950k locs
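Redoing the arithmetic end to end under the same assumptions (~7 tokens per line of JS, $3 per million input tokens, $15 per million output tokens); a sketch using the prices quoted above:

```ts
// $100 budgets under the assumed OpenRouter Sonnet 3.7 prices.
const TOKENS_PER_LOC = 7;       // assumed average for JavaScript
const INPUT_USD_PER_M = 3;      // $ per million input tokens
const OUTPUT_USD_PER_M = 15;    // $ per million output tokens
const BUDGET = 100;             // USD

const inputTokens = (BUDGET / INPUT_USD_PER_M) * 1_000_000;   // ≈ 33.3M tokens
const outputTokens = (BUDGET / OUTPUT_USD_PER_M) * 1_000_000; // ≈ 6.7M tokens

console.log(Math.round(inputTokens / TOKENS_PER_LOC));  // ≈ 4.8M lines of code read
console.log(Math.round(outputTokens / TOKENS_PER_LOC)); // ≈ 950k lines of code written
```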
That's how it works: everything is recomputed on every additional prompt. But the provider can cache the state of the conversation and restore it for a lower fee, and re-ingesting what was formerly output is cheaper than generating new output (generation is the serial bottleneck), so sometimes there is a discount there.
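For a feel of how much the cache matters, here is a small sketch. The cache-read rate below assumes Anthropic's published ~10% multiplier on the base input price (so roughly $0.30/M for Sonnet 3.7); treat the exact numbers as illustrative, and note that cache writes (billed at a premium) are ignored for brevity.

```ts
// Cost of one additional prompt turn, with and without cache hits.
const USD_PER_M = { input: 3, cacheRead: 0.3, output: 15 }; // assumed Sonnet 3.7 rates

function turnCost(promptTokens: number, cachedTokens: number, outputTokens: number): number {
  const freshTokens = promptTokens - cachedTokens; // tokens billed at the full input rate
  return (
    (freshTokens * USD_PER_M.input +
      cachedTokens * USD_PER_M.cacheRead +
      outputTokens * USD_PER_M.output) /
    1_000_000
  );
}

// Re-sending a 50k-token conversation plus a 1k-token reply:
console.log(turnCost(50_000, 0, 1_000).toFixed(3));      // "0.165" -> no cache
console.log(turnCost(50_000, 48_000, 1_000).toFixed(3)); // "0.035" -> mostly cached
```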
I'm waiting for the day this AI bubble bursts, since as far as we can tell almost all of these AI "providers" are operating at a loss. I wonder if this billing model actually makes a profit or if it's still just burning cash in hopes of AGI being around the corner. We have yet to see a product that is useful and affordable enough to justify the cost.
It's burning cash. Lots of it.
[0] https://www.wheresyoured.at/openai-is-a-systemic-risk-to-the...
Great article, thanks. It mirrors exactly what the JP Morgan/Goldman report claimed, though that one was quite dated.
It sounds insane until you drive full agentic loops/evals. I'm currently making a self-compiling compiler; no doubt you'll hear/see about it soon. The other night, I fell asleep and woke up with interface dynamic dispatch using vtables with runtime type information and generic interface support implemented...
Do you actually understand the code Claude wrote?
Do you understand all of the code in the libraries that your applications depend on? Or your coworker for that matter?
All of the gatekeeping around LLM code tools is amusing. But whatever, I'm shipping 10x and making money doing it.
Up until recently I could be sure they were written by a human.
But if you are making money by using LLMs to write code then all power to you. I just despair at the idea of trillions of lines of LLM generated code.
Well, you can’t just vibe code something useful into existence despite all the marketing. You have to be very intentional about which libraries it can use, code style etc. Make sure it has the proper specifications and context. And review the code, of course.
Fair enough. That's pretty cool, I haven't gone that far in my own work with AI yet, but now I am inspired to try.
The point is to get a pipeline working, cost can be optimized down after.
Seriously? That’s wild. What kind of CS field could even handle that kind of daily spend for a bunch of people?
Consider an L5 at Google: outgoings of $377,797 USD per year just on salary/stock, before fixed overheads such as insurance and leave, and issues like ramp-up time and the cost of their manager. In the hands of a Staff+ engineer, these tools replicate the output of Staff+ engineers, and they don't sleep. My 2c: the funding for the new norm will come from compressing the manager layer, the engineering layer, or both.
LLMs absolutely don't replicate staff+ engineers.
If your staff engineers are mostly doing things AI can do, then you don't need staff. Probably don't even need senior
That's my point.
- L3 SWE II - $193,712 USD (before overheads)
- L4 SWE III - $297,124 USD (before overheads)
- L5 Senior SWE - $377,797 USD (before overheads)
These tools and foundational models get better every day, and right now, they enable Staff+ engineers and businesses to have less need for juniors. I suspect there will be [short-to-medium-term] compression. See extended thoughts at https://ghuntley.com/screwed
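For comparison with the $100-$500/day tool spend quoted upthread, here is the per-working-day cost of those levels, assuming ~220 working days per year and ignoring overheads (both are assumptions):

```ts
// Per-working-day cost of the quoted Google compensation figures.
const WORKING_DAYS_PER_YEAR = 220; // assumption

const annualComp: Record<string, number> = {
  "L3 SWE II": 193_712,
  "L4 SWE III": 297_124,
  "L5 Senior SWE": 377_797,
};

for (const [level, usd] of Object.entries(annualComp)) {
  console.log(`${level}: ~$${Math.round(usd / WORKING_DAYS_PER_YEAR)} per working day`);
}
// L3 ≈ $881/day, L4 ≈ $1,351/day, L5 ≈ $1,717/day
```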
> These […] get better every day.
They do, but I’ve seen a huge slowdown in “getting better” in the last year. I wonder if it’s my perception, or reality. Each model does better on benchmarks but I’m still experiencing at least a 50% failure rate on _basic_ task completion, and that number hasn’t moved higher in many months.
I wonder what will happen first: will companies move to LLMs, or to programmers from abroad? Because ultimately the latter will be cheaper than using LLMs. You've said ~$500 per day; in Poland, ~$1500 is a good monthly wage, and that still makes us expensive! How about moving to India, then? Nigeria? LATAM countries?
> in Poland ~$1500 will be a good monthly wage
The minimum wage in Poland is around USD 1240/month. The median wage in Poland is approximately USD 1648/month. Tech salaries are considerably higher than the median.
Idk, maybe for an intern software developer it's a good salary...
The minimum is ~$930/month after taxes, though; I rarely see people here talk about pre-tax salaries, tbh.
~$1200 is what I'd get paid here after a few years of experience; I have never seen an internship offer in my city that paid more than minimum wage (most commonly, it's unpaid).
The industry has tried that, and the problems are well known (timezones, unpredictable outcomes in terms of quality and delivery dates)...
Delivery via LLMs is predictable and fast, and any concerns about outcome [quality] can be programmed away by rejecting bad outcomes. This form of programming the LLMs has a one-time cost...
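A minimal sketch of what "programming away" bad outcomes could look like: an agent loop that re-prompts until the generated change passes the project's own checks. Both helpers (`generatePatch`, `runChecks`) are hypothetical stand-ins, not part of any particular tool.

```ts
// Hypothetical reject-and-retry loop around an LLM code generator.
type CheckResult = { ok: boolean; log: string };

async function generateUntilGreen(
  task: string,
  generatePatch: (task: string, feedback?: string) => Promise<string>, // LLM call (stand-in)
  runChecks: (patch: string) => Promise<CheckResult>,                   // tests/lint (stand-in)
  maxAttempts = 5,
): Promise<string> {
  let feedback: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const patch = await generatePatch(task, feedback); // generate a candidate change
    const result = await runChecks(patch);             // accept only outcomes that pass
    if (result.ok) return patch;
    feedback = result.log;                             // feed failures back into the next attempt
  }
  throw new Error(`No passing patch after ${maxAttempts} attempts`);
}
```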
Oh, but they absolutely do. Have you not used any of this LLM tooling? It's insanely good once you learn how to employ it. I no longer need a front-end team, for example. It's that good at TypeScript and React. And the design is even better.
The kind of field where AI builds more in a day than a team or even contract dev does.
Correct; utilised correctly, these tools ship a team's worth of output in a single day.
Do you have a link to some of this output? A repo on Github of something you’ve done for fun?
I get a lot of value out of LLMs but when I see people make claims like this I know they aren’t “in the trenches” of software development, or care so little about quality that I can’t relate to their experience.
Usually they’re investors in some bullshit agentic coding tool though.
I will shortly; I'm building a serious self-compiling compiler right now out of a brand-new esoteric language. Meaning the LLM is able to program itself without training data about the programming language...
I would hold on on making grand claims until you have something grand to show for it.
Honestly, I don't know what to make of it. Stage 2 is almost complete, and I'm (right now) conducting per-language benchmarks to compare it to the Titans.
Using the proper techniques, Sonnet 3.7 can generate code in the custom lexical grammar/stdlib. So, in my eyes, the path to Stage 3 is unlocked, but it will chew through lots and lots of tokens.
> a serious self-compiling compiler
Well, virtually every production-grade compiler is self-compiling. Since you bring it up explicitly, I'm wondering what implications of being self-compiling you have in mind?
> Meaning the LLM is able to program itself without training data about the programming language...
Could you clarify this sentence a bit? Does it mean the LLM will code in this new language without training on it beforehand? Or is it going to enable the LLM to program itself to gain some new capabilities?
Frankly, with the advent of coding agents, building a new compiler sounds about as relevant as introducing a new flavor of assembly language, and a new assembly may at least be justified by a new CPU architecture...
All can be true depending on the business/person:
1. My company cannot justify this cost at all.
2. My company can justify this cost but I don't find it useful.
3. My company can justify this cost, and I find it useful.
4. I find it useful, and I can justify the cost for personal use.
5. I find it useful, and I cannot justify the cost for personal use.
That aside -- $200/day/dev for a "nice-to-have service that sometimes makes my work slightly faster" is a lot of money in most of the world.