Going back to AI Sweet Spot: Part 2 - Not Good Enough

Before I write about how I am mitigating the issues I have with AI, I have to talk about the issues themselves.

There is a lot of great writing out there that explains the shortcomings of AI agents, such as Addy Osmani’s 70% Problem, Armin Ronacher’s Agent Psychosis, and these tweets by Mitchell Hashimoto: Companies with AI Psychosis and Fake Optimization.

So, in the interest of not repeating them, and with the confession that I don’t have concrete, shareable examples for every single item I have seen, but trusting PG’s point in How to Know, I am going to share what I felt here.

Note: The issues stated here concern agentic systems that build production-level software. The target here is not MVPs, autocomplete features, or writing a single function. Also, all of my experiences come only from Opus and the highest tier of GPT. Other models are out of scope.

Alignment

I am a strongly opinionated individual. For instance, I care about architecture, TDD, documentation, naming, boundaries, and the abstractions that shape how a system feels over time. I believe comments should explain why, not what, and should be short. For most applications, maintainability is the main lens: the best solution is the one that is easy to review, expand, debug, and onboard into, not necessarily the fastest, cheapest, or most theoretically clever one.

Now, when I use an agent to code, I am faced with a dilemma. Either I cut down on my opinions and get a half-baked result, or I share my opinions in good detail, plus the initial 13K system tokens, plus 1,000 skill tokens, plus MCP tokens, and then face context rot, only to get a half-baked result in another form.

And that is just one aspect of this alignment problem. The other part is how meticulously an agent can replicate these needs, how many steps and instructions it can hold in its KV cache, and how reliably it can keep following them. I acknowledge that there are endless techniques for this, such as compacting, task-specific focus, and using sub-agents. I use them all. But I still find things lacking.

Reward Hacking

It is 2026, and I still see reward-hacking behavior.

Anthropic, in this paper, explains how models can have footprints of emotional responses, similar to method acting, and how these can affect their behavior and result in reward hacking. After all, desperate times require desperate measures.

We all have bad days. We all face hard challenges. It is completely possible to be in a stressful state. Even worse, these moments are usually correlated with important moments at work: a new customer launch, a deadline, or a new feature you are developing in an area you are less familiar with. And now, on top of that, I also have to manage my tone with the agent, because if I sound frustrated, it might mislead me and quietly work against what I actually need?

Where I have felt most of the reward-hacking behavior is exactly where I need trust the most!

Not a Reliable Relationship

[Image: the weak uptime status of Claude Code over the last 3 months]

If you were using Claude Code in early 2026, you probably went through the awful experience of 4.6 suddenly dumbing down. 4.7 and 4.8 never fully recovered.

Each model is different from the previous one, with some type of personality trait or behavior change, and they have started to have increasingly shorter lifespans. By the time I reliably get to know a model, understand its pros and cons, and build a good mental model of where to be agentic and where to guide more, a new model has already come out. If not, the system prompt may have changed, or Boris may have pushed a new feature that impacts your prompt.

All of this assumes Claude is not down, that you can actually connect, and that you are not getting a 503.

Last but Not Least

LLMs are fundamentally statistical prediction engines trained on the vast history of public repositories (and private ones too? 🤫). Naturally, they do not generate pristine code. They generate the most mathematically probable code.

However, the average software code in the wild is absolute garbage. Real-world codebases are heavily polluted with quick fixes, abysmal error handling, architectural spaghetti, and compounding technical debt. These models inherently reproduce the industry’s worst habits. While this is acceptable as a byproduct that builds up over time in a mission-driven project, I personally do not want it to be the starting point for my code.

To Sum Up

When I put all of the above notes together, I see an increasing tax.

On top of having the design and spec in mind, I also have to manage my tone, make sure I have delivered my message exactly as intended, include all the context that lives in my head, and make sure I am not overwhelming Claude. Then I wait through increasingly slow responses, only to sometimes get a 503 overload error.

I do not need an agent. I want a small assistant.

Bad Agent! Bad Agent!

🎥 Fun Fact: Sitcoms are my favorite TV genre.
