I'd been keeping track of AI code-engineering tools for years, and honestly? I wasn't impressed.
Copilot never did much for me beyond fancy autocomplete and churning out the obvious unit tests. A good linter would have covered most of its benefits, and I just didn't see the future I kept being told was already in front of me. So every few months I'd do the same thing: spend an evening pulling the current crop back down, throwing real work at them, seeing what had actually changed. For a long time the answer was not much.
Then sometime last fall, October I think, someone mentioned Cursor. I'm fairly sure it was my VP. I gave it a shot, mostly out of the same habit as always.
This time was different. The tool was writing actual code for me. Not stellar. I'm not going to pretend the first session was magic. But for the first time it was clearly, unmistakably worth investigating. So I dove in. Deep. I haven't really come up for air since.
Being me, I didn't want to run on vibes. I put together a very primitive set of benchmarks and ran the same real-world tasks through the serious contenders: Cursor, Codex, Windsurf, Augment, and Claude.
The benchmark was three dumb questions
It was not science. I had a handful of tasks I could throw at any tool in a few minutes and learn something real from the answers:
- "Describe the architecture of the work inside this repo."
- "Find the three biggest design issues in our system."
- "Fix the bug with ID X."
Silly, primitive, the kind of thing you'd be embarrassed to call a benchmark in a meeting. But they were great for one thing: a fast read on whether a tool actually understood what it was looking at, or was just good at autocompleting the next line.
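If you want to keep score rather than trust memory, a grid like this can be sketched in a few lines of Python. Treat it as an illustration and not as the setup described above: the hand-assigned 0-3 score, the `Result` record, and the function names `totals`, `leaders`, and `missing_runs` are all assumptions made up for the example. Only the three prompts and the five tool names come from the post.

```python
from dataclasses import dataclass
from datetime import date

# The three prompts from the post, verbatim. "X" stays a placeholder.
TASKS = [
    "Describe the architecture of the work inside this repo.",
    "Find the three biggest design issues in our system.",
    "Fix the bug with ID X.",
]

# The contenders named in the post.
TOOLS = ["Cursor", "Codex", "Windsurf", "Augment", "Claude"]


@dataclass(frozen=True)
class Result:
    """One tool's answer to one task, graded by hand."""
    tool: str
    task: str
    score: int    # hypothetical rubric: 0 (useless) to 3 (clearly understood the repo)
    run_on: date  # when the run happened, since rankings go stale


def totals(results: list[Result]) -> dict[str, int]:
    """Sum the hand-assigned scores per tool."""
    out: dict[str, int] = {}
    for r in results:
        out[r.tool] = out.get(r.tool, 0) + r.score
    return out


def leaders(results: list[Result]) -> list[str]:
    """Return every tool tied for the top total, sorted by name."""
    per_tool = totals(results)
    if not per_tool:
        return []
    best = max(per_tool.values())
    return sorted(name for name, total in per_tool.items() if total == best)


def missing_runs(results: list[Result]) -> list[tuple[str, str]]:
    """List the (tool, task) pairs not yet run, so no cell in the grid gets skipped."""
    done = {(r.tool, r.task) for r in results}
    return [(tool, task) for tool in TOOLS for task in TASKS if (tool, task) not in done]
```

Starting from an empty list, `missing_runs([])` returns all fifteen tool and task pairs. Stamping each result with `run_on` is the design choice that matters most: it keeps an old grid from passing itself off as a current one.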
Round one went to Augment
My first results pointed pretty clearly at Augment.
Augment won that round because at the time it was the only tool that actually understood our codebase. We've historically had some people sneak in what I call "blog-post architecture" (yes, I see the irony), and Augment saw right through the mish-mash of patterns and folder structures. The others were close behind, and roughly as good at writing code in isolation. But only Augment could take a prompt about two completely unrelated parts of the application and reason across both of them at once.
That was the whole game for me. Writing a clean function in a vacuum is table stakes now. Holding the shape of a real, messy, lived-in system in its head while writing that function is the thing that's actually hard for a tool, and at the time Augment was alone in being able to do it.
It helped that Augment let you choose the underlying frontier model (Anthropic, OpenAI, even open-weight options), so it wasn't really one tool versus one model. My team jumped in with me, and for a while Augment was the house pick.
The tool that hooked me didn't win
Here's the part I didn't expect: Cursor, the tool that pulled me into all of this in the first place, came in the middle of the pack.
It gave the best first impression by a mile. It was the one that made me believe the category was real. But when I actually measured it against the others on our code, it was a merely-good daily driver, not the winner. The demo was the best part. The whole-system reasoning, the thing I'd started to care about most, just wasn't there yet the way it was in Augment.
The moment it lost me was small and very telling. I asked it, in plain words, for a Web API endpoint. It went off and built me an Azure Function instead. Not a wrong-but-defensible reading of a vague request. I had said the exact thing I wanted, and it quietly decided it knew better. On a small task that's a thirty-second fix. On a big one it's the kind of detour you don't notice until it's three files deep and load-bearing.
I keep saying this to people: the tool that converts you and the tool that wins are allowed to be different tools. Cursor did its job. It just wasn't the job I thought it was.
Then the frontier moved under all of it
While I was busy running my little benchmark, the juggernauts were stomping around and getting faster with every step. Anthropic and OpenAI kept shipping: faster, then better, then faster, then more capable, then faster again.
By the end of January, Claude was clearly out ahead, and our entire company moved that direction.
If there's one durable lesson in here, it's this one: any ranking you make of these tools has a shelf life measured in weeks. The thing that won in November was not the thing that won in February. I stopped treating "best tool" as a fact and started treating it as a snapshot.
The stack I actually reach for now
Here's the honest inventory, June 2026. It comes in two versions, because what I run on my own licenses at home and what I run inside a corporate Azure tenant are not the same stack, and pretending they were would be the curated answer.
At home, on my own dime, it's two pieces: Claude and Augment. Claude Code is the first thing I open, and it calls the Augment Context Engine over MCP, so Augment's codebase comprehension shows up as a tool Claude reaches for on its own, mid-task, without me asking. The two tools I spent an evening pitting against each other last fall now run stacked on top of each other instead. That's the part of the setup actually worth copying.
It's a cleaner ending to the benchmark story than I expected, too. The tool that beat everyone in November didn't lose when Claude pulled ahead in January. Its best trick just got absorbed. The comprehension was always the valuable part, more than the speed, more than the autocomplete, and it ended up exactly where it's most useful: inside the thing I already have open. (I'm keeping Claude Code itself brief here, because it's getting its own post in a couple of weeks.)
At work it's messier, the way work stacks always are. Call it ninety-five percent Claude Code and Claude through Azure AI Foundry, because when the license isn't yours the enterprise plumbing is half the decision. Copilot and Codex are still around for the other five percent, more for wrangling tooling and documents than for any of the real building. Less elegant than the home setup. More honest about what shipping inside an actual company looks like.
The thing that actually compounds
The tools matter less than the shift in how I work, which is the part nobody put in a launch video.
I stopped writing the first draft of things. Not the prose, the code. I describe the goal to an agent the way I'd describe it to a sharp colleague who just walked in with no context, I read the plan it comes back with, and only then does anything get written. My job moved from typing the thing to deciding whether the thing is right. That sounds small. It compounds enormously, because I review far faster than I type, and the review is where the value was hiding the whole time.
(If that "sharp colleague who just walked in" framing sounds familiar, it's the same habit I wrote about a couple of weeks ago. It turns out a prompt worth reusing and a prompt worth handing an agent are the same prompt.)
One more thing
All of this, the benchmark, the stack, the workflow, is what I used to build something I'll show you in a few weeks. I call it TouchDown. It plays football, sort of. That's all I'll say about it for now.
Where this actually leaves us
These tools are genuinely good now. Better than I expected a year ago, when I was still confidently telling people Copilot was a fancy linter. The future I kept getting sold did mostly show up. It just showed up quietly, as a better daily workflow, instead of loudly, as the magic the marketing promised.
They're also still oversold in the specific ways that matter: the demo is always better than the Tuesday, the "it just works" examples are always the easy ones, and the ranking you trust today is wrong by next quarter. Both things are true. The trick is to keep re-running your own dumb little benchmark, because the only honest answer to "which AI tool should I use" is "the one that won when you last checked, so check again soon."
Case in point: I'd bet my home stack looks nothing like this by the fall. I'll have torn it down and rebuilt it around something I don't have a clean name for yet. That isn't me hedging. It's the one prediction in here I'd actually put money on.