
The test AI didn't write


The promise of AI at work is throughput. More output, more coverage, more done per hour, and that promise is real. The part nobody puts in the pitch is the part that cost my team an incident: throughput and judgment are different things, and AI raises the first while leaving the second exactly where you left it.

Here’s how I learned the difference.

The plan was good. That was the problem.

A while back we had a major mobile release and a QA bottleneck at the worst possible time. The engineer who owns QA on my team was out, the release date didn’t move, and I picked up the testing myself. I lead the team; hands-on test planning isn’t the seat I usually sit in. So I did the reasonable 2026 thing and built the test plan with AI.

It was good. Genuinely good. It laid out the flows I’d have written and a few I wouldn’t have: error states, a dropped connection mid-purchase, the small disasters you forget under deadline. I read it, I nodded, and I felt the specific relief of a hard job made easy. That relief is the tell. I’ll come back to it.

We ran the plan. The release went out. A few days later it came back as an incident.

What a real QA would have asked first

The plan tested the feature on the new build. It never tested it on the older versions of the app still in people’s hands. The change broke on those older clients, the feature flag we’d have used to contain it never reached that path, and because no one was watching that corner, it ran unnoticed for days. By the time we caught it we were issuing refunds and make-goods, shipping a backend hotfix, and pushing a mobile hotfix behind it.

Here’s what stuck with me. The QA on my team would almost certainly have caught it. Not because she’s smarter than the model, but because she lives in that product. “Does it still work on the old app?” isn’t an edge case to her. It’s a reflex, the first question her experience asks before she reads a single line of the plan. That reflex is exactly what the model didn’t have, and exactly what I failed to supply, because I was covering a role whose instincts I don’t own.

The failure was trust, not the tool

It would be easy to say the AI failed. It didn’t. It did the thing it’s good at: it amplified me. The trouble is what it found to amplify. In my own seat I bring judgment that catches a plan’s blind spots. In the QA seat I brought almost none, so the model amplified a gap instead of a strength, and it did so fluently enough that I never went looking for the seam.

That’s the failure mode worth naming, and it’s an old one wearing new clothes: automation complacency. A system that’s right most of the time trains you to stop checking the times it’s wrong. The better the output looks, the less you audit it, and a polished, edge-case-rich test plan looks an awful lot like a finished one. The relief I felt reading it was me stopping early and calling it done.

The human skills are the load-bearing ones now

I think this is the shape of the next few years, and it lands hardest on new leaders. AI will hand you throughput in places where you’re not the expert. You will cover the role, draft in the unfamiliar language, ship in the stack you barely know. The output will look right. And the only thing standing between “looks right” and “is right” is judgment you may not have in that domain.

So the human skills aren’t the soft, optional layer on top of the technical work. They’re the part holding the rest up. Knowing which question the model didn’t ask. Knowing which five percent to re-check. Having the resilience to slow a release when everyone wants it out, and the cognitive room left over, after the AI handled the easy ninety percent, to notice the hard ten. Throughput went up, and the tax on weak judgment went up with it. You don’t out-type that tax; you pay it down by getting better at the ten percent the model can’t reach.

What I changed

Three things, concretely.

When AI covers for a role I don’t own, I borrow that role’s instincts on purpose. Before I trust the artifact, I ask the person whose seat it is, even async, even one line: what’s the first thing you’d check? Five minutes of a real QA’s reflexes would have bought back the entire incident.

I treat a clean-looking AI output as a reason to audit harder, not a reason to relax. When the work arrives polished, that’s precisely when I have the least signal about what it missed. Relief is now my cue to slow down, not speed up.

And I stopped mistaking coverage for correctness. Ten thoughtful test cases that skip the one environment your users actually run aren’t ninety percent of a test plan. They’re a confident zero on the case that ships the bug.
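
One way to make that check mechanical, as a minimal sketch rather than what we actually run: compare the client versions a plan targets against the versions still in people's hands. Everything here is hypothetical, including the `PlannedCase` type, the `uncovered_versions` helper, the version numbers, and the user shares.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedCase:
    """One entry in a test plan (hypothetical shape, for illustration only)."""
    name: str
    client_version: str  # app version the case is executed against


def uncovered_versions(
    plan: list[PlannedCase], active_share: dict[str, float]
) -> dict[str, float]:
    """Return active client versions that no case targets, with their user share."""
    tested = {case.client_version for case in plan}
    return {v: share for v, share in active_share.items() if v not in tested}


# Assumed numbers: share of active users per app version.
active_share = {"5.0": 0.55, "4.9": 0.30, "4.8": 0.15}

# A plan that looks thorough but only exercises the new build.
plan = [
    PlannedCase("happy path purchase", "5.0"),
    PlannedCase("dropped connection mid-purchase", "5.0"),
    PlannedCase("payment error state", "5.0"),
]

gaps = uncovered_versions(plan, active_share)
for version, share in sorted(gaps.items()):
    print(f"no test cases for client {version} ({share:.0%} of active users)")
```

A check like this only tells you an environment is missing. It can't tell you which environments belong on the list in the first place; that still comes from someone who lives in the product.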

AI did not write the test that mattered, and it was never going to. It can amplify the judgment you bring to the table. It cannot lend you the judgment you don’t have. The real work, the human work, is becoming the kind of engineer, and the kind of leader, worth amplifying.

Author: Chandler Thompson
I lead engineering teams and coach the people who run them. This is where I write down what actually worked.
