AI Agents

Most Things You Could Build Aren't Worth Building

An AI bug-fixing agent sounds obvious until you price the operating loop: permissions, context, evals, runtime evidence, reviewer burden, and proof that it changed the workflow. The better product may be diagnosis before code generation.

I talked myself out of a project I wanted to build.

I am always looking for places where automation can remove real operational load. Support issues seemed like a good candidate, especially the tickets that customer service escalates to engineering.

Bugs are a steady load. For us, they represented about an eighth of engineering capacity. An agent that reads the customer report, inspects the code, drafts a fix with a test, and opens a pull request sounds like a natural fit. The demo looks like engineering work.

The product question is harder: does the system move a bug from reported to understood to safely fixed, with enough evidence to trust the handoff? Capability starts the conversation. The product work is context, permissions, evals, stop conditions, reviewer burden, and proof that the workflow changed state.

So I started before code generation.

Start with permissions

The agent was never going to merge code or get production access.

We sell to regulated companies, including banks, healthcare organizations, and EU customers. Raw customer data was out. The workflow had to use approved context: tickets, logs, safe reproduction data, source code, and internal discussion.

Those constraints define the product. A bug-fixing agent with unlimited context is different from one that works inside regulated boundaries. The bounded version is the real one.

That made the first test simple. Before asking whether the agent could write the fix, I needed to know whether it could identify the cause from allowed context.
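The boundary described above can be written down before any model is involved. The following is a minimal sketch, not our actual configuration: the source labels, action names, and the `ContextItem` type are all hypothetical, chosen only to mirror the approved context and the "no merge, no production access" rule.

```python
from dataclasses import dataclass

# Assumed labels for the approved context listed in the text; names are illustrative.
APPROVED_SOURCES = {
    "ticket",
    "log",
    "safe_repro_data",
    "source_code",
    "internal_discussion",
}

# Actions the agent may take. Merging code and production access are deliberately absent.
ALLOWED_ACTIONS = {"read_context", "update_ticket", "label_ticket"}


@dataclass(frozen=True)
class ContextItem:
    source: str  # where the item came from, e.g. "ticket"
    text: str


def filter_context(items):
    """Keep only items from approved sources; drop everything else."""
    return [item for item in items if item.source in APPROVED_SOURCES]


def is_allowed(action):
    """Return True only for explicitly allowed actions (deny by default)."""
    return action in ALLOWED_ACTIONS
```

The design choice is deny by default: anything not named, such as raw customer data or a merge action, is excluded without needing its own rule.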

Test diagnosis first

I connected the agent to Jira, Zendesk, and Slack through MCP to give it the same messy context engineers use at triage: support report, ticket history, related discussion, and code.

Then I built a root-cause skill. For each issue, the agent had to answer a narrower question than "fix this bug." Was this a defect, expected behavior, a product gap, or not enough information? If it saw a defect, it had to cite the suspected cause and supporting code path or product behavior.
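One way to hold the agent to that narrower question is a structured output that is rejected when a defect verdict arrives without a cause and evidence. This is a sketch under assumptions: the `Verdict` and `Diagnosis` names and fields are illustrative, not the schema the skill actually used.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    DEFECT = "defect"
    EXPECTED_BEHAVIOR = "expected_behavior"
    PRODUCT_GAP = "product_gap"
    NOT_ENOUGH_INFO = "not_enough_info"


@dataclass
class Diagnosis:
    ticket_id: str
    verdict: Verdict
    suspected_cause: str = ""
    evidence: str = ""  # supporting code path or product behavior

    def problems(self):
        """Return the reasons this diagnosis is incomplete (empty list if none)."""
        issues = []
        if self.verdict is Verdict.DEFECT:
            if not self.suspected_cause.strip():
                issues.append("defect verdict needs a suspected cause")
            if not self.evidence.strip():
                issues.append("defect verdict needs a supporting code path or product behavior")
        return issues
```

A diagnosis with a non-empty `problems()` list would be sent back rather than written to the ticket, which keeps "not enough information" available as an honest answer.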

The agent updated the ticket. A second pass judged whether the issue had enough context to be a fix candidate and labeled it.

Then we waited.

That waiting period was the test. The agent could sound convincing immediately; the real comparison only became possible after engineers fixed the issues through the normal process. Once real fixes existed, I compared the agent's diagnosis against the shipped code. It got credit only when its suspected cause matched the actual fix.

That gave the experiment a measurable result: the share of diagnoses that matched what actually shipped.
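The comparison itself is simple bookkeeping once a human has judged each diagnosis against the shipped fix. A minimal sketch, assuming each record is a hypothetical `(bug_class, matched)` pair; the function name and record shape are illustrative.

```python
from collections import defaultdict


def match_rate_by_class(records):
    """Summarize how often the agent's suspected cause matched the shipped fix.

    records: iterable of (bug_class, matched) pairs, where matched is True
    when a reviewer judged the suspected cause to match the actual fix.
    Returns {bug_class: (matches, total, rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # bug_class -> [matches, total]
    for bug_class, matched in records:
        counts[bug_class][1] += 1
        if matched:
            counts[bug_class][0] += 1
    return {
        bug_class: (matches, total, matches / total)
        for bug_class, (matches, total) in counts.items()
    }
```

Breaking the rate out by bug class matters more than the overall number, because it shows where the agent can be trusted and where it cannot.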

Plausible is expensive

A plausible but wrong fix is expensive. A reviewer who trusts it can ship a regression. A reviewer who does not trust it has to re-derive the diagnosis, which can cost more than writing the fix would have. In our sample, the system was plausibly wrong about half the time.

The first sample made the boundary clear. The agent was strong on self-contained backend logic, configuration, validation, and template bugs. One form submission dropped a field because the payload exceeded PHP's input-variable limit; the agent proposed the JSON-body fix that shipped. Another bug came from a missing token in an email template's substitution map; the agent got that right too.

The misses were mostly runtime behavior, cross-component data flow, bundling, and two systems that looked similar but were not. The agent could reason from source. It struggled with behavior that only appeared in production.

Our system makes that harder. Each customer has a different feature mix, and some have custom behavior. That is common in enterprise software, and it turns a clean demo into an operating problem.

The realistic solvable range looked like 10–20% of issues. With bugs taking roughly 12% of engineering capacity, that translated to roughly 1–2% of total engineering capacity. For a larger organization, that might be meaningful. For a smaller team like ours, connector maintenance, prompt upkeep, and review load would likely consume the savings.
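The arithmetic behind that estimate is a single multiplication, shown here as a sketch. The function name is illustrative; the inputs are the figures from the text, and the result is the ceiling before any operating cost is subtracted.

```python
def reclaimable_capacity(bug_share, solvable_share):
    """Fraction of total engineering capacity the agent could touch:
    share of capacity spent on bugs times share of bugs it can solve."""
    return bug_share * solvable_share


BUG_SHARE = 0.12  # share of engineering capacity spent on bugs, from the text

low = reclaimable_capacity(BUG_SHARE, 0.10)
high = reclaimable_capacity(BUG_SHARE, 0.20)
print(f"{low:.1%} to {high:.1%} of total engineering capacity")
# prints "1.2% to 2.4% of total engineering capacity"
```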

The operating loop is the product

The visible part of a bug-fixing agent is the pull request. The product is the operating loop around it.

If the agent needs runtime evidence, it needs a way to spin up the app, seed data, hit the failing flow, and read the error. In a large monorepo, that is standing infrastructure with an owner.

Then comes prompt upkeep, connector maintenance, model churn, and review tax on every plausible miss. The team also needs release rules: eligible bug classes, required tests, stop conditions, and reviewer evidence.
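Release rules like these can be expressed as an explicit gate that returns reasons rather than a bare yes or no. The sketch below is hypothetical: the class names in `ELIGIBLE_CLASSES` are taken from the bug types where the agent did well, and the `FixCandidate` fields are assumed stand-ins for required tests, reviewer evidence, and a stop condition.

```python
from dataclasses import dataclass

# Hypothetical starting set, based on the classes where the agent was strong.
ELIGIBLE_CLASSES = {"backend_logic", "configuration", "validation", "template"}


@dataclass
class FixCandidate:
    bug_class: str
    has_regression_test: bool
    cites_code_path: bool         # reviewer evidence: diagnosis points at concrete code
    needs_runtime_evidence: bool  # stop condition: cannot be confirmed from source alone


def release_blockers(candidate):
    """Return the reasons a candidate may not go to review; empty means eligible."""
    blockers = []
    if candidate.bug_class not in ELIGIBLE_CLASSES:
        blockers.append("bug class not eligible")
    if not candidate.has_regression_test:
        blockers.append("missing required test")
    if not candidate.cites_code_path:
        blockers.append("missing reviewer evidence")
    if candidate.needs_runtime_evidence:
        blockers.append("stop: needs runtime evidence")
    return blockers
```

Returning the list of blockers gives the reviewer something to read and gives the team a record of why candidates were stopped.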

This is where the workflow stops being a model problem and becomes a product system. Model output is cheap. Dependable use requires context plumbing, permissions, evaluation data, recovery paths, and review that is lighter than doing the work manually.

The full version did not clear that bar. It might reclaim some bug-fixing capacity, but the operating cost and review burden were likely to land in the same range.

Ship the smaller version

The smaller version is better: triage issues, draft root-cause analysis, label likely fix candidates, and stop before code generation.

That gives engineers a head start without pretending the agent has earned ownership of the fix. It also creates the dataset for the next decision: diagnosis accuracy by bug type, candidate precision, reviewer usefulness, and time saved.

If those numbers improve, the agent can earn a narrower write path: start with the bug classes where it was strong, require tests, keep human review, and measure rework. If they do not improve, the team still gets better triage without auto-generated fixes.

Several reviewed bugs were regressions. A faster bug-fixing system does not reduce the rate at which the product creates bugs. Better tests around the paths that keep breaking may return more than faster repair.

Make the agent prove itself at the cheapest useful layer. Diagnose before fixing. Label before opening pull requests. Compare against shipped fixes. Price review as work.

Most AI agent ideas should start here: what context may it see, what action may it take, what state should change, what evidence proves it helped, and where does the human stay in control?