When more powerful tools make agents worse

Atlas got more capable when several narrow finance tools became one generalized scenario engine. It also got easier for the agent to choose the wrong path.

I thought agent tools should consolidate. If four tools answer overlapping scenario questions, replace them with one more expressive engine. Fewer APIs, more capability, cleaner implementation. That is mostly true inside the codebase. At the product surface, it can be more complicated.

In Atlas, a generalized scenario engine replaced several narrower commands. Internally, this was a good move. The engine could evaluate different kinds of financial changes through one pure computation path. It reduced duplicate logic and made new scenario types easier to add.
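
A sketch of that shape, with invented names throughout (`ScenarioChange` and `evaluateScenario` are illustrations, not Atlas's real API): a discriminated union of change kinds feeding one pure projection loop.

```ts
// Hypothetical engine sketch: several scenario types, one pure path.
type ScenarioChange =
  | { kind: "income"; deltaPerMonth: number }
  | { kind: "expense"; deltaPerMonth: number }
  | { kind: "purchase"; amount: number; month: number };

interface ScenarioResult {
  endingBalance: number;
  feasible: boolean;
}

// Every scenario type reduces to the same month-by-month projection.
// No I/O, no state: adding a new change kind extends the union and
// this loop, nothing else.
function evaluateScenario(
  startingBalance: number,
  months: number,
  changes: ScenarioChange[],
): ScenarioResult {
  let balance = startingBalance;
  let feasible = true;
  for (let month = 0; month < months; month++) {
    for (const change of changes) {
      if (change.kind === "income") balance += change.deltaPerMonth;
      if (change.kind === "expense") balance -= change.deltaPerMonth;
      if (change.kind === "purchase" && change.month === month) {
        balance -= change.amount;
      }
    }
    if (balance < 0) feasible = false;
  }
  return { endingBalance: balance, feasible };
}
```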

But the agent now had a larger action space. Questions that used to map cleanly to one verb could be answered by multiple paths. Some paths were technically valid but semantically weaker. Others existed only because an old flag had not been deleted yet. The tool became more powerful, but the product became less crisp.

Generality can live inside the engine. The interface can stay more specific.

Capability expands the routing problem

A generalized tool asks the agent to do more product judgment. Which mode fits? Is this a comparison question or a feasibility question? Is this a state update or a sandbox? Is the user asking "can I do this now" or "when would this become possible"?
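
To make that concrete, here is one user question routed two plausible ways against a hypothetical generalized surface (reusing the invented `ScenarioChange` from the sketch above). Both calls are valid requests; they answer different questions.

```ts
// Hypothetical generalized surface: one entry point, many modes.
type ScenarioQuery =
  | { mode: "feasibility"; changes: ScenarioChange[] }
  | { mode: "compare"; baseline: ScenarioChange[]; changes: ScenarioChange[] };

// User asks: "Can I afford a $3,000 purchase in March?"
const purchase: ScenarioChange = { kind: "purchase", amount: 3000, month: 2 };

// Route A: a feasibility check. Answers "does the balance stay above zero".
const routeA: ScenarioQuery = { mode: "feasibility", changes: [purchase] };

// Route B: a baseline-vs-scenario comparison. Technically valid, but it
// answers "how much worse off would I be", which is the weaker reading.
const routeB: ScenarioQuery = {
  mode: "compare",
  baseline: [],
  changes: [purchase],
};
```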

Humans can tolerate some ambiguity in a command surface because we stop and think about intent. Agents often optimize for the first plausible route. If two routes work, the eval has to pick a canonical one. If the eval and the agent disagree, you get a rubric collision, not necessarily a reasoning failure.

The fix was specialization at the edge

The pattern that worked was not to abandon the generalized engine. It was to keep generality inside the implementation and restore specificity at the interface. Conceptually different questions got distinct verbs. Old flags that created overlap were removed instead of deprecated softly. Wrong paths failed fast with useful messages.
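
A sketch of that edge layer, continuing the invented names from above: each public verb is a thin wrapper that fixes the mode and forwards to the shared engine, so the routing decision is made once, in code, instead of once per agent turn.

```ts
// Specific verbs at the interface; one general engine underneath.
function canIAfford(
  startingBalance: number,
  months: number,
  changes: ScenarioChange[],
): boolean {
  return evaluateScenario(startingBalance, months, changes).feasible;
}

function compareScenarios(
  startingBalance: number,
  months: number,
  a: ScenarioChange[],
  b: ScenarioChange[],
): number {
  const resultA = evaluateScenario(startingBalance, months, a);
  const resultB = evaluateScenario(startingBalance, months, b);
  // Positive means scenario b ends better off than scenario a.
  return resultB.endingBalance - resultA.endingBalance;
}
```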

Specializing the edge made the public command surface more opinionated. In this case, that tradeoff helped. Agents seemed to do better with fewer plausible options. When the system already knows the distinction, the product layer can encode the routing decision.

Strict errors beat fuzzy success

Another tempting fix was fuzzy matching. If the agent guesses an approximate name, resolve it to the closest object and keep going. That is attractive for read-only convenience. It is dangerous for mutations and scenario changes. A fuzzy match can silently target the wrong thing while training the agent that guessing is acceptable.

The pattern that worked better was strict matching plus useful error envelopes. When the agent supplied an invalid name, the tool returned valid options and told it how to retry. The agent learned from the error at the exact moment it needed help, without adding another always-loaded prompt rule.
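
A minimal sketch of that envelope, with hypothetical account data: on a miss, the tool refuses to proceed and returns the valid options plus a retry hint, rather than resolving to the nearest match.

```ts
// Strict matching with a helpful error envelope. No fuzzy fallback.
interface ToolError {
  ok: false;
  error: string;
  validOptions: string[];
  hint: string;
}

interface ToolSuccess<T> {
  ok: true;
  value: T;
}

const ACCOUNTS = new Map<string, { id: string }>([
  ["checking", { id: "acct_1" }],
  ["brokerage", { id: "acct_2" }],
]);

function resolveAccount(name: string): ToolSuccess<{ id: string }> | ToolError {
  const account = ACCOUNTS.get(name);
  if (account) return { ok: true, value: account };
  // Fail at the moment the agent needs help, with everything it needs
  // to retry correctly, instead of mutating a guessed target.
  return {
    ok: false,
    error: `Unknown account "${name}".`,
    validOptions: [...ACCOUNTS.keys()],
    hint: "Retry with one of validOptions, spelled exactly.",
  };
}
```

The envelope doubles as just-in-time instruction: the agent gets the correction in-band, so nothing new has to live in the always-loaded prompt.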

What I would keep

  • Generalize the computation layer before the interface.
  • Remove overlapping verbs and flags when they create routing ambiguity.
  • Use hard failures when an old path should stop being used (see the sketch after this list).
  • Use strict matching with helpful errors before fuzzy mutation targeting.
  • Treat every new capability as another choice the agent has to make.
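
For the hard-failure item, a sketch with an invented legacy name: the removed path throws immediately and points at its replacements, instead of lingering as a softly deprecated overlap.

```ts
// A removed path fails loudly and names the replacement.
function runLegacyWhatIf(): never {
  throw new Error(
    'what-if was removed; use "canIAfford" or "compareScenarios" instead.',
  );
}
```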

More powerful tools are not bad. They just shift work from code into routing. If the product layer does not absorb that routing work, the agent inherits it. That is where some "the model got worse" stories quietly begin.