Atlas note
When an agent says it used a tool, that is not evidence
One of the stranger Atlas failures was not that the agent got an answer wrong. It was that the answer looked tool-grounded even when the tool path had not actually become the agent's default behavior.
Atlas has a private finance agent that answers questions by calling CLI commands over a local data layer. The clean version of the design is simple: for important math, I do not want the agent improvising in prose. I want it to call a command, read the result, and explain the result to the user.
The failure was more subtle. The agent produced a forecast that was mostly right, cited the kind of command it should have used, and then changed its story when challenged. It had enough context to assemble a plausible answer manually from remembered facts, snippets of local documentation, and arithmetic in the response. But the product contract I wanted was narrower: the command should be the source of truth for important math.
The text channel lies by accident
The LLM grader saw only the final message. If the answer said, "Using the forecast command, the result is..." that looked identical to real tool use. The grader could not tell whether the agent ran the command, quoted cached knowledge, or invented a command-shaped explanation because that is what a good answer usually sounds like.
That distinction matters because agent systems have two surfaces: the execution trace and the narrative trace. Users and most graders see the narrative trace. The system's actual safety depends on the execution trace. If those are allowed to drift, an agent can appear disciplined while routing around the discipline.
The fix was infrastructure, not confidence
The first instinct is to tell the agent, more firmly, to use the tool. That helped a little, but it did not change the shape of the system. The more durable fix was to make the tool path easier than the manual path.
Future events moved into a durable data table. The forecast command learned to include those events automatically. The eval stopped rewarding claims about commands and started checking whether the answer matched the data the command would produce. In other words, the test moved from "did the agent describe the right process?" to "did the answer reflect the right source of truth?"
Environment setup is part of the product. If an eval runner points at one database while the tool gateway points at another, the agent can look unreliable when the real bug is setup. If a workspace contains stale data copies, the agent can do sincere work against the wrong world. Both are product failures in an agent system.
What I would keep
- Grade observable outputs, side effects, and source-of-truth agreement.
- Make the intended tool path shorter than the improvised reasoning path.
- Remove stale local copies when there is a canonical read path.
- Test the environment, not just the prompt.
My read is that an AI agent does not become reliable just because it knows the right command exists. It becomes more reliable when the system makes that command the easiest way to succeed, and when the eval can see whether that actually happened.