Atlas note
AI agents debug symptoms before systems
A capable agent can produce a very sophisticated explanation for the wrong failure. The missing habit is often simpler: first prove which world you are looking at.
One Atlas incident started with a basic mismatch: one view of the system reported a number that did not agree with another view. The agent investigated for a while and produced a plausible database theory. The theory sounded technical and fit the symptom, but the agent was looking in the wrong place.
The actual problem was not one bug. It was three small boundary failures stacked together: a stale workspace copy of the database, a taxonomy mismatch that tests had accidentally normalized, and a query that used a slightly different definition than the headline number. None of those is exotic. Together they created a confusing world where the agent could explain the symptom without touching the cause.
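To make the third failure concrete, here is a minimal sketch of how two code paths can compute "the same" spending number and still disagree; the names, categories, and amounts are hypothetical, not Atlas code.

```python
# Two hypothetical definitions of "spending" over the same transactions.
transactions = [
    {"amount": -40.0, "category": "groceries", "pending": False},
    {"amount": -25.0, "category": "dining", "pending": True},
    {"amount": -10.0, "category": "transfer", "pending": False},
]

def headline_spending(txns):
    # Headline definition: settled, non-transfer outflows only.
    return sum(-t["amount"] for t in txns
               if t["amount"] < 0 and not t["pending"] and t["category"] != "transfer")

def detail_spending(txns):
    # Detail view's definition: every negative amount counts.
    return sum(-t["amount"] for t in txns if t["amount"] < 0)

print(headline_spending(transactions))  # 40.0
print(detail_spending(transactions))    # 75.0 -- same data, two "truths"
```

Neither function is wrong in isolation; the mismatch only exists at the boundary between them.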
The agent skipped the boring checks
Human debugging often starts with boring questions. Am I looking at the right environment? Is this the same database the app uses? Are these two screens applying the same filter? Has a constant drifted out of sync with the underlying data?
Agents tend to start one layer deeper. They are good at producing a coherent mechanism once they see a symptom. That strength becomes a weakness when the first job is to verify the boundary of the system. A stale file can look like a persistence bug. A naming mismatch can look like missing data. Two definitions of the same metric can look like a race condition.
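As a sketch of what the first boring check can look like, assuming the live database and the workspace copy live at known paths (both paths and the helper below are hypothetical):

```python
# Before forming any theory, prove the copy you are inspecting matches the
# database the app actually reads. Paths are illustrative, not Atlas's.
import hashlib
from pathlib import Path

LIVE_DB = Path("/var/lib/atlas/app.db")    # what the running app reads
WORKSPACE_DB = Path("workspace/app.db")    # the copy the agent grabbed

def fingerprint(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()[:12]

def verify_environment() -> None:
    live, local = fingerprint(LIVE_DB), fingerprint(WORKSPACE_DB)
    if live != local:
        raise RuntimeError(
            f"workspace copy ({local}) does not match live database ({live}); "
            "debug the live source, not the stale copy"
        )

verify_environment()  # cheap to run, and it rules out an entire wrong world
```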
Make provenance visible
The repair pattern was to reduce the number of places an agent could inspect private state directly. Raw workspace copies are tempting because they make an agent feel self-sufficient. They are also a maintenance trap. Once copied, they become another version of truth.
Atlas moved more reads behind canonical CLI commands. Those commands know where the live data is, apply the shared business definitions, and return a shape designed for machine use. The agent can still reason, but it reasons over a source that the rest of the system also trusts.
The commands also became more useful when they printed enough provenance to debug the answer: the data source, the time window, the filters, and the semantic definition being applied. It is basic UI, but it gives the agent fewer ways to narrate the wrong system.
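A minimal sketch of what one of those commands might look like, assuming a SQLite store; the command, schema, paths, and the `monthly_spending` read are illustrative, not the real Atlas CLI:

```python
# Canonical read: one command knows the live source, applies the shared
# definition, and prints enough provenance to debug the answer.
import json, sqlite3, sys
from datetime import date

DB_PATH = "/var/lib/atlas/app.db"  # hypothetical single source of truth
DEFINITION = "spending = settled, non-transfer outflows"

def monthly_spending(month: str) -> dict:
    conn = sqlite3.connect(DB_PATH)
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(-amount), 0) FROM transactions "
        "WHERE amount < 0 AND pending = 0 AND category != 'transfer' "
        "AND strftime('%Y-%m', posted_at) = ?",
        (month,),
    ).fetchone()
    return {
        "value": total,
        # Provenance: not just the answer, but which world it came from.
        "provenance": {
            "source": DB_PATH,
            "window": month,
            "filters": ["amount < 0", "pending = 0", "category != 'transfer'"],
            "definition": DEFINITION,
        },
    }

if __name__ == "__main__":
    month = sys.argv[1] if len(sys.argv) > 1 else date.today().strftime("%Y-%m")
    print(json.dumps(monthly_spending(month), indent=2))
```

The point is the shape of the output: an agent that can see the filters and the source can question them, instead of inventing a mechanism around them.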
What helped
"Agent debugging" sounds like an intelligence problem. A lot of the time, my current read is that it is a product-surface problem. In Atlas, the system got better when the boring checks became cheap and obvious:
- Prefer canonical commands over copied workspace state.
- Centralize definitions like spending, income, active, or current (see the sketch after this list).
- Show provenance in tool output, especially filters and data source.
- Write tests against real contracts, not shared constants that encode the same mistake.
- Teach the agent to verify the environment before explaining the symptom.
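For the second item, a minimal sketch of what centralizing a definition can look like; the module layout and predicates are hypothetical:

```python
# definitions.py -- the only place "spending" and "active" are spelled out.
def is_spending(txn: dict) -> bool:
    """Settled, non-transfer outflow."""
    return txn["amount"] < 0 and not txn["pending"] and txn["category"] != "transfer"

def is_active(account: dict) -> bool:
    return account["status"] == "open" and not account["archived"]

# test_reports.py -- tests import the shared predicate instead of restating it,
# so a taxonomy mistake cannot be quietly normalized by a local constant.
def test_pending_charges_are_not_spending():
    txn = {"amount": -25.0, "pending": True, "category": "dining"}
    assert not is_spending(txn)

test_pending_charges_are_not_spending()
```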
Agents need help with debugging order. They can reason through a system, but I want the product to make "am I looking at the right system?" the first move.