Atlas note

Self-improving agents need evidence, not just autonomy

If every improvement starts with a human naming the failure, the system is not self-improving in the sense I care about. It is human-improved with agent assistance.

One Atlas failure that changed my thinking was not a bad answer.

Voya, my travel agent, sent a reservation email before I had explicitly approved the final message. The right behavior was not subtle: show the full To, Subject, and Body, then wait for an exact send approval. A reminder or follow-up task should be allowed to draft an email. It should not silently become permission to send one.

That kind of failure feels different from weak prose. A bad answer can mislead. A bad action can change the outside world.

The failure surface is wider than chat quality. Once an agent can act, observability has to cover actions, side effects, and missing confirmations too.

Bad answer

Looks plausible, but the grounding is missing.

A finance answer cites the right kind of command but does not actually use the source of truth.

Bad action

Something happens before the contract is satisfied.

A travel email is sent before I approve the exact draft. A mutation times out after work may already have happened.

For a while, my improvement loop started with me noticing those moments and turning them into issues.

Once the issue existed, the rest of the machine could move. The issue could become an eval. The eval could drive a fix. The fix could be tested, deployed, and replayed against the live system.

That loop was useful. It was also more human-bottlenecked than I wanted to admit.

The loop that worked after I named the failure

Atlas is a personal AI system that runs real workflows for my family and for me. It handles finance, health, travel, coaching, career tracking, and operational automation. It runs on an Azure VM, talks to us through Telegram, reads real data, and has to behave consistently enough that we can trust it.

It also has a fairly serious eval and deployment loop. When something becomes concrete enough, the path is clear:

Capture the failure.
Turn it into a GitHub issue or durable artifact.
Write a knowledge eval and a workflow/action eval.
Fix the code, prompt, skill, or workflow.
Run the targeted evals and adjacent regression tests.
Deploy.
Replay the behavior on the live system before calling it shipped.

The weak part was the first step.

"Capture the failure" cannot mean "I remember something felt wrong." It has to mean the system leaves enough evidence to reconstruct the broken product contract.

This is where I had been too shallow about observability. I was thinking too much about events and not enough about the relationships between events.

A log can tell me that Voya sent an email. That is the easy part. The useful trace has to answer a harder question: did this external side effect have a valid causal path from a human approval of the same payload?

If the answer is no, I do not want another dashboard panel. I want an improvement candidate with a broken invariant attached to it.

For action-taking agents, observability has to prove the contract around the side effect, not just record that the side effect happened.

Draft branch: Agent prepares a candidate email.

Internal planning can revise, retry, and summarize. None of that should create permission to mutate the outside world.

Human contract: Approval attaches to one immutable payload.

The approval is not a general blessing to "handle the trip." It is a typed edge from a person to a specific action version.

Boundary check: The side effect is blocked if the edge is missing.

The failure is not "email sent." The failure is "send transition fired from a branch with no approved predecessor."

Replay contract: The same trigger should now stop at draft-only.

The regression is a replay of the action path, not a unit test that merely confirms a confirmation string exists.
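A minimal sketch of that boundary check, in Python. This is an illustration under stated assumptions, not Atlas's actual code: the names payload_digest, ApprovalGraph, send_email, and the transport callable are all hypothetical. The idea is that an approval binds to a hash of the exact To, Subject, and Body, so any later edit to the draft invalidates it.

```python
import hashlib
import json
from dataclasses import dataclass, field


def payload_digest(to: str, subject: str, body: str) -> str:
    """Hash the exact payload so an approval binds to one immutable version."""
    canonical = json.dumps({"to": to, "subject": subject, "body": body}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class MissingApprovalError(Exception):
    """Raised when a send is attempted with no approval edge for that payload."""


@dataclass
class ApprovalGraph:
    # Hypothetical store of (approver, payload digest) edges from explicit approvals.
    edges: set = field(default_factory=set)

    def approve(self, approver: str, digest: str) -> None:
        self.edges.add((approver, digest))

    def has_edge(self, digest: str) -> bool:
        return any(d == digest for _, d in self.edges)


def send_email(graph: ApprovalGraph, to: str, subject: str, body: str, transport) -> str:
    """Refuse the side effect unless this exact payload has an approved predecessor."""
    digest = payload_digest(to, subject, body)
    if not graph.has_edge(digest):
        raise MissingApprovalError(f"no approved predecessor for payload {digest[:12]}")
    transport(to, subject, body)  # the only line that touches the outside world
    return digest
```

A replay-style regression would drive the reminder path end to end and assert that transport is never called until approve has recorded an edge for the same digest.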

The liquidity example had the same shape. The bug was not that the system forgot to print a number. It was that source confidence got flattened before the answer reached me. Confirmed cash, estimated card debt, stale balances, and current transactions were allowed to collapse into one clean-looking summary. The observability fix was to keep provenance and freshness attached long enough for the answer layer to say, "this aggregate is not trustworthy yet."
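One way to keep provenance and freshness attached is to carry them on every balance and let the summary report its own caveats. The sketch below is a hedged illustration: the Balance fields, the "confirmed" and "estimated" labels, and the two-day staleness window are assumptions, not the real Atlas schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Balance:
    name: str
    amount: float
    confidence: str   # assumed labels: "confirmed" or "estimated"
    as_of: datetime   # when the source last reported this value


def summarize(balances, now, max_age=timedelta(days=2)):
    """Aggregate balances without erasing source confidence or freshness."""
    caveats = []
    for b in balances:
        if b.confidence != "confirmed":
            caveats.append(f"{b.name}: {b.confidence}")
        if now - b.as_of > max_age:
            caveats.append(f"{b.name}: stale")
    return {
        "total": sum(b.amount for b in balances),
        "trustworthy": not caveats,  # False if any input is estimated or stale
        "caveats": caveats,
    }
```

The answer layer can then refuse to present the total as a clean number whenever trustworthy is False, and show the caveats instead.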

The long Telegram mutation was another version of the same problem. "I am working on it" is not a terminal state. A useful trace had to separate accepted, mutation started, outside state changed, user notified, and recovery path triggered. Otherwise a timeout could look like a backend nuisance while the user was left unsure whether anything had happened.
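The stages named above can be sketched as an explicit enum plus a classifier. This is an assumption-laden illustration: the Stage names mirror the list in the text, but treating "user notified" and "recovery triggered" as the terminal stages, and the result labels themselves, are my choices for the example.

```python
from enum import Enum


class Stage(Enum):
    ACCEPTED = "accepted"
    MUTATION_STARTED = "mutation_started"
    OUTSIDE_STATE_CHANGED = "outside_state_changed"
    USER_NOTIFIED = "user_notified"
    RECOVERY_TRIGGERED = "recovery_triggered"


# Assumption: only these stages leave the user knowing where things stand.
TERMINAL = {Stage.USER_NOTIFIED, Stage.RECOVERY_TRIGGERED}


def classify_trace(stages):
    """Say how far a mutation got and whether the user was left hanging."""
    seen = set(stages)
    if seen & TERMINAL:
        return "terminal"
    if Stage.OUTSIDE_STATE_CHANGED in seen:
        return "changed_but_user_not_told"
    if Stage.MUTATION_STARTED in seen:
        return "outcome_unknown"
    if Stage.ACCEPTED in seen:
        return "accepted_only"
    return "empty"
```

With this split, a timeout after MUTATION_STARTED shows up as "outcome_unknown" rather than as a generic backend error.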

That is the observability bar I care about here. I want the system to notice broken invariants: an external action without an approval edge, an aggregate that erased source confidence, a workflow that reached no user-visible terminal state, or an internal diagnostic that leaked into the answer. Those are the artifacts that can become evals and fixes without waiting for me to remember the bad moment.

The missing step was creating evidence

The hidden assumption in the old loop was that the failure would somehow become durable evidence first.

If I noticed a bad response and opened the issue, the loop worked. If a scheduled job failed loudly, the loop worked. If I explicitly told an agent to create a self-learning issue, the loop worked.

But a lot of agent failures do not fail loudly.

They look like a confusing answer in a chat. A stale assumption that survives because nobody challenges it. A progress message mistaken for delivery. A cron run that looks green because something was sent, even though the actual task outcome was missing. A tool error that becomes the user-facing answer instead of being retried, translated, or turned into an improvement candidate.

Or, more seriously, an agent takes an action before the human confirmation contract is satisfied.

The artifact does not need to be a postmortem. It needs enough structure for the next run to know what failed.

Observed

External-send transition fired from a trace branch with no approved predecessor.

Expected

The action runner refuses a side effect unless the approval graph contains the right edge.

Evidence

Trace links from draft branch to presentation, approval, boundary check, and provider receipt.

Regression

Replay the same reminder path and require draft-only behavior until approval exists.

That structure is important. The artifact is not just a log. It is a small record of the product contract that was broken.
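As a rough sketch, the four parts (observed, expected, evidence, regression) fit in a small record that can be serialized and checked for completeness. The class name ImprovementCandidate and its methods are hypothetical, chosen only for this example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ImprovementCandidate:
    observed: str                                  # what the system actually did
    expected: str                                  # the contract it should have honored
    evidence: list = field(default_factory=list)   # trace links or ids
    regression: str = ""                           # how to replay the failing path

    def missing_fields(self):
        """Names of empty fields; an empty list means the record is complete."""
        return [name for name, value in asdict(self).items() if not value]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)
```

A candidate with missing fields can still be stored. It just should not be promoted to an issue until the next run, or a human, fills in the gaps.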

The first observability failure was observability

At some point I got tired of being the person who had to say, "this is wrong, open an issue."

The next move seemed straightforward: let the system observe itself. Let it collect artifacts while it runs. Let nightly analysis look across those artifacts and ask:

What kept going wrong?
Which failures share a root cause?
Which agent needs a new rule, eval, or tool?
Which workflow needs a stronger delivery contract?
Which assumption kept recurring without being challenged?

The first version produced almost nothing useful.

Not because nothing was wrong. Because the system was not observable enough to learn from itself.

Later, the deployed miner found hundreds of signals across GitHub, git, cron, and logs. But the first live verification also found a boring bug: my local assumption about cron history was wrong. On the VM, OpenClaw required listing cron jobs first and then fetching each job's runs by id. Until that was fixed, cron-derived behavior could quietly disappear from the evidence stream.

That was a good reminder. The observability system itself needs observability. A green nightly job is not enough if the sources it mines are incomplete.
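A hedged sketch of both halves of that lesson follows. The callables list_jobs and fetch_runs are hypothetical stand-ins injected by the caller; they are not OpenClaw's actual API, which I am not reproducing here. The coverage check is the part that keeps a silent source from hiding behind a green run.

```python
def collect_cron_runs(list_jobs, fetch_runs):
    """Two-step fetch: list jobs first, then fetch each job's runs by id."""
    runs = []
    for job in list_jobs():
        runs.extend(fetch_runs(job["id"]))
    return runs


def silent_sources(signals_by_source, expected_sources):
    """Return expected sources that produced no signals at all."""
    return sorted(s for s in expected_sources if not signals_by_source.get(s))
```

If silent_sources returns anything, the nightly analysis should report the gap itself as a finding rather than summarizing an incomplete evidence stream.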

Logs are not the same as learning

One trap I hit was assuming that having logs meant having learning material.

It does not.

Logs are raw material. They can tell you what happened at one layer of the system. They usually do not tell you why it mattered.

A cron log can say a job finished. That does not mean the user got the thing they needed.

A tool log can say a command failed. That does not say whether the agent retried it, translated it, escalated it, or dumped the raw error into chat.

A sent-message record can prove an email left the system. It does not by itself say whether the user approved that exact draft.

Learning requires interpretation.

For Atlas, that meant creating a collector that can look across GitHub issues, commits, cron runs, logs, and durable action artifacts. The goal is not one issue per log line. That would just turn observability into noise.

The goal is to turn messy traces into a small number of improvement candidates with stable evidence keys.
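One possible shape for a stable evidence key: hash the normalized agent name and the broken invariant, then group signals by that key regardless of which source reported them. The signal fields ("agent", "invariant", "source") and the 16-character key length are assumptions for this sketch.

```python
import hashlib
from collections import defaultdict


def evidence_key(agent: str, invariant: str) -> str:
    """Stable key: same agent and invariant always map to the same candidate."""
    raw = f"{agent.strip().lower()}|{invariant.strip().lower()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def group_signals(signals):
    """Collapse many raw signals into a few candidates keyed by evidence_key."""
    candidates = defaultdict(list)
    for s in signals:
        candidates[evidence_key(s["agent"], s["invariant"])].append(s)
    return dict(candidates)
```

Because the key ignores the source, a cron run and a log line that point at the same broken invariant land in the same candidate instead of opening two issues.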

Manual noticing versus artifact loops

The practical difference is not that humans disappear from the loop. I still notice things. I still care about taste, usefulness, and whether an answer actually helped.

The difference is what happens after the noticing.

The goal is not to remove human judgment. The goal is to stop making human memory the only place failures persist.

Manual mode

I notice a failure.
I remember enough context to explain it.
I ask for an issue or fix.
The next similar failure depends on me noticing again.

Artifact mode

The run leaves observed and expected behavior.
The issue links to evidence and an agent/workflow target.
The fix must add knowledge and workflow/action evals.
Deployment is followed by live replay and recurrence checks.

That second path changes the system pressure. The agent is no longer allowed to improve only in prose. It has to leave behind a regression harness for the behavior that failed.

Green status needs a contract

One of the more subtle failures in agent systems is confusing delivery with completion.

A scheduled workflow can send a progress message. The delivery layer can mark that message as sent. The cron history can look green. But the actual final result might never arrive.

From the scheduler's perspective, something happened.

From the user's perspective, the job failed.

Agents make this more likely because they are good at producing plausible intermediate text. They can explain what they are doing. They can say they will follow up. They can produce a partial answer that looks enough like movement for the surrounding system to mark the run as successful.

A job should not be considered successful because it spoke. It should be considered successful because it produced the required artifact, summary, state change, or user-visible outcome.

For self-improvement, that means the observability layer has to watch for contract failures, not just technical failures.

Did the job return the expected terminal state?
Did it save the snapshot it claimed to save?
Did it log the activity in a place the next run can inspect?
Did it ask the user for data it could have retrieved itself?
Did it send, write, or mutate anything without the right confirmation?

Green is only meaningful when it is tied to the right contract.
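A minimal sketch of tying green to a contract, assuming hypothetical JobContract and RunRecord types. Note that the number of messages sent is recorded but deliberately ignored when deciding success.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobContract:
    required_terminal_state: str
    required_artifacts: frozenset


@dataclass
class RunRecord:
    terminal_state: Optional[str] = None
    artifacts: set = field(default_factory=set)
    messages_sent: int = 0
    unconfirmed_side_effects: int = 0


def contract_violations(run: RunRecord, contract: JobContract) -> list:
    """List every way the run fell short of its contract."""
    violations = []
    if run.terminal_state != contract.required_terminal_state:
        violations.append("missing required terminal state")
    for name in sorted(contract.required_artifacts - run.artifacts):
        violations.append(f"missing artifact: {name}")
    if run.unconfirmed_side_effects:
        violations.append("side effect without confirmation")
    return violations


def is_green(run: RunRecord, contract: JobContract) -> bool:
    # messages_sent is ignored on purpose: speaking is not completing.
    return not contract_violations(run, contract)
```

A run that sent three progress messages but never reached its terminal state comes back red, with the violations ready to attach to an improvement candidate.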

What I would build earlier next time

If I were starting over, I would build observability earlier.

Not dashboards first. Evidence first.

I would make every important agent workflow answer a few boring questions:

What does success mean here?
What evidence proves success?
What user reaction counts as a negative signal?
What tool outputs should become artifacts?
What stale source-of-truth conditions should be treated as failures?
What action requires explicit confirmation?
What terminal state should a scheduler require?

The compounding loop only works if the first step creates evidence.

Failure -> Artifact -> Issue -> Eval -> Fix -> Replay

I would also resist the urge to make every failure immediately actionable.

Some artifacts should simply accumulate. One weak response might be noise. Five weak responses with the same shape are a product bug.

That is another reason durable evidence matters. Humans are bad at remembering distributions. We remember the last painful example. The system needs to remember the pattern.
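Letting artifacts accumulate can be as simple as counting failure shapes and promoting only the ones that recur. In this sketch the shape strings are hypothetical labels, and the threshold of five just echoes the example above; it is not a tuned value.

```python
from collections import Counter


def shapes_to_promote(observed_shapes, threshold=5):
    """Return failure shapes seen at least `threshold` times, sorted by name."""
    counts = Counter(observed_shapes)
    return sorted(shape for shape, n in counts.items() if n >= threshold)
```

Everything below the threshold stays in the store as evidence, so a later recurrence can push it over the line without anyone having to remember the earlier cases.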

The part I keep coming back to

Autonomy without observability does not compound.

An agent can be powerful, fast, and useful. But if its failures do not become evidence, the improvement loop stays human-bottlenecked. You end up with a system that can execute fixes, but cannot reliably discover what needs fixing.

That is still useful. It is just not self-improving in the stronger sense.

The more agents I run, the more I think the hard part is not only giving them more tools. It is making their behavior legible enough that the system around them can learn.

Not every failure needs a meeting.

Not every mistake needs a postmortem.

Not every weak answer needs a human to stop everything and debug it.

But the important failures need somewhere to go.

Because if an agent fails and the system cannot see it, nothing improves. The user adjusts. The human compensates. The same thing happens again next week with slightly different words.

The goal is not agents that never fail.

The goal is agents whose failures leave enough evidence for the next version of the system to get better.