Personal AI workflow and eval lab

Atlas

Atlas is my private AI workflow system for testing whether agents can use tools, preserve context, respect sensitive data boundaries, recover from mistakes, and keep scheduled work observable.

I do not publish Atlas source code or raw data. The useful public version is the builder evidence: the system shape, the failure modes, the eval gates, and the product decisions that made agent behavior easier to inspect.

The project started as CLI-first personal software and grew into a practical testbed for agentic workflows. The durable lesson has been that AI products are less about impressive one-off answers and more about operational reliability: source-of-truth discipline, safe writes, eval harnesses, workflow replay, recovery paths, and interfaces that make the right action easy for the agent.

Architecture

The public diagram below is intentionally simplified. It shows the product shape without exposing private schemas, credentials, raw records, or operational secrets.

How the pieces fit

The CLI is the product's spine. Agents are pushed toward commands instead of doing important reasoning from memory or from raw database queries. That makes behavior easier to test and gives the agent a smaller, more trustworthy surface.
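
To make the "commands over raw queries" idea concrete, here is a minimal sketch of a narrow command surface. It is illustrative only and not Atlas code: the program name `atlas-sketch`, the `tasks` subcommand, the `--status` flag, and the in-memory `_TASKS` list are all assumptions invented for the example.

```python
import argparse
import json

# Hypothetical in-memory data standing in for a private source of truth.
_TASKS = [
    {"id": 1, "title": "Renew passport", "status": "open"},
    {"id": 2, "title": "File expenses", "status": "done"},
]


def list_tasks(status=None):
    """Return tasks, optionally filtered by status."""
    return [t for t in _TASKS if status is None or t["status"] == status]


def build_parser():
    """Expose a small, fixed set of commands instead of free-form queries."""
    parser = argparse.ArgumentParser(prog="atlas-sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    tasks = sub.add_parser("tasks", help="List tasks from the source of truth")
    tasks.add_argument("--status", choices=["open", "done"])
    return parser


def run(argv):
    """Parse argv and return JSON text that an agent can read back."""
    args = build_parser().parse_args(argv)
    if args.command == "tasks":
        return json.dumps(list_tasks(args.status), sort_keys=True)
    raise SystemExit(f"unknown command: {args.command}")
```

Calling `run(["tasks", "--status", "open"])` returns the open tasks as JSON. The point of the shape is that the agent can only ask questions the command surface already knows how to answer, which is what makes the behavior testable.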

The Azure VM runs the always-on pieces: scheduled jobs, a daemon, health checks, and notifications. The VM is useful because it forces product questions: what should run automatically, what should require confirmation, and what should still work when the AI layer is unavailable?
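
One way to answer "what should still work when the AI layer is unavailable?" is to treat the AI step as optional inside a scheduled job. The sketch below assumes a hypothetical job called `run_daily_digest` with injected `fetch_items` and `summarize` callables; none of these names come from Atlas.

```python
from typing import Callable, Optional


def run_daily_digest(
    fetch_items: Callable[[], list],
    summarize: Optional[Callable[[list], str]] = None,
) -> dict:
    """Build a digest; fall back to a plain listing if the AI step fails."""
    items = fetch_items()
    # The deterministic result is built first, so it exists no matter what.
    result = {"count": len(items), "mode": "plain", "body": "\n".join(items)}
    if summarize is not None:
        try:
            result["body"] = summarize(items)
            result["mode"] = "ai"
        except Exception as exc:
            # The AI layer is optional: record the failure and keep the plain body.
            result["ai_error"] = repr(exc)
    return result
```

The returned `mode` and `ai_error` fields keep the degraded run observable: a health check or notification can report that the job succeeded without the AI layer, instead of the job failing silently.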

The eval harness exists because prompts alone were not enough. Atlas replays real failure shapes, checks source-of-truth behavior, and catches cases where the agent sounds right but uses the wrong path. Over time, the evals became as important as the agent itself.
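
A check for "sounds right but uses the wrong path" has to look at what the agent did, not what it said. The following is a rough sketch under assumptions: it presumes tool calls are recorded as a trace, and the `ToolCall` record, the `cli` and `sql` tool labels, and the `atlas tasks` command are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str      # hypothetical label, e.g. "cli" or "sql"
    argument: str  # the command line or query text that was run


def check_used_command_path(trace, required_command, forbidden_tools=("sql",)):
    """Return a list of failure messages; an empty list means the case passed."""
    failures = []
    ran_required = any(
        call.tool == "cli" and call.argument.startswith(required_command)
        for call in trace
    )
    if not ran_required:
        failures.append(f"never ran required command: {required_command}")
    for call in trace:
        if call.tool in forbidden_tools:
            failures.append(f"used forbidden tool: {call.tool}")
    return failures
```

A trace containing only `ToolCall("sql", "SELECT * FROM tasks")` fails on both counts even if the final chat answer happens to be correct, which is the case a transcript-only eval would miss.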

Product questions Atlas keeps raising

  • How do you make a tool the path of least resistance for an agent?
  • When should an AI ask for confirmation before writing data?
  • How do you evaluate behavior that happens in files or side effects, not chat?
  • What belongs in always-loaded context, and what belongs in a command?
  • How do you design recovery when the AI is the thing that failed?
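
Two of the questions above, confirmation before writes and recovery after a bad write, can be sketched together. This is one possible shape, not how Atlas does it: the `JournaledStore` class and its rule that only overwrites need confirmation are assumptions made for the example.

```python
class JournaledStore:
    """Key-value store that records prior values so writes can be undone."""

    _MISSING = object()  # sentinel meaning "key did not exist before the write"

    def __init__(self):
        self.data = {}
        self.journal = []  # stack of (key, previous_value) pairs

    def write(self, key, value, confirmed=False):
        # Assumed policy: creating a key is safe, overwriting needs confirmation.
        if key in self.data and not confirmed:
            raise PermissionError(f"overwrite of {key!r} needs confirmation")
        self.journal.append((key, self.data.get(key, self._MISSING)))
        self.data[key] = value

    def undo(self):
        """Revert the most recent write; return False if there is nothing to undo."""
        if not self.journal:
            return False
        key, previous = self.journal.pop()
        if previous is self._MISSING:
            self.data.pop(key, None)
        else:
            self.data[key] = previous
        return True
```

Because every write pushes the previous value onto the journal, recovery does not depend on the agent noticing or explaining its own mistake: `undo()` works the same whether the bad write came from a person or from the AI.
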

  • Tool use: When an agent says it used a tool, that is not evidence
  • Debugging: AI agents debug symptoms before systems
  • Tool design: When more powerful tools make agents worse
  • Evals: Evals should test the workflow, not the demo
  • Trust: If an AI can write data, it needs a recovery path