Daniel Chu / notes from real workflows
Notes from building AI tools that have to survive real workflows.
I write about the awkward gap between AI demos and work that has users, deadlines, data, quality bars, support paths, and people who need the system to behave.
Some notes are polished. Some are field observations from things I am still trying to understand: agents, tools, workflows, evals, recovery paths, and product judgment.
Current focus: multi-agent development systems, workflow evals, and recovery paths for AI actions.
Threads
Problems I am working through.
Agent tool use
When an agent says it used a tool, how do we know?
The prose can sound grounded even when execution did not happen.
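One way I think about this: trust the execution log, not the transcript. A minimal sketch, with all names (`ToolLogger`, `run`, `verify`) as illustrative assumptions rather than any real agent framework:

```python
# Hypothetical sketch: check an agent's tool-use claims against an
# execution log, instead of trusting the prose in its transcript.
from dataclasses import dataclass, field


@dataclass
class ToolLogger:
    """Records a tool name only when the tool actually runs."""
    calls: list = field(default_factory=list)

    def run(self, name, fn, *args, **kwargs):
        result = fn(*args, **kwargs)
        self.calls.append(name)  # appended only after real execution
        return result

    def verify(self, claimed: list) -> list:
        """Return tool names the transcript claims but the log never saw."""
        return [c for c in claimed if c not in self.calls]


log = ToolLogger()
log.run("search", lambda q: ["hit"], "query")

# The agent's prose claims it also wrote a file; the log disagrees.
print(log.verify(["search", "write_file"]))  # -> ['write_file']
```

The point is not the wrapper itself but where the evidence lives: the claim is checked against something the model cannot narrate into existence.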
Generalized tools
Why does giving agents more power sometimes make them less reliable?
How narrow should a tool be before it stops being useful?
Workflow evals
What would an eval look like if it tested the real workflow?
Many evals test chat, while product behavior lives in files and side effects.
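A rough sketch of what I mean, assuming a task whose real deliverable is a file on disk. The task shape and file layout here (`report.json`, a `summary` key) are assumptions for illustration, not a fixed harness:

```python
# Minimal workflow eval: instead of grading the chat reply, assert on
# the side effects the task was supposed to produce.
import json
import tempfile
from pathlib import Path


def eval_workflow(agent_task) -> dict:
    workdir = Path(tempfile.mkdtemp())
    agent_task(workdir)  # the agent runs against a real directory

    report = workdir / "report.json"
    checks = {"report_exists": report.exists(),
              "report_parses": False,
              "has_summary": False}
    if checks["report_exists"]:
        try:
            data = json.loads(report.read_text())
            checks["report_parses"] = True
            checks["has_summary"] = "summary" in data
        except json.JSONDecodeError:
            pass
    return checks


# A stand-in "agent" that actually produces the artifact.
def fake_agent(workdir: Path):
    (workdir / "report.json").write_text(json.dumps({"summary": "ok"}))


print(eval_workflow(fake_agent))  # every check should be True
```

An agent that writes a convincing chat message but never touches the directory fails every check, which is exactly the gap this thread is about.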
Recovery paths
If an AI system writes data, what should undo and audit look like?
How much recovery should be automatic before it becomes another risk?
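The smallest version of this I keep coming back to: every AI-initiated write goes through a journal that records the prior value, so any action can be audited and reverted. The store and journal shapes below are assumptions, not a specific product:

```python
# Hedged sketch: a key-value store where writes are journaled with the
# old value, and undo is itself a journaled action (the trail stays
# append-only so the audit history survives the recovery).
class JournaledStore:
    def __init__(self):
        self.data = {}
        self.journal = []  # entries: (key, previous_value, actor)

    def write(self, key, value, actor="agent"):
        self.journal.append((key, self.data.get(key), actor))
        self.data[key] = value

    def undo(self, journal_index):
        """Revert the write recorded at journal_index."""
        key, old, _actor = self.journal[journal_index]
        self.journal.append((key, self.data.get(key), "undo"))
        if old is None:
            self.data.pop(key, None)  # key did not exist before the write
        else:
            self.data[key] = old


store = JournaledStore()
store.write("status", "open")
store.write("status", "closed")
store.undo(1)  # revert the second write
print(store.data["status"])  # -> 'open'
```

Note that `undo` appends rather than pops: the recovery action leaves its own audit entry, which is one answer to "how much recovery before it becomes another risk."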
AI-native messaging
What if people, files, events, and emails were structured context instead of attachments?
The hard part is not the entity card. It is deciding who owns the context.
Field notes
Essays and drafts from building, breaking, and thinking in public.
AI Features Are Not the Same as AI Leverage
Observation: When an agent says it used a tool, that is not evidence
Failure mode: AI agents debug symptoms before systems
Experiment: When more powerful tools make agents worse
Evaluation: Evals should test the workflow, not the demo
Design note: If an AI can write data, it needs a recovery path
Open questions
These are not conclusions. They are questions I am still working through.
- What should an AI system refuse to automate?
- When does flexibility become unreliability?
- How should agents explain uncertainty without creating more work?
- What does a good recovery path look like for AI-generated actions?
- What kinds of constraints make agents more dependable?
- How do you evaluate work when the task itself is ambiguous?
Artifacts
Small diagrams, sketches, and work-in-progress fragments.
Compare notes