Copilot Mobile Multi-Agent Development System

I created and shipped a production multi-agent development system for Microsoft Copilot Mobile. The goal is simple to say and hard to do: make mobile development more accessible while keeping quality, review, and recovery visible.

This page stays at the level I can discuss publicly. I am not sharing Microsoft internal architecture, prompts, private code paths, review heuristics, customer data, or implementation details that belong to the product.

What I can talk about is the shape of the work: I was the original builder, I pioneered the eval-driven setup, and I am still an active developer of the system.

What changed

Before this, many useful product and UX changes still depended on an engineer to start, translate, or carry the implementation. That slows down PMs and designers, but it also slows down engineers. An iOS developer may understand the product change and still have to wait on Android context, or the other way around.

The system gives people a safer path into Copilot Mobile development. PMs and designers can use it directly, with engineering review. Mobile engineers can use it to move across platform boundaries. The point is not to remove engineering judgment. It is to make more of the work executable, reviewable, and testable.

The first strong signal was adoption. In its first month, PM and Design teammates used the system directly to create and merge a meaningful volume of production PRs with engineering review, enough that review capacity became the next bottleneck.

What the system enabled

The clearest shift is that development no longer has to start with the one person who already knows the exact platform, code path, or implementation routine. A PM responsible for measurement can make instrumentation and metrics changes directly. A designer can carry product details closer to the final implementation. An engineer can move faster in an unfamiliar part of the mobile stack.

This only works if the system treats review as part of the product, not as a ceremonial last step. When the review queue became the next bottleneck, I built an auto-review layer that evaluates the risk of a PR and recommends the depth of review it needs. The public lesson is that agent systems do not end at code generation. They have to carry the change through verification, risk assessment, and recovery.
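To make the idea of mapping PR risk to review depth concrete, here is a purely illustrative Python sketch. It is not the real auto-review layer and does not reflect its heuristics: the `PullRequest` fields, the `SENSITIVE_PREFIXES` paths, the score weights, and the thresholds are all hypothetical, chosen only to show the general shape.

```python
from dataclasses import dataclass

# Hypothetical path prefixes treated as higher risk (assumption for illustration).
SENSITIVE_PREFIXES = ("auth/", "payments/")


@dataclass
class PullRequest:
    """Minimal, made-up description of a PR for this sketch."""
    files_changed: list[str]
    lines_changed: int
    has_tests: bool


def recommend_review_depth(pr: PullRequest) -> str:
    """Return 'light', 'standard', or 'deep' from a simple additive risk score."""
    score = 0

    # Larger diffs are harder to review safely.
    if pr.lines_changed > 200:
        score += 2
    elif pr.lines_changed > 50:
        score += 1

    # Touching a sensitive area raises the risk regardless of size.
    if any(path.startswith(SENSITIVE_PREFIXES) for path in pr.files_changed):
        score += 2

    # Missing tests means the reviewer carries more of the verification.
    if not pr.has_tests:
        score += 1

    if score >= 3:
        return "deep"
    if score >= 1:
        return "standard"
    return "light"
```

The point of a sketch like this is that the recommendation is computed before a human opens the diff, so reviewer attention can go to the changes that need it most. A real system would likely use richer signals than a handful of hand-picked rules.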

What I am learning

  • Useful agents need evals before they need more autonomy.
  • Cross-functional development works when the handoff is visible, not hidden.
  • Review capacity becomes a product constraint once many more people can create PRs.
  • Quality improves when the system makes risk legible before a human reviewer opens the diff.
  • Enablement is real only when people can ship safely outside their usual comfort zone.

Why eval-driven mattered

My main contribution was not just wiring agents together. I pioneered an eval-driven approach for setting up the system: using evals to shape agent roles, test real workflows, catch regressions, compare approaches, and decide when a path was reliable enough to trust.

That changed the development posture. Instead of asking whether the agent sounded plausible, I wanted to know whether the workflow held up: did the right files change, did the review surface make sense, did the system expose risk, and could a person recover if something went wrong?
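As a rough illustration of testing the workflow rather than the output's plausibility, the questions above can be expressed as explicit checks. This is a minimal sketch under stated assumptions, not the actual eval setup: `WorkflowRun`, `WorkflowEval`, and their fields are hypothetical names, and the harder question of whether the review surface makes sense is left out because it does not reduce to a simple boolean.

```python
from dataclasses import dataclass


@dataclass
class WorkflowRun:
    """Hypothetical record of one agent workflow run."""
    changed_files: set[str]
    risk_notes: list[str]      # risks the system surfaced to the reviewer
    revert_succeeded: bool     # whether a trial rollback of the change worked


@dataclass
class WorkflowEval:
    """Hypothetical eval case: what a correct run should look like."""
    expected_files: set[str]

    def check(self, run: WorkflowRun) -> dict[str, bool]:
        # Each key mirrors one question asked of the workflow.
        return {
            "right_files_changed": run.changed_files == self.expected_files,
            "risk_exposed": len(run.risk_notes) > 0,
            "recoverable": run.revert_succeeded,
        }


def passes(results: dict[str, bool]) -> bool:
    """A run passes only if every workflow check holds."""
    return all(results.values())


# Example usage with made-up data.
case = WorkflowEval(expected_files={"metrics/events.kt"})
good_run = WorkflowRun({"metrics/events.kt"}, ["touches telemetry schema"], True)
bad_run = WorkflowRun({"metrics/events.kt", "app/Main.kt"}, [], True)

good_results = case.check(good_run)
bad_results = case.check(bad_run)
```

Keeping each check as a named boolean makes a regression easy to read: a failing run shows which part of the workflow broke, not just that something did.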

  • Evals: Evals should test the workflow, not the demo
  • Product thinking: AI Features Are Not the Same as AI Leverage
  • Trust: If an AI can write data, it needs a recovery path
  • Tool design: When more powerful tools make agents worse