I'm open-sourcing Bottega, our internal coding agent orchestration tool

We shipped the 1000th user story with our internal agent orchestration tool last week. To celebrate, I'm open-sourcing it: Bottega repo

We've been using Claude Code for more than a year now. Velocity has greatly increased, code quality too.

For the past 8 months, 100% of our production code has been written by agents. The workflow proposed in the specs is the distillation of that. Humans own the plan and the review, agents do the rest.

I'm not trying to argue against hand-written code. We just found a workflow that reduced our lead time and increased our code quality. It works for us, in our context.

I'm sharing my own conclusion on what works and what doesn't.

We built this tool to formalize our current workflow. We wanted a tool that is minimalist, user-friendly, and easy to adapt.

A note before you read this — The End of the Token Subsidy Era

Anthropic just announced new pricing for Claude Code subscribers. The era of tokens that felt unlimited is slowly coming to an end.

This landed just as I was open-sourcing this project. I love the irony. But Bottega's orchestration layer is easy to extend to support multiple providers.

So following Anthropic's pricing announcement, we added support for Codex and OpenCode, alongside Claude Code. Through OpenCode we can now run open-source models (Kimi, DeepSeek, and others).

The added benefit: we can now mix and match models within a single task. One user story can run Claude Opus for the planning, Claude Sonnet for the implementation, Codex for the code review, and an open-source model to manage the PR: monitor the CI, fix conflicts, push to merge. Each step of the pipeline picks the model that fits.
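A minimal sketch of what per-step model selection could look like. This is not Bottega's actual configuration format: the step names, `StepConfig`, `resolve`, and the model labels are placeholders invented for illustration, not real model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepConfig:
    harness: str  # which coding agent CLI drives the step (hypothetical labels)
    model: str    # placeholder model label, not a real model ID


# Hypothetical mapping: one entry per pipeline step described above.
PIPELINE = {
    "plan": StepConfig(harness="claude-code", model="opus"),
    "implement": StepConfig(harness="claude-code", model="sonnet"),
    "review": StepConfig(harness="codex", model="codex-default"),
    "pr": StepConfig(harness="opencode", model="open-source-model"),
}


def resolve(step: str) -> StepConfig:
    """Return the harness/model pair assigned to a pipeline step."""
    try:
        return PIPELINE[step]
    except KeyError:
        raise ValueError(f"unknown pipeline step: {step!r}") from None
```

Keeping the mapping as plain data means swapping the model for one step is a one-line change, which is the property that matters when pricing shifts.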

Our conclusion: we figured out how to successfully work with a coding agent. That lesson is platform-agnostic.

And some good news

Back in February, we had a bug in Bottega: Claude subprocesses were defaulting to Sonnet 3.7. We ran like that for two full weeks and shipped dozens of tasks without noticing. Output quality was very similar to Opus 4.6.

With a tight enough harness and a rigorous enough process, the choice of model matters way less than people think.

For most web development tasks, the most powerful frontier models (Opus 4.7, GPT 5.5) are mostly a walking stick for a broken process. A solid development workflow alongside a robust harness gets us comparable velocity and quality with much less powerful models.

</unpopular opinion>

That being said, I also anticipate that token usage optimisation will become more and more prevalent.

Back to the initial post I had prepared...

What didn't work, and why we built this tool

We created this tool at the end of last year to address the issues we faced during our first 6 months of using coding agents.

Failure mode: PRs accumulate

The root cause was usually a combination of:

The final PR is okayish, but getting it to a mergeable state needs a lot of back & forth with the agent.
The PR looks ok but manual testing reveals issues, requiring again a lot of iteration with the agent before we can actually merge it.
The agent produces large PRs that are complex to review.

We decided to solve all of these problems at the planning stage. Before development starts.

This is not a novel idea. We have twenty years of Agile, XP, and BDD literature on the subject. We just realised that when applying this to agents, magic happens.

Concrete examples, co-authored between the human and the agent, are the cheapest place to surface disagreement before code is written.

The plan is the centerpiece of the agentic development cycle. The quality of the final output is directly correlated with the quality of this plan.

Our failure mode was treating the plan artifact as disposable:

We provided a description of the task (light prompt).
Claude Code wrote the plan into a temporary folder.
The plan existed only as a session-bound guideline.

A task is not a prompt. A task is a requirement with acceptance criteria.

The task itself, the requirement, and the technical specification must all coexist as enduring artifacts that live alongside the implementation, not transient inputs to a single session.
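To make "a task is a requirement with acceptance criteria" concrete, here is a rough sketch of a task stored as a durable artifact rather than a prompt. The class names, fields, and Markdown layout are all assumptions made for this example; they are not Bottega's schema.

```python
from dataclasses import dataclass, field


@dataclass
class AcceptanceCriterion:
    description: str
    done: bool = False


@dataclass
class Task:
    title: str
    requirement: str
    criteria: list[AcceptanceCriterion] = field(default_factory=list)

    def is_ready_for_planning(self) -> bool:
        """A bare prompt is not enough: require text and at least one criterion."""
        return bool(self.requirement.strip()) and len(self.criteria) > 0

    def to_markdown(self) -> str:
        """Render the task as a file that can be committed next to the code."""
        lines = [f"# {self.title}", "", self.requirement, "", "## Acceptance criteria"]
        for criterion in self.criteria:
            box = "x" if criterion.done else " "
            lines.append(f"- [{box}] {criterion.description}")
        return "\n".join(lines) + "\n"
```

Writing the rendered file into the repository is what turns the task from a session input into something a reviewer can diff against the implementation.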

Once the plan was detailed enough, and once the workflow ensured the agent rigorously executed it, we were finally able to produce PRs we could merge with zero, or minimal, back & forth. PRs stopped accumulating.

I think the term "autonomous coding agent" is extremely misleading. It focuses on time reduction rather than time investment. We had way better results once we shifted our focus to the human-in-the-loop (HITL) part of the process: where do we need to spend our time? At what stage of the workflow? In order to produce what?

It became quite obvious that we were wasting time at the end of the process: the PR stage naturally becomes a bottleneck. We created this tool to help us invest our time at the beginning of the process instead, and to make all downstream steps of the workflow as autonomous as possible.

What finally worked

We started to get great results once we finally managed to get Claude to behave just like a human developer.

Typical web-dev workflow:

Plan
Implement + unit tests
Manual testing locally
PR -> CI green
A team member reviews the PR, leaves feedback. The PR author (the agent) receives a notification & updates the PR.

Opinionated decisions we made along the way

Bottega is an attempt at automating this workflow in a simple and user-friendly manner.

As we ran more and more tasks in parallel, we decided that running this flow on a developer laptop made no sense. Why should the work stop if I close my laptop?

We quickly decided to set up this tool on a remote VPS.

Bonus 1: in terms of security, running these agents with skip-permission is less scary on a sandboxed VPS than on a local laptop.
Bonus 2: once the tool was accessible via a simple browser, everyone in the company started using it, not only developers (more on that later).

Plan as the first-class citizen

We use a plan template. A planning agent has a single goal: start from a task requirement and fill in the plan template with a detailed technical implementation plan, interviewing the user and asking questions along the way.

The developer reviews the plan.

My personal conclusion is that the quality of the final PR is almost entirely dependent on the quality of this plan.

Crafting this plan is time-consuming. I spend >50% of my time at this stage. The goal is to minimize the number of surprises at the PR stage. If the PR is just a simple reflection of the plan, it can usually be accepted right away.

This is a highly interactive stage.

Getting an implementation that matches the plan 100%

Once the plan is approved, an implementation agent executes it. As soon as it's done, an adversarial code review agent kicks in. Its job is to make sure the implementation strictly matches the plan and that no checkbox was silently skipped.

This is now a well-known concept: the Ralph Wiggum loop.

The two agents iterate until the reviewer is satisfied.
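The implement/review iteration can be sketched as a bounded loop. This is an illustration only, not Bottega's code: `implement` stands in for a call to the implementation agent, `review` for the adversarial reviewer, and the plan is reduced to a list of checkbox labels. The `max_rounds` cap is my own assumption, added so a stuck pair of agents cannot loop forever.

```python
from typing import Callable


def review(plan_items: list[str], completed: set[str]) -> list[str]:
    """Reviewer stand-in: list every plan item that was not actually done."""
    return [item for item in plan_items if item not in completed]


def run_until_approved(
    plan_items: list[str],
    implement: Callable[[list[str]], set[str]],
    max_rounds: int = 5,
) -> int:
    """Alternate implementer and reviewer; return the number of rounds used."""
    completed: set[str] = set()
    todo = list(plan_items)
    for round_number in range(1, max_rounds + 1):
        completed |= implement(todo)          # implementer works on what is outstanding
        todo = review(plan_items, completed)  # reviewer reports skipped checkboxes
        if not todo:
            return round_number
    raise RuntimeError(f"plan still unmet after {max_rounds} rounds: {todo}")
```

The reviewer always checks against the full plan, not against the previous round's feedback, so an item cannot quietly fall off the list between rounds.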

Manual testing

The plan must include manual testing scenarios, e.g.:

Playwright
curl
running a script
etc.

The agent runs these scenarios itself, just like a developer would before opening a PR.
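One way to picture plan-defined scenarios is as shell commands with an expected output, run before the PR is opened. The sketch below is an assumption about shape, not Bottega's runner: `Scenario`, `run_scenario`, and `failed_scenarios` are hypothetical names, and the example command is a trivial stand-in for a real curl call or Playwright script.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    command: list[str]      # argv to execute, e.g. a curl call or a test script
    expect_in_output: str   # substring that must appear in stdout


def run_scenario(scenario: Scenario, timeout: float = 60) -> bool:
    """Pass only if the command exits 0 and prints the expected text."""
    result = subprocess.run(
        scenario.command, capture_output=True, text=True, timeout=timeout
    )
    return result.returncode == 0 and scenario.expect_in_output in result.stdout


def failed_scenarios(scenarios: list[Scenario]) -> list[str]:
    """Names of the scenarios that did not pass, in plan order."""
    return [s.name for s in scenarios if not run_scenario(s)]


# Stand-in scenario; a real plan would list curl, Playwright, or script commands.
EXAMPLE_SCENARIOS = [
    Scenario("prints greeting", [sys.executable, "-c", "print('hello')"], "hello"),
]
```

An empty list from `failed_scenarios(EXAMPLE_SCENARIOS)` would be the signal to move on to the PR step; anything else goes back to the implementation agent.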

This step was the major unlock: over the past few months, almost all issues discovered after a PR was written were due to acceptance criteria that were overlooked or missed during the planning phase (cf. final quality of the PR = quality of the specification). Once we got this step right, we reached a level of quality similar to that of manual development. Software is rarely perfect: bugs happen, and most of the time they are the result of scenarios that were genuinely missed during the feature design phase.

PR management

The agent creates the PR, then iterates until there are no conflicts and CI is green.

If I leave a comment on GitHub, it triggers a new agent run via a GitHub callback. Again, the agent ensures CI stays green and resolves any new conflicts that appear in the meantime.
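The PR stage boils down to deciding, on each run, what the agent should do next. Here is a small sketch of that decision as a pure function. The `PrState` fields, the action names, and the priority order (conflicts first, then review comments, then CI) are my assumptions for the example; fetching the real state from GitHub is left out on purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrState:
    has_conflicts: bool
    ci_status: str             # assumed values: "pending", "failed", "passed"
    unaddressed_comments: int  # review comments the agent has not handled yet


def next_action(state: PrState) -> str:
    """Pick the next step for the PR agent given a snapshot of the PR."""
    if state.has_conflicts:
        return "resolve_conflicts"   # CI results on a conflicting branch are stale
    if state.unaddressed_comments > 0:
        return "address_comments"    # reviewer feedback changes the code anyway
    if state.ci_status == "failed":
        return "fix_ci"
    if state.ci_status == "pending":
        return "wait"
    return "ready_to_merge"
```

Each trigger (a new comment, a CI result, a change on the base branch) would rebuild the snapshot and call the function again until it returns `ready_to_merge`.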

Yet another orchestration tool

As we were working on this, a bunch of orchestration tools emerged. Variants of the same workflow we were converging on.

Conductor, by Melty Labs (YC S24).
GasTown, by Steve Yegge.
gstack, by Garry Tan.
SpecKit, by GitHub.

There is a lot of overlap with what we built. For us, this is a huge confirmation that we were on the right path.

Where Bottega differs:

Multi-harness. Bottega drives Claude Code, Codex, and OpenCode behind one interface, so you can assign a different model to each role on the same task.

Remote-first and multi-player. While you can run it on your laptop, Bottega is remote-first by design: we run it on a shared dev box. It supports multiple concurrent users out of the box. Side benefit 1: sandboxing autonomous agents on a remote server was easier for us than sandboxing them on each laptop. Side benefit 2: a lot of non-technical people use it internally.

Minimalist UX. The core ideas are super simple: we are just recreating the typical web developer workflow. And we wanted the tool to reflect that simplicity. Side benefit: easy to onboard the whole product team.

That's the core of it. The repo is here: github.com/vdaubry/bottega.

If you're building something similar, or you disagree with any of this, I'd love to hear from you.