I Designed a Human-Less Agile Team with Claude and Shipped a Production App in Hours
Most people building with AI agents are automating single tasks. I wanted to automate the whole team.
Over the past few months I've been building ai-dev — a multi-agent orchestration system that acts as a complete engineering team. It takes a project idea from raw description to production-ready code, with full PR review cycles, dependency management, and automated feedback loops. No human writes a line of implementation code. I play one role: architect and director.
The result: I used it to ship a full vocabulary-learning app — authentication, Claude-powered word generation, push notifications, dashboard UI — from a blank repo to a polished, invite-only beta. The agents handled all of it. I handled the architecture, the constraints, and the final review.
The Problem I Was Actually Solving
Building software at speed isn't a coding problem. It's a coordination problem. You can write code fast, but you still need someone to spec features, another person to implement them, a reviewer to catch regressions, and a process to handle feedback. Even solo, that context-switching is expensive.
I wanted to compress that entire loop. Not by using AI as a smarter autocomplete, but by designing an actual team structure where each role is played by a specialized agent with defined responsibilities and real handoff boundaries.
The insight was that the coordination overhead — not the typing — is what slows software teams down. If I could eliminate coordination cost while preserving review quality, I'd have something genuinely different from "AI writes code."
The Architecture
ai-dev is a Django application running on PostgreSQL with Celery + Redis for async task orchestration. GitHub integration is first-class. There are two agent roles: a Tech Lead powered by Claude Opus 4.6, and Dev Agents powered by Claude Code.
Here's how the system is structured:
┌─────────────────────────────────────────────────────────┐
│                     USER / DIRECTOR                     │
│         (architecture decisions, product scope,         │
│              approval gates, final review)              │
└────────────────────────┬────────────────────────────────┘
                         │ idea + constraints
                         ▼
┌─────────────────────────────────────────────────────────┐
│                     TECH LEAD AGENT                     │
│                    (Claude Opus 4.6)                    │
│                                                         │
│  - Asks clarifying questions via conversation loop      │
│  - Generates versioned TechSpec (architecture doc)      │
│  - Decomposes into DevTasks with dependency graph       │
│  - Tools: list_tasks, upsert_task, update_tech_spec,    │
│    update_project_status                                │
└────────────────────────┬────────────────────────────────┘
                         │ approved spec + task backlog
                         ▼
┌─────────────────────────────────────────────────────────┐
│                PROJECT MANAGER (Celery)                 │
│                                                         │
│  - Polls for pending tasks with no unresolved blockers  │
│  - Claims available workspace (atomic lock)             │
│  - Dispatches Dev Agent per task                        │
└──────┬──────────────────────────────────────┬───────────┘
       │                                      │
       ▼                                      ▼
┌─────────────┐                        ┌─────────────┐
│  DEV AGENT  │                        │  DEV AGENT  │
│  (Claude    │          ...           │  (Claude    │
│  Code CLI)  │                        │  Code CLI)  │
│             │                        │             │
│  branch →   │                        │  branch →   │
│  implement →│                        │  implement →│
│  commit →   │                        │  commit →   │
│  push → PR  │                        │  push → PR  │
└──────┬──────┘                        └──────┬──────┘
       │                                      │
       └──────────────┬───────────────────────┘
                      │ PR opened on GitHub
                      ▼
┌─────────────────────────────────────────────────────────┐
│                 GITHUB WEBHOOK HANDLER                  │
│                                                         │
│  PR comment received → claims workspace → checks out    │
│  branch → Dev Agent addresses feedback → pushes update  │
└─────────────────────────────────────────────────────────┘
The database is the coordination layer. Agents communicate through task state, not message queues or shared memory. A task moves through pending → in_progress → pr_open → done. Blockers are a many-to-many relationship on the DevTask model. The Project Manager only dispatches tasks where all blockers have reached done. That's the entire dependency system.
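To make that concrete, here's a minimal sketch of what the coordination model could look like in Django. DevTask, blocked_by, and the four statuses come from the system as described above; the remaining fields and the dispatch query are illustrative, not the production code.

```python
# Minimal sketch of the coordination model. DevTask, blocked_by, and the four
# statuses are from the system described above; other fields are illustrative.
from django.db import models


class DevTask(models.Model):
    STATUS_CHOICES = [
        ("pending", "Pending"),
        ("in_progress", "In progress"),
        ("pr_open", "PR open"),
        ("done", "Done"),
    ]

    title = models.CharField(max_length=255)
    priority = models.IntegerField(default=0)
    implementation_hints = models.TextField(blank=True)
    status = models.CharField(max_length=20, choices=STATUS_CHOICES, default="pending")
    # A task cannot start until every task in blocked_by has reached "done".
    blocked_by = models.ManyToManyField(
        "self", symmetrical=False, blank=True, related_name="blocks"
    )


def dispatchable_tasks():
    """Pending tasks whose blockers, if any, are all done."""
    return DevTask.objects.filter(status="pending").exclude(
        blocked_by__status__in=["pending", "in_progress", "pr_open"]
    )
```

The useful property is that "ready to dispatch" is a single query, so nothing ever has to hold the dependency graph in memory.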
How Each Agent Actually Works
The Tech Lead runs an agentic conversation loop using Claude Opus 4.6. It receives the project description, asks clarifying questions until it has enough signal, then writes a TechSpec — a structured markdown architecture document that becomes the README of the generated GitHub repo. From there, it calls upsert_task repeatedly to build the backlog, assigning priorities, implementation hints, and blocked_by references. The human approves the spec before anything runs.
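For readers who want to build something similar, here's a rough sketch of that loop using the Anthropic Python SDK. The tool names are the real ones listed above; the schemas, the handler table, and the model id placeholder are assumptions, not ai-dev's actual code.

```python
# Rough sketch of the Tech Lead's tool loop. Tool names are real; schemas,
# handler table, and model id are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()
MODEL = "<opus-model-id>"  # placeholder for whichever Opus model the Tech Lead runs on

TOOLS = [
    {
        "name": "upsert_task",
        "description": "Create or update a DevTask in the backlog.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "priority": {"type": "integer"},
                "implementation_hints": {"type": "string"},
                "blocked_by": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["title"],
        },
    },
    # list_tasks, update_tech_spec, update_project_status are defined the same way
]


def run_tech_lead_turn(messages, handlers):
    """Run model turns until it stops calling tools. `handlers` maps tool name -> callable."""
    while True:
        response = client.messages.create(
            model=MODEL, max_tokens=4096, tools=TOOLS, messages=messages
        )
        if response.stop_reason != "tool_use":
            return response  # plain text: a clarifying question or a wrap-up

        messages.append({"role": "assistant", "content": response.content})
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(handlers[block.name](**block.input)),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```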
The Dev Agents are stateless. Each one receives a task, claims a workspace using select_for_update(skip_locked=True) to prevent race conditions, and invokes claude --print with a carefully constructed prompt that includes the full TechSpec and a SKILL.md workflow guide injected into the repo. That guide tells the agent exactly how to work: how to structure its branches, how to write commit messages, when to ask versus when to decide. The agent creates a branch, implements the feature, commits, pushes, and opens a PR. The PR URL gets parsed from the Claude Code output and saved back to the DevTask.
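Here's roughly what that claim-and-dispatch step can look like. select_for_update(skip_locked=True) and the claude --print invocation are from the real system; the Workspace model, its fields, and the prompt assembly are hypothetical stand-ins.

```python
# Sketch of claim-and-dispatch. select_for_update(skip_locked=True) and
# `claude --print` are from the real system; Workspace and the prompt layout
# are hypothetical stand-ins.
import subprocess

from django.db import models, transaction


class Workspace(models.Model):  # hypothetical: a pre-provisioned checkout directory
    path = models.CharField(max_length=255)
    is_busy = models.BooleanField(default=False)


def claim_workspace():
    """Atomically claim a free workspace; returns None if the pool is exhausted."""
    with transaction.atomic():
        ws = (
            Workspace.objects.select_for_update(skip_locked=True)
            .filter(is_busy=False)
            .first()
        )
        if ws is None:
            return None
        ws.is_busy = True
        ws.save(update_fields=["is_busy"])
        return ws


def run_dev_agent(workspace, task, tech_spec):
    """One stateless Dev Agent run: build the prompt, shell out to Claude Code."""
    prompt = f"{tech_spec}\n\n# Task\n{task.title}\n\n{task.implementation_hints}"
    result = subprocess.run(
        ["claude", "--print", prompt],
        cwd=workspace.path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout  # parsed downstream for the PR URL
```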
The feedback loop is the part I'm most proud of. GitHub webhooks fire when a reviewer leaves a PR comment. A Celery task picks it up, claims a workspace, checks out the relevant branch, and invokes the Dev Agent again — this time with the PR_COMMENT_SKILL.md guide and the full comment thread as context. A regex heuristic classifies whether the comment requires a code change or just a reply. If changes are needed, the agent commits and pushes. No human routing required.
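The heuristic itself is nothing exotic. Here's an illustrative version; the keyword list is an assumption, and the patterns ai-dev actually uses differ.

```python
# Illustrative change-request heuristic. The keyword list is an assumption.
import re

CHANGE_REQUEST_PATTERNS = re.compile(
    r"\b(please\s+(fix|change|update)|can\s+you|should\s+(be|use)|needs?\s+to|"
    r"typo|bug|instead\s+of|rename|refactor|remove|add)\b",
    re.IGNORECASE,
)


def requires_code_change(comment_body: str) -> bool:
    """True if the PR comment reads like a change request rather than a question."""
    return bool(CHANGE_REQUEST_PATTERNS.search(comment_body))
```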
The Vocabulary Builder: A Real Test
I used ai-dev to build a complete vocabulary-learning app. Users select word categories, get daily curated words with Claude-generated definitions and example sentences, and receive push notifications. The stack I specified: Next.js 14 with App Router, Prisma ORM, NextAuth.js, Tailwind CSS, Anthropic SDK, and Web Push API.
I gave the Tech Lead that brief. It came back with clarifying questions — should words be cached or generated fresh each time? Is the notification delivery time user-configurable? How should the invite-only beta be gated? My answers shaped the TechSpec. I approved it.
The agents then executed 7 tasks in dependency order:
- Project scaffolding and base configuration
- Database schema and Prisma models
- NextAuth.js authentication with email/password
- Category selection API and UI
- Claude word generation service
- Dashboard UI with WordCard component
- Web Push notification system with node-cron scheduler
Every task opened a PR. Every PR went through review. One PR (the word generation service) received feedback from me — the agent addressed it, committed the fix, and the PR merged cleanly. 100% merge rate, zero aborted tasks, zero blocking issues.
From blank repo to production-ready beta: the agents accumulated over 9,000 lines of changes across 63 files, including full Docker configuration for deployment. The active agent execution time was a matter of hours. The calendar time was longer: the workflow has deliberate human approval gates, because that's the point. I'm not trying to remove judgment from the loop. I'm trying to remove the mechanical execution that doesn't require it.
You can try the app at vocab.daisyhuang.dev and follow the build at github.com/dhuang-nyc/vocabulary-builder.
What Makes This Different From "AI Writes Code"
The distinction that matters is roles versus tools. Using Claude to autocomplete a function is using AI as a tool. This is using AI as a role-holder with defined responsibilities, handoff protocols, and feedback accountability.
The Tech Lead doesn't implement anything. The Dev Agents don't make architectural decisions. The dependency graph enforces sequencing without anyone managing it manually. The webhook integration means a reviewer's comment triggers automated remediation without anyone scheduling a follow-up. These are the properties of a functioning team, not a faster IDE.
The other thing that matters: stateless agents over long-running agents. Each Dev Agent invocation is a subprocess call. It reads current state from the database, does its work, writes results back, and exits. There's no shared session, no agent "memory" that can drift. The system is reproducible and debuggable in ways that persistent agent sessions typically aren't.
My Role in This System
I am the architect and director. I decide the stack. I define the constraints — "invite-only beta," "distraction-free word-per-day model," "SQLite for simplicity, not PostgreSQL." I approve the TechSpec before a single line runs. I review PRs when I want to. I handle the final polish: README updates, feature flag decisions, deployment configuration.
What I don't do: write application code, manage branches, handle PR logistics, track which tasks depend on which, or chase down review feedback. The system handles all of that.
That's the right division. Not "AI does everything" — that produces low-quality output with no accountability. And not "AI assists me" — that's just a better editor. The right frame is: I provide taste, judgment, and architectural direction. The agents provide execution at a pace and consistency I can't match alone.
What I'd Do Differently
Workspace management is the current bottleneck. The pool of isolated dev environments is fixed-size, which means task parallelism is capped by how many workspaces are available. Spinning up ephemeral environments on demand — rather than pre-allocating a pool — would unlock true parallel execution across large task backlogs.
The change-request heuristic (regex keyword matching to detect whether a comment requires code changes) is also brittle. It works well in practice but trips on ambiguous comments. A small classifier call before invoking the Dev Agent would be more reliable at negligible cost.
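Something like a single-word classification call would do it. The prompt wording and the model id placeholder in this sketch are assumptions:

```python
# Sketch of the classifier that could replace the regex heuristic: one cheap,
# single-word classification before dispatching the Dev Agent.
import anthropic

client = anthropic.Anthropic()


def comment_requires_change(comment_body: str) -> bool:
    response = client.messages.create(
        model="<small-model-id>",  # placeholder: any small, inexpensive model
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                "Does this PR review comment require a code change, or only a reply?\n"
                "Answer with exactly one word: CHANGE or REPLY.\n\n" + comment_body
            ),
        }],
    )
    return response.content[0].text.strip().upper().startswith("CHANGE")
```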
Who This Is For
Fellow AI engineers can fork this pattern. The architecture — DB-as-coordination-layer, atomic workspace claims, skill-file injection, webhook-driven feedback — generalizes to any multi-agent system where you want durable state, reproducible execution, and real review cycles. The specific stack (Django, Celery, Claude Code) is a choice, not a requirement.
For engineering leaders and founders: the implication isn't "fire your engineers." It's "what does leverage look like when one person with strong architectural judgment can direct a parallel execution pipeline?" The answer is: you can build faster, review more carefully, and ship more reliably than a small team under sprint pressure. The constraint shifts from velocity to taste.
What's Next
I'm watching how far the Tech Lead's planning capability can stretch. Right now it handles single-app scopes well. Multi-service architectures — where the TechSpec needs to coordinate across a backend API, a frontend, and an async worker — add complexity to the dependency graph that I'm working through. The interesting question isn't whether the agents can execute across services. They can. It's whether the planning agent can reason about cross-service contracts reliably enough to generate tasks that don't conflict.
That's the next problem worth solving.
Work With Me
I consult on AI system architecture and agentic workflow design for engineering teams and startups in New York City and remotely. If you're thinking about how to build with agents rather than just with AI tools, reach out at daisyhuang.dev.