How We Actually Ship with AI
In our previous article, we shared a mindset: stop being your AI agent’s assistant. Stop validating every line. Stop babysitting. Instead, invest in your repository, make it the single source of truth, and let the agent work autonomously.
That article resonated, but it raised a fair question: what does the day-to-day actually look like?
This is that article. The concrete workflow we use to go from a vague idea to code running in production, with AI agents doing the heavy lifting.
We built CineMatch, an open-source movie-matching app, as a living practice repo. The patterns come from our production work, but CineMatch is where we demonstrate them publicly. Anyone can clone it, run it, and see the whole workflow in action.
The setup: one repo, any agent
Our team doesn’t use one tool. One of us works primarily in Claude Code. Another switches between opencode, Codex, and whatever ships that week. It doesn’t matter.
What matters is the repository. The agent conventions, the workflow descriptions, the development commands: they all
live in files that any coding agent can read. There’s a CLAUDE.md at the root
(symlinked to AGENTS.md) that describes the project, its
conventions, and the available commands. There are skill files that describe repeatable workflows.
There are justfile targets that wrap every development action into discoverable,
one-word commands.
We chose to keep things simple. A coding agent that can read and edit files, run shell commands, and browse the web is enough for our workflow. So we invest our time in the repo itself: clear conventions, good test coverage, fast feedback loops. That structure is what makes any agent effective, regardless of which tool runs it.
The repo surface, at a glance:

- CLAUDE.md (symlinked as AGENTS.md): conventions + commands
- justfile: discoverable targets
- .claude/skills/ (symlinked as .agents/skills/): reusable workflows
- docs/: specs, backlog, decisions
- just check · just dev: local feedback loop
- gh · aws CLIs: remote logs + debug
- Playwright · agent-browser: e2e in a real browser
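For a sense of scale, the root conventions file can start very small. A sketch (the sections, stack description, and commands here are illustrative, not CineMatch's actual CLAUDE.md):

```markdown
# CineMatch

Movie-matching app. Python backend, web frontend, PostgreSQL.

## Commands
- `just dev`: start the full local stack
- `just check`: lint + format + types + all tests (run before every commit)

## Conventions
- Conventional commits (`feat:`, `fix:`, `chore:`)
- Increment files live in `docs/backlog/`; move them with `git mv`
```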
Three moves
Our delivery workflow has three phases. Each one maps to a single command and a git operation that moves an increment file through a directory:
docs/backlog/todo/ → Idea becomes spec
docs/backlog/in-progress/ → Spec becomes working code
docs/backlog/done/ → Code becomes verified production
The increment file mirrors each stage. In todo/, it carries the ship criteria (5 items) and open uncertainties (2). In in-progress/, the phases appear with a test list each, boxes unchecked (☐ Phase 2 · Movie Cards, ☐ Phase 3 · Detail Modal). In done/, every phase is checked (✅) and all tests are green.
The increment file travels with the work, and git mv preserves its full history. At any point, anyone on the team (human or
agent) can open it and understand what’s happening, what’s been decided, and what’s left.
Move 1: Idea becomes spec
It usually starts with a sentence. Sometimes a half-formed thought, full of typos, spoken out loud to the agent:
“we need real movie data not the hardcoded list, something with posters and ratings, maybe trailers too, from tmdb”
The agent takes this and runs /backlog new tmdb-integration. What comes back is a structured increment file:
```markdown
# TMDB API Integration

## Goal
Replace static movie list with dynamic TMDB-powered movie catalog
with real posters, IMDB ratings, and trailers.

## Context
Current implementation uses a hardcoded static list of 4 movies.
Users need variety and rich movie data for an engaging swiping experience.

## Ship Criteria
- [ ] TMDB API client with caching
- [ ] Backend endpoints for fetching movies by region/provider
- [ ] Movie cards show: poster, title, year, IMDB rating, genre badges
- [ ] Detail modal: backdrop, full overview, trailer button
- [ ] "Swipe again" mode for fresh movies after exhausting current list

## Uncertainties
- [ ] Rate limits on TMDB free tier?
- [ ] Cache strategy (SQLite vs in-memory)?

## Implementation Plan
(filled in during Move 2)
```
What happened here is important. The agent didn’t just format our words. It made the implicit explicit. It read the existing codebase (the static movie list, the current card component, the database schema) and translated a vague idea into something actionable. It surfaced uncertainties we hadn’t thought to mention. It proposed ship criteria that reflect what “done” actually looks like.
This increment file is an artifact you can review. You can check how your intent was interpreted, what assumptions were made, and whether the scope feels right. It takes thirty seconds to scan. If something’s off, you say so and the agent adjusts.
Then it gets committed and pushed to main. Now the whole team sees it. Their agents see it too. If someone starts related work, the context is already there, not in a chat log that evaporated, but in git.
Move 2: Spec becomes working code
When we’re ready to start, the agent runs /backlog start. The increment file moves from todo/ to in-progress/, and
the real work begins.
First, the agent fills in the Implementation Plan. This is where the increment file becomes a detailed roadmap, broken into phases. Each phase describes the work to be done, the technical decisions involved, and a list of tests that define the expected behaviors:
```markdown
## Implementation Plan

### Phase 1: Backend TMDB Client ✅
- [x] Add TMDB API key to environment config
- [x] Create `backend/app/services/tmdb.py` client
  - `discover_movies(region, provider, page)` - fetch with filters
  - `get_movie_details(tmdb_id)` - full details + trailer
- [x] Cache responses in SQLite (24h TTL)
- [x] New endpoints:
  - `GET /api/v1/movies?region=US&page=1` - paginated movie list
  - `GET /api/v1/movies/{id}` - single movie details

#### Test list
- [ ] test TMDB client returns movies for valid region
- [ ] test caching prevents duplicate API calls within TTL
- [ ] test error handling for TMDB rate limits

### Phase 2: Frontend Movie Cards
- [ ] Update `MovieCard` component with poster, rating badge, genre badges
- [ ] Loading skeleton while images load
- [ ] Error fallback for missing poster

#### Test list
- [ ] test movie card renders poster from TMDB image URL
- [ ] test rating badge displays formatted score
```
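The 24-hour SQLite cache from Phase 1 boils down to a small pattern: key the request, return a fresh row if one exists, otherwise fetch and store. A sketch of that pattern (the table and function names are illustrative, and `fetch` stands in for the actual TMDB HTTP call):

```python
import json
import sqlite3
import time
from typing import Callable

TTL_SECONDS = 24 * 60 * 60  # 24h TTL, matching the plan


def cached_fetch(db: sqlite3.Connection, key: str,
                 fetch: Callable[[], dict], ttl: int = TTL_SECONDS) -> dict:
    """Return the cached response for `key` if fresh, else call `fetch` and store it."""
    db.execute("CREATE TABLE IF NOT EXISTS tmdb_cache "
               "(key TEXT PRIMARY KEY, payload TEXT, fetched_at REAL)")
    row = db.execute("SELECT payload, fetched_at FROM tmdb_cache WHERE key = ?",
                     (key,)).fetchone()
    if row and time.time() - row[1] < ttl:
        return json.loads(row[0])          # cache hit: no API call made
    payload = fetch()                      # cache miss or stale: hit the API
    db.execute("INSERT OR REPLACE INTO tmdb_cache VALUES (?, ?, ?)",
               (key, json.dumps(payload), time.time()))
    return payload
```

The cache key would encode region, provider, and page, so different `discover_movies` queries don't collide.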
This analysis and planning step, done before writing any code, has a noticeable positive impact on the quality of the agent’s work. The agent reads the existing codebase, identifies integration points, considers edge cases, and lays out a coherent sequence. The result is a plan that humans can review and that the agent can follow methodically.
The test lists, inspired by TDD, serve two purposes. For the agent, they
provide clear, checkable feedback at each step:
write the test, make it pass, check the box, move on. For humans, they act as executable documentation that reveals the
intent behind the code. When you read "test caching prevents duplicate API calls within TTL", you understand what the system does and why, without digging through the implementation.
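A test-list line like "caching prevents duplicate API calls within TTL" maps almost one-to-one onto a test. A sketch of how such a test might look (the client is faked here with a call counter; the real test would exercise the actual TMDB client):

```python
class CountingClient:
    """Stand-in for the TMDB client: counts outbound 'API calls'."""

    def __init__(self):
        self.api_calls = 0
        self._cache = {}

    def discover_movies(self, region: str, page: int) -> list:
        key = (region, page)
        if key not in self._cache:            # only a cache miss triggers an API call
            self.api_calls += 1
            self._cache[key] = [f"movie-{region}-{page}-{i}" for i in range(3)]
        return self._cache[key]


def test_caching_prevents_duplicate_api_calls_within_ttl():
    client = CountingClient()
    first = client.discover_movies("US", page=1)
    second = client.discover_movies("US", page=1)  # same request, within TTL
    assert first == second
    assert client.api_calls == 1               # second call served from cache
```

The test name carries the intent; the body makes it checkable by both the agent and CI.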
But what makes the implementation work well isn’t just the planning. It’s the quality of feedback the agent gets along the way.
The feedback stack
We care about feedback quality: fast, local, and honest. In our context, a test that takes 30 seconds to fail is a poor signal. A test that takes 80 milliseconds to fail is a useful one.
Here’s what the agent has access to, ordered from fastest to most comprehensive:
| Layer | Tool | Speed | What it catches |
|---|---|---|---|
| Syntax & style | just lint | ~100ms | Import errors, unused variables, style violations |
| Formatting | just fmt | ~100ms | Inconsistent formatting |
| Types | just typecheck | ~200ms | Type mismatches, missing attributes |
| Unit tests | pytest tests/unit/ | ~1s | Business logic errors |
| Integration tests | pytest tests/integration/ | ~5s | Database behavior, real PostgreSQL via testcontainers |
| All tests | just test | ~8s | Unit + integration together |
| Full check | just check | ~15s | Lint + format + types + all tests, in fail-fast order |
| Smoke tests | pytest tests/smoke/ | ~10s | Critical paths after deployment (health, core flow) |
| E2E tests | Playwright / agent-browser | ~30s | Full user journeys from a real browser |
| Preview deploy | CI/CD pipeline | ~3min | Infrastructure, networking, real environment |
Every layer is wrapped in a just target. These targets are discoverable (just -l), repeatable, and identical whether
a human or an agent runs them. The pre-commit hook runs just check automatically, so code that doesn’t pass never gets
committed.
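The fail-fast ordering in just check is worth spelling out: cheapest signal first, stop at the first failure. A sketch of the idea in Python (a real just recipe would shell out to the actual tools; the tool names here are illustrative):

```python
import subprocess

# Ordered fastest-first, so a ~100ms lint failure is never hidden
# behind an ~8s test run.
CHECKS = [
    ("lint",      ["ruff", "check", "."]),
    ("format",    ["ruff", "format", "--check", "."]),
    ("typecheck", ["mypy", "."]),
    ("tests",     ["pytest", "-q"]),
]


def run_checks(checks=CHECKS) -> int:
    """Run each check in order; stop and return its exit code at the first failure."""
    for name, cmd in checks:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"check failed: {name}")   # fail fast: skip the slower layers
            return result.returncode
    return 0
```

The pre-commit hook and CI both call the same single entry point, so the ordering is enforced everywhere at once.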
Local environment
The feedback stack above only works if the agent can run everything locally, quickly, and against a faithful replica of what will run in production.
One command starts the full stack:
just dev-local → PostgreSQL + Backend + Frontend ready in ~5s
Database migrations are evolutionary (Alembic), versioned, and tested. Integration
tests spin up a real PostgreSQL via testcontainers for each test function. No mock
databases. In our production projects, we also run local substitutes for cloud services we depend on (S3, SQS
via LocalStack containers), so the agent works against a complete, realistic
environment. The goal is a reproducible setup that starts fast, works the same for everyone on the team, and gives the
agent as much honest feedback as possible. Every layer from the feedback stack, from just lint to just test, returns
a real signal in seconds.
Beyond automated tests, the local environment also supports manual, exploratory testing by the agent. In our production
projects, where users have accounts with different roles and subscription levels, we seed the local database with test
accounts (just seed-test). Each account represents a specific user profile: a free user, a paying subscriber, an
organization admin. This gives the agent a ready-made setup to open a browser, log in as a specific user, and explore
targeted behaviors, without having to create accounts or set up state from scratch each time.
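A seed script for such profiles can stay tiny. A sketch (the profiles, fields, and `insert` callable are illustrative, not the actual just seed-test implementation):

```python
from dataclasses import dataclass


@dataclass
class TestAccount:
    email: str
    role: str
    plan: str


# One account per user profile the team cares about.
TEST_ACCOUNTS = [
    TestAccount("free@example.test", role="member", plan="free"),
    TestAccount("pro@example.test", role="member", plan="pro"),
    TestAccount("admin@example.test", role="org_admin", plan="pro"),
]


def seed(insert) -> int:
    """Insert every test account via the provided `insert` callable; return the count."""
    for account in TEST_ACCOUNTS:
        insert(account)
    return len(TEST_ACCOUNTS)
```

The agent can then log in as, say, the org admin and explore admin-only behavior without any manual setup.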
Preview apps: the safety net
When the agent pushes a branch and opens a PR, the CI/CD pipeline deploys a full preview environment at
demo-pr-{N}.cinematch.umans.ai.
The key: preview and production use the exact same Terraform code, selected by workspace. Same VPC, same ECS Fargate, same RDS PostgreSQL. The only difference is sizing. This means minimal divergence between environments, and a trustworthy signal: if it works in preview, it works in production.
The agent can also debug at this level. gh CLI for workflow logs and PR checks, aws CLI for ECS logs and service
status. If a deployment fails, the agent reads the logs, fixes the issue, and pushes again. Subsequent deploys take
under 30 seconds. When the PR closes, the preview is automatically destroyed.
Move 3: Working code becomes verified production
The CI/CD pipeline is green. Every test in the increment file’s plan is checked off. The preview is running. (We’ve since added an agent-to-agent review step before merging, but that’s a story for another article.)
Now the agent verifies. It can open the preview URL in a browser, click through the user flows, confirm the new behavior works end-to-end. It can check application logs for errors. It can verify the database migration ran cleanly.
When the agent is satisfied, it runs /backlog done. The increment file moves from in-progress/ to done/, a
git mv that preserves the full history of the file from the initial vague idea through every plan revision and
completion note.
```shell
git mv docs/backlog/in-progress/00003-tmdb-integration.md \
       docs/backlog/done/00003-tmdb-integration.md
git commit -m "chore: complete tmdb-integration ✅"
```
Then it merges the PR. The same pipeline that deployed the preview now deploys to production. Same Docker images, same Terraform code, different workspace.
One thing we value about this approach: everything is in git. The increment file, the plan, the test list, the
completion status. If a session ends mid-work (laptop dies, context window fills up, you switch machines) any agent can
pick it up. Open the increment file in in-progress/, read the checked and unchecked boxes, and continue. No state
lives in a chat window.
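Because the state lives in the file's checkboxes, "pick up where the last session left off" is mechanical. A sketch of reading the remaining work out of an increment file (a hypothetical helper, not part of CineMatch):

```python
import re

# Matches markdown task-list lines: "- [ ] item" or "- [x] item"
CHECKBOX = re.compile(r"^\s*-\s\[( |x)\]\s(.+)$")


def remaining_work(increment_text: str) -> list:
    """Return the unchecked items from an increment file, in order."""
    todo = []
    for line in increment_text.splitlines():
        m = CHECKBOX.match(line)
        if m and m.group(1) == " ":       # "[ ]" means still to do
            todo.append(m.group(2))
    return todo
```

Run against the Phase 2 section of the plan earlier in this article, this would surface the three unchecked card tasks as the next steps.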
You don’t build this in a day
Reading all of this, you might think we designed the workflow upfront and implemented it top-down. We didn’t.
We started with just a prompt. Literally talking to the model, asking it to write code. No conventions file. No skills.
No just targets.
Then we hit friction. The agent kept asking about the commit format, so we told it to document the convention in
CLAUDE.md. It kept running the wrong test command, so we told it to add a justfile. We kept repeating the same
backlog workflow steps, so we told it to write a skill for it. The agent forgot to run checks before committing, so we
told it to add a pre-commit hook. We described the problem, the agent built the solution.
Each addition came from a real problem, not a plan. And each one compounded. Once the conventions file existed, the
agent stopped asking questions. Once just check existed, every hook and CI step could call a single target. Once the
backlog skill existed, the whole idea-to-production loop became three commands.
This is the first article’s philosophy in practice. When you adopt the posture of investing in your repository, treating it as the knowledge base that agents consume, the workflow builds itself. Each piece of friction you eliminate stays eliminated, for every future session, for every agent on the team.
The progression looked something like this:
- Just prompt: talk to the model, get code
- CLAUDE.md: document conventions so the agent stops asking
- justfile: wrap commands so they're discoverable and repeatable
- Skills: encode multi-step workflows as reusable instructions
- Hooks: automate quality gates so nobody can skip them
- …
You can start at step 1 today and let the rest grow from friction. You don’t need to adopt a system. You grow one.
What’s next
This workflow handles how we ship today: from idea to production, one increment at a time. But we’re exploring what happens when you push further.
We’re currently experimenting with shipping increments remotely (even when we’re not at our desk) and tackling larger migrations with patterns like Strangler Fig, characterization tests, and parallel runs. We’ll share what we learn as we go.
CineMatch is open source. Clone it, run /onboard, and see the workflow
from the inside.
Appendix: anatomy of the agent folder
You don’t need all of this. But if you’re curious what the tooling supports, here’s the full surface area. Pick what solves a real problem for you.
```
CLAUDE.md                team instructions, committed
CLAUDE.local.md          personal overrides, gitignored
.claude/                 the control center (symlinked as .agents/)
├── settings.json          permissions + config, committed
├── settings.local.json    personal permissions, gitignored
├── commands/              custom slash commands
│   ├── review.md            → /project:review
│   ├── fix-issue.md         → /project:fix-issue
│   └── deploy.md            → /project:deploy
├── rules/                 modular instruction files
│   ├── code-style.md
│   ├── testing.md
│   └── api-conventions.md
├── skills/                auto-invoked workflows
│   ├── security-review/SKILL.md
│   └── deploy/SKILL.md
└── agents/                subagent personas
    ├── code-reviewer.md
    └── security-auditor.md
```
External perspectives
Others are converging on similar ideas:
- Harness Engineering — OpenAI’s take on structuring repos so AI agents can work effectively
- Skills for the Agents SDK — how OpenAI approaches reusable agent workflows