How We Actually Ship with AI

In our previous article, we shared a mindset: stop being your AI agent’s assistant. Stop validating every line. Stop babysitting. Instead, invest in your repository, make it the single source of truth, and let the agent work autonomously.

That article resonated, but it raised a fair question: what does the day-to-day actually look like?

This is that article. The concrete workflow we use to go from a vague idea to code running in production, with AI agents doing the heavy lifting.

We built CineMatch, an open-source movie-matching app, as a living practice repo. The patterns come from our production work, but CineMatch is where we demonstrate them publicly. Anyone can clone it, run it, and see the whole workflow in action.

The setup: one repo, any agent

Our team doesn’t use one tool. One of us works primarily in Claude Code. Another switches between opencode, Codex, and whatever ships that week. It doesn’t matter.

What matters is the repository. The agent conventions, the workflow descriptions, the development commands: they all live in files that any coding agent can read. There’s a CLAUDE.md at the root (symlinked to AGENTS.md) that describes the project, its conventions, and the available commands. There are skill files that describe repeatable workflows. There are justfile targets that wrap every development action into discoverable, one-word commands.

We chose to keep things simple. A coding agent that can read and edit files, run shell commands, and browse the web is enough for our workflow. So we invest our time in the repo itself: clear conventions, good test coverage, fast feedback loops. That structure is what makes any agent effective, regardless of which tool runs it.
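To make the "one-word commands" idea concrete, here is the shape such a justfile might take. This is a sketch: the target names (`lint`, `fmt`, `typecheck`, `test`, `check`) come from this article, but the tool choices (ruff, mypy, pytest) are illustrative, not necessarily what CineMatch uses.

```just
# Discoverable with `just -l`; identical for humans and agents.

# syntax & style (~100ms)
lint:
    ruff check .

# auto-format
fmt:
    ruff format .

# static types (~200ms)
typecheck:
    mypy .

# unit + integration tests
test:
    pytest

# fail-fast quality gate: cheapest checks run first
check: lint fmt typecheck test
```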

*(Figure: Claude Code, Codex, opencode — any agent — all point at your repo: CLAUDE.md/AGENTS.md for conventions and commands, a justfile for discoverable targets, .claude/skills/ and .agents/skills/ for reusable workflows, docs/ for specs, backlog, and decisions. Feedback comes from `just check` and `just dev` locally, the `gh` and `aws` CLIs for remote logs and debugging, and Playwright/agent-browser for e2e in a real browser. The investment is in the repo. The agent is interchangeable.)*

Three moves

Our delivery workflow has three phases. Each maps to a single command and a git operation that moves an increment file from one directory to the next:

docs/backlog/todo/          →  Idea becomes spec
docs/backlog/in-progress/   →  Spec becomes working code
docs/backlog/done/          →  Code becomes verified production
*(Figure: the increment file `00003-tmdb-integration` starts in `todo/` with a goal, ship criteria, and open uncertainties; `/backlog start` moves it to `in-progress/`, where phases and their test lists get checked off one by one; `/backlog done` moves it to `done/` with all phases complete and all tests green. One file, three directories. git mv preserves the full history.)*

The increment file travels with the work. git mv preserves its history. At any point, anyone on the team (human or agent) can open it and understand what’s happening, what’s been decided, and what’s left.
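In practice, the three moves are three slash commands. The commands below are the ones described in this article; their implementations live in the repo's skill files:

```
/backlog new tmdb-integration   # Move 1: idea becomes a spec in docs/backlog/todo/
/backlog start                  # Move 2: spec moves to in-progress/, implementation begins
/backlog done                   # Move 3: verified work moves to done/
```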

Move 1: Idea becomes spec

It usually starts with a sentence. Sometimes a half-formed thought, full of typos, spoken out loud to the agent:

“we need real movie data not the hardcoded list, something with posters and ratings, maybe trailers too, from tmdb”

The agent takes this and runs /backlog new tmdb-integration. What comes back is a structured increment file:

# TMDB API Integration

## Goal

Replace static movie list with dynamic TMDB-powered movie catalog
with real posters, IMDB ratings, and trailers.

## Context

Current implementation uses a hardcoded static list of 4 movies.
Users need variety and rich movie data for engaging swiping experience.

## Ship Criteria

- [ ] TMDB API client with caching
- [ ] Backend endpoints for fetching movies by region/provider
- [ ] Movie cards show: poster, title, year, IMDB rating, genre badges
- [ ] Detail modal: backdrop, full overview, trailer button
- [ ] "Swipe again" mode for fresh movies after exhausting current list

## Uncertainties

- [ ] Rate limits on TMDB free tier?
- [ ] Cache strategy (SQLite vs in-memory)?

## Implementation Plan

(filled in during Move 2)

What happened here is important. The agent didn’t just format our words. It made the implicit explicit. It read the existing codebase (the static movie list, the current card component, the database schema) and translated a vague idea into something actionable. It surfaced uncertainties we hadn’t thought to mention. It proposed ship criteria that reflect what “done” actually looks like.

This increment file is an artifact you can review. You can check how your intent was interpreted, what assumptions were made, and whether the scope feels right. It takes thirty seconds to scan. If something’s off, you say so and the agent adjusts.

Then it gets committed and pushed to main. Now the whole team sees it. Their agents see it too. If someone starts related work, the context is already there, not in a chat log that evaporated, but in git.

Move 2: Spec becomes working code

When we’re ready to start, the agent runs /backlog start. The increment file moves from todo/ to in-progress/, and the real work begins.

First, the agent fills in the Implementation Plan. This is where the increment file becomes a detailed roadmap, broken into phases. Each phase describes the work to be done, the technical decisions involved, and a list of tests that define the expected behaviors:

## Implementation Plan

### Phase 1: Backend TMDB Client ✅

- [x] Add TMDB API key to environment config
- [x] Create `backend/app/services/tmdb.py` client
    - `discover_movies(region, provider, page)` - fetch with filters
    - `get_movie_details(tmdb_id)` - full details + trailer
- [x] Cache responses in SQLite (24h TTL)
- [x] New endpoints:
    - `GET /api/v1/movies?region=US&page=1` - paginated movie list
    - `GET /api/v1/movies/{id}` - single movie details

#### Test list

- [ ] test TMDB client returns movies for valid region
- [ ] test caching prevents duplicate API calls within TTL
- [ ] test error handling for TMDB rate limits

### Phase 2: Frontend Movie Cards

- [ ] Update `MovieCard` component with poster, rating badge, genre badges
- [ ] Loading skeleton while images load
- [ ] Error fallback for missing poster

#### Test list

- [ ] test movie card renders poster from TMDB image URL
- [ ] test rating badge displays formatted score

This analysis and planning step, done before writing any code, has a noticeable positive impact on the quality of the agent’s work. The agent reads the existing codebase, identifies integration points, considers edge cases, and lays out a coherent sequence. The result is a plan that humans can review and that the agent can follow methodically.

The test lists, inspired by TDD, serve two purposes. For the agent, they provide clear, checkable feedback at each step: write the test, make it pass, check the box, move on. For humans, they act as executable documentation that reveals the intent behind the code. When you read `test caching prevents duplicate API calls within TTL`, you understand what the system does and why, without digging through the implementation.
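As an illustration of what such a test pins down, here is a minimal sketch of the TTL-caching behavior with the client reduced to its essentials. This is not CineMatch's actual TMDB client: the class name, the injected clock, and the fake fetch function are all constructs for the example.

```python
import time

class CachedClient:
    """Wraps a fetch function with a per-key TTL cache (sketch, not the real client)."""

    def __init__(self, fetch, ttl_seconds=24 * 3600, clock=time.monotonic):
        self.fetch = fetch
        self.ttl = ttl_seconds
        self.clock = clock          # injected so tests can control time
        self._cache = {}            # key -> (expires_at, value)

    def discover_movies(self, region):
        now = self.clock()
        hit = self._cache.get(region)
        if hit and hit[0] > now:    # fresh entry: no upstream call
            return hit[1]
        value = self.fetch(region)  # cache miss or expired: call upstream
        self._cache[region] = (now + self.ttl, value)
        return value

def test_caching_prevents_duplicate_api_calls_within_ttl():
    calls = []
    fake_now = [0.0]
    client = CachedClient(
        fetch=lambda region: calls.append(region) or ["Dune"],
        ttl_seconds=24 * 3600,
        clock=lambda: fake_now[0],
    )
    assert client.discover_movies("US") == ["Dune"]
    assert client.discover_movies("US") == ["Dune"]  # within TTL: served from cache
    assert calls == ["US"]                           # upstream hit exactly once
    fake_now[0] += 24 * 3600 + 1                     # advance past the TTL
    client.discover_movies("US")
    assert calls == ["US", "US"]                     # expired: upstream hit again

test_caching_prevents_duplicate_api_calls_within_ttl()
```

Note how the test name alone tells the story; the fake clock is just what makes that story checkable in milliseconds rather than by waiting 24 hours.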

But what makes the implementation work well isn’t just the planning. It’s the quality of feedback the agent gets along the way.

The feedback stack

We care about feedback quality: fast, local, and honest. In our context, a test that takes 30 seconds to fail is a poor signal. A test that takes 80 milliseconds to fail is a useful one.

Here’s what the agent has access to, ordered from fastest to most comprehensive:

| Layer | Tool | Speed | What it catches |
|---|---|---|---|
| Syntax & style | `just lint` | ~100ms | Import errors, unused variables, style violations |
| Formatting | `just fmt` | ~100ms | Inconsistent formatting |
| Types | `just typecheck` | ~200ms | Type mismatches, missing attributes |
| Unit tests | `pytest tests/unit/` | ~1s | Business logic errors |
| Integration tests | `pytest tests/integration/` | ~5s | Database behavior, real PostgreSQL via testcontainers |
| All tests | `just test` | ~8s | Unit + integration together |
| Full check | `just check` | ~15s | Lint + format + types + all tests, in fail-fast order |
| Smoke tests | `pytest tests/smoke/` | ~10s | Critical paths after deployment (health, core flow) |
| E2E tests | Playwright / agent-browser | ~30s | Full user journeys from a real browser |
| Preview deploy | CI/CD pipeline | ~3min | Infrastructure, networking, real environment |
*(Figure: lint · fmt ~100ms → typecheck ~200ms → unit tests ~1s → integration ~5s → smoke ~10s → e2e ~30s → preview ~3min. Fastest signal first. Cheap checks catch most errors before expensive ones run.)*

Every layer is wrapped in a just target. These targets are discoverable (just -l), repeatable, and identical whether a human or an agent runs them. The pre-commit hook runs just check automatically, so code that doesn’t pass never gets committed.
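A pre-commit hook of this kind can be as small as the sketch below (the repo may install it via a hook manager instead; the path and shebang are the standard git defaults):

```shell
#!/bin/sh
# .git/hooks/pre-commit — the commit is refused unless the full check passes,
# because git aborts the commit when this script exits non-zero.
just check
```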

Local environment

The feedback stack above only works if the agent can run everything locally, quickly, and against a faithful replica of what will run in production.

One command starts the full stack:

just dev-local  →  PostgreSQL + Backend + Frontend ready in ~5s

Database migrations are evolutionary (Alembic), versioned, and tested. Integration tests spin up a real PostgreSQL via testcontainers for each test function. No mock databases. In our production projects, we also run local substitutes for cloud services we depend on (S3, SQS via LocalStack containers), so the agent works against a complete, realistic environment. The goal is a reproducible setup that starts fast, works the same for everyone on the team, and gives the agent as much honest feedback as possible. Every layer from the feedback stack, from just lint to just test, returns a real signal in seconds.

Beyond automated tests, the local environment also supports manual, exploratory testing by the agent. In our production projects, where users have accounts with different roles and subscription levels, we seed the local database with test accounts (just seed-test). Each account represents a specific user profile: a free user, a paying subscriber, an organization admin. This gives the agent a ready-made setup to open a browser, log in as a specific user, and explore targeted behaviors, without having to create accounts or set up state from scratch each time.

Preview apps: the safety net

When the agent pushes a branch and opens a PR, the CI/CD pipeline deploys a full preview environment at demo-pr-{N}.cinematch.umans.ai.

The key: preview and production use the exact same Terraform code, selected by workspace. Same VPC, same ECS Fargate, same RDS PostgreSQL. The only difference is sizing. This means minimal divergence between environments, and a trustworthy signal: if it works in preview, it works in production.

The agent can also debug at this level. gh CLI for workflow logs and PR checks, aws CLI for ECS logs and service status. If a deployment fails, the agent reads the logs, fixes the issue, and pushes again. Subsequent deploys take under 30 seconds. When the PR closes, the preview is automatically destroyed.
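The debugging loop looks roughly like this. The PR number and log group name are placeholders, and the exact commands will depend on the project's setup; the flags themselves are standard `gh` and `aws` CLI options:

```shell
gh pr checks 42 --watch            # live CI status for PR #42
gh run view --log-failed           # logs from the failing workflow step
aws logs tail /ecs/cinematch-demo-pr-42 \
    --since 15m --follow           # tail the preview service's logs (hypothetical log group)
```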

Move 3: Working code becomes verified production

The CI/CD pipeline is green. Every test in the increment file’s plan is checked off. The preview is running. (We’ve since added an agent-to-agent review step before merging, but that’s a story for another article.)

Now the agent verifies. It can open the preview URL in a browser, click through the user flows, confirm the new behavior works end-to-end. It can check application logs for errors. It can verify the database migration ran cleanly.

When the agent is satisfied, it runs /backlog done. The increment file moves from in-progress/ to done/, a git mv that preserves the full history of the file from the initial vague idea through every plan revision and completion note.

git mv docs/backlog/in-progress/00003-tmdb-integration.md \
       docs/backlog/done/00003-tmdb-integration.md
git commit -m "chore: complete tmdb-integration ✅"

Then it merges the PR. The same pipeline that deployed the preview now deploys to production. Same Docker images, same Terraform code, different workspace.

One thing we value about this approach: everything is in git. The increment file, the plan, the test list, the completion status. If a session ends mid-work (laptop dies, context window fills up, you switch machines) any agent can pick it up. Open the increment file in in-progress/, read the checked and unchecked boxes, and continue. No state lives in a chat window.

You don’t build this in a day

Reading all of this, you might think we designed the workflow upfront and implemented it top-down. We didn’t.

We started with just a prompt. Literally talking to the model, asking it to write code. No conventions file. No skills. No just targets.

Then we hit friction. The agent kept asking about the commit format, so we told it to document the convention in CLAUDE.md. It kept running the wrong test command, so we told it to add a justfile. We kept repeating the same backlog workflow steps, so we told it to write a skill for it. The agent forgot to run checks before committing, so we told it to add a pre-commit hook. We described the problem, the agent built the solution.

Each addition came from a real problem, not a plan. And each one compounded. Once the conventions file existed, the agent stopped asking questions. Once just check existed, every hook and CI step could call a single target. Once the backlog skill existed, the whole idea-to-production loop became three commands.

This is the first article’s philosophy in practice. When you adopt the posture of investing in your repository, treating it as the knowledge base that agents consume, the workflow builds itself. Each piece of friction you eliminate stays eliminated, for every future session, for every agent on the team.

The progression looked something like this:

  1. Just prompt: talk to the model, get code
  2. CLAUDE.md: document conventions so the agent stops asking
  3. justfile: wrap commands so they’re discoverable and repeatable
  4. Skills: encode multi-step workflows as reusable instructions
  5. Hooks: automate quality gates so nobody can skip them

You can start at step 1 today and let the rest grow from friction. You don’t need to adopt a system. You grow one.

What’s next

This workflow handles how we ship today: from idea to production, one increment at a time. But we’re exploring what happens when you push further.

We’re currently experimenting with shipping increments remotely (even when we’re not at our desk) and tackling larger migrations with patterns like Strangler Fig, characterization tests, and parallel runs. We’ll share what we learn as we go.

CineMatch is open source. Clone it, run /onboard, and see the workflow from the inside.


Appendix: anatomy of the agent folder

You don’t need all of this. But if you’re curious what the tooling supports, here’s the full surface area. Pick what solves a real problem for you.

your-project/
├── CLAUDE.md                    team instructions, committed
├── CLAUDE.local.md              personal overrides, gitignored
└── .claude/                     the control center (mirrored as .agents/)
    ├── settings.json            permissions + config, committed
    ├── settings.local.json      personal permissions, gitignored
    ├── commands/                custom slash commands
    │   ├── review.md            → /project:review
    │   ├── fix-issue.md         → /project:fix-issue
    │   └── deploy.md            → /project:deploy
    ├── rules/                   modular instruction files
    │   ├── code-style.md
    │   ├── testing.md
    │   └── api-conventions.md
    ├── skills/                  auto-invoked workflows
    │   ├── security-review/SKILL.md
    │   └── deploy/SKILL.md
    └── agents/                  subagent personas
        ├── code-reviewer.md
        └── security-auditor.md

Everything your agents need to know about the project lives right here. Commit it to git.

External perspectives

Others are converging on similar ideas: