The Codex Loop: A Native App Dev Playbook
The bug that wrote this playbook
A previous build of the same app shipped 8 phases with green ticks. 61 tests passing. Screenshots in every chunk report. Done markers everywhere.
First launch was a black screen.
The contract had drifted three phases earlier. The same context that introduced the drift was the one validating the work. Self-audit can't catch this. The degraded session that produced the bug is the same one evaluating it.
The Codex loop is the answer. Separate the brain from the hands.
The split
The planning model holds the brain. Spec. Plan. Briefs. Verification judgment. One consistent context across the whole project.
The executor model does the hands. One chunk at a time. Tight contract. No memory between chunks. Cheap to dispatch, ruthless about scope.
You do the taste gates. Final eyeballs on visuals. Approval to proceed.
That's it. The rest is mechanics.
Shipped end-to-end on a recent native Mac app over three days. Two phases tagged clean. 123 tests green. Atmospherics visually approved on first build. Zero rebuilds.
The five artifacts
Every project produces these in order. Each is a distinct file with a distinct role.
1. The spec
What the thing is. Architecture. Decisions. Component table. The atmospheric and craft details.
Two inventories the spec must include:
- Craft inventory. The prototype details that aren't in the original brief. Every glow, asymmetry, animation curve. The stuff a wireframe spec misses and a built thing has.
- Staleness inventory. Paths, tools, values that drifted since prior work. Catch this before implementation, not during.
2. The plan
How to build it. Tasks. Tests. Commits.
One plan per phase. Each task is two to five minutes of work. Every task lists files-to-touch, steps, tests, and the exact commit message to use.
TDD discipline baked in. Write the failing test, run, implement, run, commit. The test is the spec.
A self-review checklist at the end maps every spec requirement to a task. If a requirement has no task, the plan isn't done.
3. The briefs
Dispatchable units of plan. One brief per chunk. Five to eight plan tasks each.
Frontmatter declares the phase, the plan tasks covered, what it depends on, and the executor tier. Body covers objective, scope, hard rules, stop conditions, and the done signal.
Briefs stay 60 to 95 lines. If a brief grows past that, it should have been two chunks.
4. The prompts
The actual messages sent to the executor. Every prompt is paste-ready. No fill-in-the-blanks.
Two flavours:
- Continuation prompt. "Chunk N verified. Proceed to chunk N+1, read brief X, same rules. Stop when done."
- Fresh-session prompt. Fully self-contained. Project context. All inputs to read. Full state verification. Hard rules. Stop conditions. Done signal. The executor can pick it up cold.
5. The handoff
End-of-session continuity doc. Written at natural break points or before context approaches limits.
Where things are right now. The next prompt to paste. Decisions made this session. Open items.
The next day's session starts cold and reads the handoff like a contract.
The dispatch loop
Verification is non-optional. Trust commit messages, but verify:
- Git log matches expected commit messages exactly.
- The file system has the files the brief promised.
- The build runs the test count the brief promised.
- Snapshot baselines committed without
isRecording = trueleft on. - For visual chunks: launch the app and look at it.
The brain holds the contract. The hands ship the code. The eyes approve the result.
Slice mode
The executor's auto-compaction failed mid-chunk twice during one phase. The pattern was clean:
- Chunks of 3 to 4 tasks: stable.
- Chunks of 5 or more tasks: about half the time, compaction crashes near task 5 or 7.
The fix is conceptual. Per-prompt scope is the rate limiter, not the chunk size. Briefs stay as documentation of milestone-shaped work. Prompts dispatch slices.
For a 7-task chunk, dispatch becomes:
- Prompt A: tasks 1 to 4. Stop. Wait.
- Verify. Paste prompt B.
- Prompt B: tasks 5 to 7. Stop. Wait.
- Verify. Move on.
The dispatch overhead of one extra continue prompt is much smaller than recovering from a compaction crash mid-chunk.
Plan corrections in flight
When the executor hits a divergence between spec and reality, good executor behaviour is to stop and surface.
This happened four times during phases 1A and 1B. It was the most valuable signal in the whole build.
The process:
- The executor stops and surfaces the divergence with evidence.
- The planner inspects. Git status, file diffs, test failures.
- The planner rules. Was the spec wrong, did the executor misread, or did reality change?
- If the spec was wrong, fix it inline and document in the handoff.
- Send a corrected prompt with the new authoritative info.
Real examples:
Token drift. The plan transcribed colour hex values best-effort from a digest of the design reference. The executor caught that 11 of 17 didn't match the canonical CSS. Ruling: CSS is canonical. Plan and tests updated inline. Execution resumed. Cost: 30 minutes. Saved: hours of wrong-colour rebuilds later.
Property wrappers on computed properties. The plan used a @Published pattern that doesn't compile in Swift. Property wrappers require stored properties. Ruling: switch to @Published stored with didSet for persistence. Plan corrected. Execution resumed. Cost: 15 minutes.
Polling test flakiness. The executor caught a timeout-versus-interval mismatch on the second test failure. Ruling: inject a deterministic dispatch timer. The executor chose the cleaner option. Cost: 20 minutes. Made every future polling test stable.
Snapshot regression risk. When a composition view replaced a frame, the earlier snapshot test would compare against the new rendering. The executor anticipated this before it broke and added an override path so the old snapshot kept passing. Foresighted recovery. Cost: zero. Preempted.
The pattern: don't paper over divergences. They surface real bugs in the contract. Treat them as signal, not noise.
Visual gates
Some chunks need human eyeballs, not test assertions.
Examples:
- The atmospherics chunk. Vignette top-anchored versus corner-anchored is invisible to test code.
- The terminal chunk. Glow treatment on the embedded TUI versus default.
- The dispatch baselines. Phase badge colours look right per the spec.
- The integration chunk. Does the whole thing feel built, or feel like a wireframe.
The pattern: the brief's done signal includes "Stop and ask for visual approval before declaring done." Don't auto-approve based on commit messages.
Test discipline
TDD as contract. The plan specifies tests first, then implementation. The test is the spec. Hard rule for the executor: don't change a test to make it pass. Fix the implementation, or surface the question.
This made the executor push back on three real plan errors instead of silently working around them.
Snapshot tests with recording discipline. Uncomment isRecording = true once to capture a baseline. Re-comment it before commit. Eyeball every baseline before it lands. Baselines become truth, so don't auto-rebaseline on regression. Surface the diff.
Manual smoke tests for the un-unit-testable. Metal kernels rendering. Embedded terminals spawning. Drag-and-drop. Live polling. The plan instructs the executor to launch the app and confirm specific properties before declaring a task done. The smoke step lives in the brief's done signal so it can't be skipped.
Deterministic timers in tests. Real timers make tests flaky. Inject the timer as a closure or protocol so tests can advance time. Apply once after the first flake, not after the third.
Executor tier mapping
| Tier | Use for |
|---|---|
| Medium | Boilerplate views. Value types and enums. Settings panes. Most of the work. |
| High | Cross-API choreography. Three-or-more-framework chunks. Multi-step contract refactors. The architectural-judgment chunks. |
Default to medium. Reach for high only when the chunk crosses three or more frameworks at once, or when it's the chunk where a class of bug previously bit.
Anti-patterns avoided
Self-audit. The executor evaluating its own work. Solved by the planner running verification with fresh context.
Done markers without proof. Every chunk's done signal includes test count, commit hashes, file paths, and visual evidence.
Plan-of-the-plan. Resist writing a separate methodology doc on top of the spec when the spec is already implementation-detailed. Save that for the playbook after the build, not before.
Blind retry on flaky tests. When the polling test failed, the response was not "run it again." It was "diagnose the root cause and architect the fix."
Anti-patterns that bit
Hardcoded paths from prior environments. Old absolute roots blocked dispatch on the migrated machine. Fix: env-var roots with auto-detect fallback. Never hardcode an absolute path. Always derive from the script's own location with an env var override.
Best-effort token transcription. Taking design tokens from a text digest instead of canonical CSS. Fix: treat the design reference itself as canonical. Read the prototype, don't paraphrase it.
Property wrappers on computed properties. They don't compile in Swift. Always store, then persist via didSet.
Day-one setup for a new app
To bootstrap this loop on a new Apple-platform project:
- Brainstorm the spec with the planner.
- Write the phase 1 plan with the planner.
- Carve the plan into briefs.
- Open a fresh executor session per chunk. Paste prompts in slices of three to four tasks.
- Verify after each slice. Git, tests, files, eyeballs.
- Commit a handoff at session end.
Repeat steps four to six until the phase ships. Then plan the next phase against the actual built thing, not against the spec's idealised vision.
The takeaway
The brain holds the contract. The hands ship the code. The eyes approve the result. Three roles, three contexts, one product.
Self-audit is a trap. The session that wrote the bug is the worst session to evaluate it. Verification has to come from a context that didn't produce the work.
Slice for compaction. Treat divergences as signal. Eyeball the visuals. Write the handoff before the session is done with you.
The shape stays stable: separate brain from hands, slice for safety, verify before approving, treat divergences as the most valuable signal in the build.