
Extreme Harness Engineering for Token Billionaires: 1M LOC, 1B toks/day, 0% human code, 0% human review

Latent Space · Latent Space hosts — Ryan Lopopolo (OpenAI Frontier & Symphony) · April 7, 2026

Most important takeaway

The era of humans writing production code is ending for certain team structures. Ryan Lopopolo’s team at OpenAI shipped over 1 million lines of code across 1,500 PRs with zero human-written code and now zero human code review, achieving roughly 5-10x the productivity of a single engineer. The key insight is that the engineer’s role shifts from writing code to systems thinking — constantly identifying where agents fail, encoding non-functional requirements as documentation and tests, and building scaffolding that makes the agent more autonomous over time.

Summary

Ryan Lopopolo, from OpenAI’s Frontier Product Exploration team (previously at Snowflake, Brex, Stripe, Citadel), spent five months building an internal Electron app with a hard constraint: he would write zero lines of code himself. Everything was done through Codex. The first 1.5 months were 10x slower than manual coding, but the investment in tooling and scaffolding paid off, eventually reaching 5-10x faster than a single engineer.

Actionable insights for engineers and teams:

  • Treat code as disposable. If an agent’s PR is bad, trash the entire worktree and start from scratch. Your low investment in authorship makes this painless. This mindset unlocks massive parallelism.
  • Encode all non-functional requirements as text the agent can see. Every mistake the agent makes reveals an unwritten standard. Write it down as docs, lints with helpful error messages, or test cases. This is the core loop of “harness engineering.”
  • Keep build times under one minute. The team migrated from Makefile to Bazel to Turbo to NX specifically because background shells in newer Codex models made the agent impatient with long builds. Fast feedback loops are critical for agent productivity.
  • Use only 5-6 skills (structured markdown prompts). Keep the agent’s instruction surface small and pour all team taste into those few skills rather than sprawling documentation.
  • Set up local observability from day one. They built a full metrics/traces/logs stack (VictoriaMetrics, etc.) in half an afternoon using MESSE. The agent needs to see its own output to self-correct.
  • Let review agents merge autonomously but instruct them to bias toward merging and only flag P0-P1 issues. Give code-authoring agents permission to push back on review feedback to prevent infinite loops of non-converging changes.
  • Internalize low-to-medium complexity dependencies (“ghost libraries”). When code is cheap to produce, you can strip a dependency down to only what you need, making security review and updates far simpler. This works for dependencies up to a few thousand lines currently.
  • Collect agent session logs at the team level. They aggregate all Codex trajectories into blob storage and run daily agent loops to identify team-wide improvement opportunities, then reflect those back into the repo.
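The "harness engineering" loop above — every agent mistake becomes a lint with an actionable error message — can be sketched as a tiny custom lint rule. The banned pattern, error code, and file paths here are hypothetical examples, not the team's actual rules; the point is that the error text tells the agent what to do instead.

```python
import re
from pathlib import Path

# Hypothetical team rule: no bare fetch() calls; route everything through a
# wrapped client so retries, timeouts, and tracing are applied uniformly.
RULE = re.compile(r"\bfetch\(")
HINT = (
    "E101: direct fetch() is banned. Use httpClient.request() from "
    "src/net/client so retries, timeouts, and tracing apply. "
    "See docs/networking.md."
)

def lint_file(path: Path) -> list[str]:
    """Return one actionable error per offending line."""
    errors = []
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        if RULE.search(line):
            errors.append(f"{path}:{lineno}: {HINT}")
    return errors
```

Because the message names the replacement API and the doc to read, the agent can fix the violation without a human re-explaining the standard.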

Career advice: The role of the software engineer is shifting from code author to something closer to a group tech lead of a 500-person org. You don’t review every PR; you sample representative code, identify systemic issues, and invest in tooling and architecture. Don’t bet against the models — they are pushing into higher complexity tasks with each release. The humans who thrive will be the ones working on pure whitespace problems and the deepest architectural refactorings.

Symphony is an Elixir-based orchestration system the team built to remove humans from the terminal loop entirely. It manages agent lifecycles: spawning Codex instances, driving PRs to merge, handling flakes, rebasing, and escalating to humans only for binary merge/rework decisions. The model chose Elixir because its process supervision trees and GenServers naturally map to agent orchestration. This pushed the team from 3.5 PRs/engineer/day to 5-10+.

OpenAI Frontier is the enterprise platform for deploying agents safely at scale. It provides governance dashboards, safety spec integration (via the GPT-OSS safeguard model), connector frameworks for enterprise IAM and security tooling, and deep agent trajectory inspection. The buyer personas are IT, GRC/governance teams, AI innovation offices, and security teams — the people accountable for safe deployment, not just the end users.

On model trajectory: Models still struggle with net-new product prototyping (going from mockup to playable product in one shot) and the gnarliest architectural refactorings. But the complexity ceiling rises with each model release. GPT-5.4 is highlighted as the first model merging top-tier coding with top-tier general reasoning in one model, with 1M token context being a game-changer for long agentic sessions.


Chapter Summaries

Chapter 1: The Zero Human Code Constraint

Ryan introduces his role at OpenAI Frontier Product Exploration and explains the self-imposed constraint of writing zero code. Starting with early Codex CLI and less capable models, the first month was painfully slow (10x slower), but the team built small building blocks and decomposed tasks whenever the model failed, creating an “assembly station” that eventually surpassed individual engineer productivity.

Chapter 2: Build System Evolution and Agent Patience

The team migrated through four build systems (Makefile to Bazel to Turbo to NX) to keep builds under one minute. This was necessary because newer Codex models with background shell support became less patient with blocking operations. The one-minute constraint acts as a ratchet, forcing continuous build graph decomposition and maintaining fast inner loops.
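One way to make the one-minute constraint act as a ratchet is to enforce the budget mechanically in CI, so any build-graph regression fails loudly. The talk doesn't describe their exact enforcement; this is a minimal sketch of that idea, with the 60-second budget taken from the episode and everything else assumed.

```python
import subprocess
import sys
import time

BUDGET_SECONDS = 60.0  # the team's one-minute ceiling

def timed_build(cmd: list[str], budget: float = BUDGET_SECONDS) -> float:
    """Run the build command; hard-fail if it exceeds the time budget.

    subprocess.run's timeout kills the process and raises TimeoutExpired,
    so a slow build can never silently creep past the ceiling.
    """
    start = time.monotonic()
    subprocess.run(cmd, check=True, timeout=budget)
    elapsed = time.monotonic() - start
    print(f"build took {elapsed:.1f}s (budget {budget:.0f}s)")
    return elapsed
```

Run as the first CI step (e.g. `timed_build(["nx", "run-many", "-t", "build"])`); when it trips, the fix is decomposing the build graph, not raising the budget.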

Chapter 3: Removing Humans from Code Review

Human review became the bottleneck at 1,500 PRs. The team moved to post-merge review only, treating it more as spot-checking than gatekeeping. They deployed automated review agents on PRs, instructing them to bias toward merging and only surface high-priority issues. Code-authoring agents were given permission to defer or push back on reviewer feedback to prevent non-converging loops.
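The episode describes the review agent's instructions but doesn't show the actual prompt; a plausible sketch of such a policy, written as the kind of structured markdown the team uses for skills, might look like:

```markdown
# review-agent policy (illustrative — the real prompt is not shown in the talk)

- Bias toward merging. A mergeable-but-imperfect PR beats a blocked queue.
- Only flag P0 (correctness, data loss, security) and P1 (violates a
  documented guardrail) issues. Everything else becomes a follow-up task,
  never a blocker.
- State each finding once. If the authoring agent pushes back with a
  reasoned deferral, accept it — do not re-raise the same point.
- Never request stylistic changes a lint could enforce; propose the lint
  instead.
```

The single-statement and deferral rules are what break the non-converging review loops the team hit.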

Chapter 4: Scaffolding, Skills, and Agent Context

Instead of putting agents in predefined state-machine boxes, the team gives the reasoning model the full context and lets it choose how to proceed. They use a short agent.md, a core_beliefs.md (containing team info, product vision, customer segments), and about six focused skills. A quality score tracker skill lets Codex review business logic against documented guardrails and propose follow-up work for itself.
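The talk describes skills as structured markdown prompts but doesn't show one; a hypothetical shape for the quality score tracker skill (frontmatter fields, file paths, and scoring scale all assumed) could be:

```markdown
---
name: quality-score-tracker
description: Review business logic against documented guardrails and queue follow-up work.
---

1. Read docs/guardrails.md and the modules touched by the current diff.
2. Score each guardrail 0-5 for those modules; append the scores to
   reports/quality.md.
3. For any score below 4, open a follow-up task describing the gap and the
   smallest change that would close it.
```

Keeping the instruction surface this small is the point: five or six such files carry all the team's taste.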

Chapter 5: Local Observability and Owning the Full Loop

The team built a local observability stack (traces, logs, metrics) in half an afternoon. The agent authors Grafana dashboard JSON, publishes dashboards, and responds to pages. When an outage occurs, the agent already knows which dashboards, alerts, and exact log lines in the codebase are relevant, enabling it to fix gaps in monitoring and code simultaneously.
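Since the agent authors dashboards as Grafana JSON, it can diff and review them like any other code. A minimal illustrative dashboard (panel, query, and service name are invented for the example):

```json
{
  "title": "checkout-service overview (illustrative)",
  "panels": [
    {
      "type": "timeseries",
      "title": "HTTP 5xx rate",
      "targets": [
        { "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m]))" }
      ]
    }
  ],
  "schemaVersion": 39
}
```

Because the dashboard lives in the repo next to the code emitting the metrics, the agent can patch a monitoring gap and the underlying bug in the same PR.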

Chapter 6: Symphony — Orchestrating Agent Swarms

Symphony is an Elixir-based system built to remove humans from the terminal. The model chose Elixir for its process supervision primitives. It manages the full PR lifecycle: spawning Codex, pushing PRs, waiting for CI, fixing flakes, merging, and escalating to humans only for merge/rework decisions. A rework state trashes the entire worktree and restarts from scratch. This pushed throughput from 3.5 to 5-10+ PRs per engineer per day.
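The PR lifecycle described above can be sketched as a transition table. Symphony itself is built on Elixir GenServers; this Python sketch only illustrates the states and events named in the episode, and the event names themselves are assumptions.

```python
from enum import Enum, auto

class PRState(Enum):
    SPAWN = auto()     # start a Codex instance on the task
    PUSH = auto()      # open the PR
    CI = auto()        # wait on checks; rerun flakes without escalating
    REVIEW = auto()    # automated review agent
    ESCALATE = auto()  # human makes the binary merge/rework call
    MERGED = auto()
    REWORK = auto()    # trash the worktree, restart from scratch

# Illustrative transition table for the lifecycle Symphony drives.
TRANSITIONS = {
    (PRState.SPAWN, "diff_ready"): PRState.PUSH,
    (PRState.PUSH, "pr_opened"): PRState.CI,
    (PRState.CI, "flake"): PRState.CI,
    (PRState.CI, "green"): PRState.REVIEW,
    (PRState.REVIEW, "approved"): PRState.MERGED,
    (PRState.REVIEW, "blocked"): PRState.ESCALATE,
    (PRState.ESCALATE, "merge"): PRState.MERGED,
    (PRState.ESCALATE, "rework"): PRState.REWORK,
    (PRState.REWORK, "fresh_worktree"): PRState.SPAWN,
}

def next_state(state: PRState, event: str) -> PRState:
    return TRANSITIONS[(state, event)]
```

Note the only human-facing state is ESCALATE, and its "rework" edge discards the worktree entirely rather than patching it — the disposable-code mindset encoded in the state machine.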

Chapter 7: Ghost Libraries and Internalizing Dependencies

The team advocates internalizing low-to-medium complexity dependencies (up to a few thousand lines). By stripping away generic parts and keeping only what’s needed, security review becomes simpler and patching is lower friction. They distribute their own work as “specs” (ghost libraries) — documents detailed enough for a coding agent to reproduce the system locally.

Chapter 8: Self-Improving Agent Teams

Agent session logs from all team members are collected into blob storage. Daily agent loops analyze these trajectories to find team-wide improvement opportunities, which are reflected back into the repository. PR comments, failed builds, and review feedback are all treated as signals that the agent was missing context, feeding a continuous improvement cycle.
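The daily loop over collected trajectories can be sketched as ranking recurring "missing context" signals. The three signal categories come from the episode; the session schema and extraction logic are assumptions for illustration.

```python
from collections import Counter

# Events treated as evidence the agent lacked context (categories from the
# talk; the topic field and session shape are illustrative).
SIGNALS = ("pr_comment", "failed_build", "review_feedback")

def improvement_opportunities(
    sessions: list[dict], top_n: int = 3
) -> list[tuple[str, int]]:
    """Rank recurring failure signals across all collected trajectories."""
    counts: Counter[str] = Counter()
    for session in sessions:
        for event in session.get("events", []):
            if event.get("kind") in SIGNALS:
                counts[f'{event["kind"]}:{event.get("topic", "unknown")}'] += 1
    return counts.most_common(top_n)
```

The top-ranked signals become candidate repo changes — a new lint, a doc, or a skill edit — closing the loop described above.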

Chapter 9: What Models Still Cannot Do

Models struggle with net-new product prototyping (translating a mockup into a playable product in one shot) and the most complex architectural refactorings. However, each model release pushes into higher complexity. GPT-5.4 is the first to combine top-tier coding and general reasoning, and its 1M token context dramatically extends agentic session length before compaction.

Chapter 10: OpenAI Frontier — Enterprise Agent Platform

Frontier is OpenAI’s enterprise platform for deploying agents safely at scale with governance, observability, and safety specs. It integrates with enterprise IAM, security tooling, and workspace tools. The dashboard lets IT/GRC/security teams drill down from organizational agent activity to individual agent trajectories. The Agents SDK is a core component enabling both startups and enterprises to build reliable, composable agents.

Chapter 11: Closing Thoughts and the Future

Ryan emphasizes building “on-policy” harnesses that work with the model’s natural output (code, tests) rather than constraining it with external scaffolds. The Codex team’s relentless shipping pace (GPT-5.3, Spark, GPT-5.4 within roughly a month) is accelerating everything. OpenAI is hiring for the new Bellevue office, and Codex has passed 2 million weekly active users growing 25% week over week.