№ 05 · Learn With Darin · Field Guide
Codex: a practitioner’s field guide.
OpenAI's two coding agents under one name: the codex CLI in your terminal and the cloud agent at chatgpt.com/codex. What each one's good at, what it costs, and which tasks belong to which.
What "Codex" means in 2026
"Codex" has been three different products at OpenAI over five years. If you don't know which one someone is talking about, you'll guess wrong half the time. The current state:
- Codex CLI: a terminal coding agent, OpenAI's answer to Claude Code. You install it, you run codex, it edits files in your repo, it iterates on code. Local, sandboxed, runs against your laptop.
- Codex (cloud): a SaaS coding agent at chatgpt.com/codex. You connect a GitHub repo, give it a task, and it works in a cloud sandbox to produce a pull request. Distributed, parallel-capable, runs in OpenAI's environment.
- Codex IDE extensions: the same agent surfaced in VS Code, JetBrains, and Cursor, sharing your task list across surfaces. Same account, same models.
(The Codex API from 2021, the original code-completion model, was deprecated in 2023 and is gone. If you read an article that mentions "Codex" without a date, suspect the article.)
The thing to internalize: the CLI and the cloud agent are not competitors—they're collaborators. The CLI is for the work you want to watch happen on your machine. The cloud agent is for the work you want to outsource to a sandbox while you do something else. Knowing when to reach for which is the whole game.
The Codex CLI
If you've used Claude Code, the Codex CLI will feel familiar in shape: a terminal-resident agent that reads your repo, plans changes, edits files, and runs commands inside a sandbox. The vocabulary is different but the loop is the same. Prompt, plan, edit, verify, repeat.
Install & run
Codex CLI ships as a single binary on macOS, Windows, and Linux. Install via the OpenAI installer or via Homebrew on Mac. On Windows you have two paths: native PowerShell with the Windows sandbox, or WSL2 for a Linux-native environment. Both work; pick based on which one matches your existing dev setup.
```shell
# macOS / Linux
curl -sSL https://codex.openai.com/install.sh | sh

# macOS via Homebrew
brew install openai/tap/codex

# Windows (PowerShell)
iwr -useb https://codex.openai.com/install.ps1 | iex

# Verify
codex --version
codex login
```
The first codex invocation in a project does a one-time scan and prompts you to write an AGENTS.md. Just like Claude Code's CLAUDE.md, this is the project-level instructions file. Unlike CLAUDE.md, AGENTS.md is cross-tool. Codex, Cursor, Aider, and an increasing number of other agents all read the same file. If you've ever wanted one config file your tools agree on, this is the closest thing to it.
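If you'd rather seed the file yourself before that first run, a minimal AGENTS.md might look like the following. The contents are an illustration of the kind of instructions that belong there, not what codex generates verbatim:

```shell
# Seed a minimal AGENTS.md by hand (contents are an example, not
# generated output; adapt the conventions to your own repo).
cat > AGENTS.md <<'EOF'
# Agent instructions

- Run tests with `npm test` before declaring a task done.
- TypeScript strict mode is on; do not introduce `any`.
- Never edit files under `generated/`; they are build outputs.
EOF
```

The exact sections are up to you; the point is that every agent reading the repo gets the same ground rules from one file.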
The model picker
Codex CLI lets you switch models with /model. Defaults and options as of May 2026:
- GPT-5.5: the recommended frontier model for complex work. Default on most installs.
- GPT-5.4: slightly older, faster, cheaper per-token. Good for repetitive edits.
- GPT-5.3-Codex: code-tuned variant of GPT-5.3, lighter than 5.5 but still solid for everyday coding.
- Reasoning levels: for the frontier models you can set /reasoning low|medium|high; high tells the model to think longer before acting.
For most tasks, leave it on GPT-5.5 medium and don't worry about it. The reasoning-level lever matters when you're doing something genuinely novel: refactors that span the codebase, debugging that needs hypotheses, architectural changes. For tweaks, low or medium is faster.
Sandboxing
Codex CLI's sandbox is one of its strongest features. By default, every command Codex wants to run is constrained:
macOS
- Uses Apple's built-in Seatbelt framework, the same sandbox that powers App Sandbox.
- File access is scoped to the working directory by default; network is gated.
- Permission prompts surface inline in the terminal: "Codex wants to run X. Allow?"
- No additional install needed; Seatbelt is part of macOS.
Windows / Linux
- Windows: native Windows Sandbox when you're in PowerShell, or Linux sandbox when you're in WSL2.
- Linux: namespace + cgroup isolation similar to a container.
- The Windows Sandbox feature has to be enabled in Windows Features. Codex prompts you on first run if it's off.
- Network is gated by default in all profiles.
Permission profiles you can choose between (codex --profile):
- workspace: read/write within the working directory only, no network. The default and the safe choice.
- default: slightly broader access than workspace; can read system files but still gated.
- full-access: the escape hatch. Codex runs without the sandbox. Use only for tasks that genuinely need it (e.g. spinning up a subprocess that itself needs root). Codex prompts you every session before allowing this.
Treat full-access the way you'd treat sudo: minimum necessary, never as a default.
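The profile choice can be made mechanical. Here's a small hypothetical helper, not part of the CLI, that always picks the least-privileged profile for a task and deliberately never returns full-access:

```shell
# Hypothetical helper: map a task's needs to the least-privileged
# Codex profile. Profile names (workspace, default) come from the
# CLI's own options; the helper is an illustration, not a Codex feature.
profile_for_task() {
  local needs_network="$1" needs_system_read="$2"
  if [ "$needs_network" = "yes" ] || [ "$needs_system_read" = "yes" ]; then
    echo "default"
  else
    echo "workspace"
  fi
  # full-access is never returned: escalate by hand, per session.
}

profile_for_task no no    # prints "workspace"
profile_for_task yes no   # prints "default"
```

If a task fails under the profile this picks, that failure is information: either the task genuinely needs more, or the prompt asked for more than it should.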
The CLI loop in practice
Day-to-day, the loop looks like this:
- You're in a project. You run codex.
- You describe the task: "Add a search box to the header that filters the list view as I type."
- Codex plans the change, lists the files it intends to touch, and asks for confirmation.
- It edits files, runs your tests (if it found them), and reports back.
- You inspect the diff, accept or reject, iterate.
Useful slash commands inside the session:
- /model gpt-5.5: switch the active model.
- /reasoning high: bump the reasoning level for the next turn.
- /profile workspace: change the sandbox profile.
- /diff: show the working diff Codex has accumulated.
- /checkpoint: snapshot the current state; you can roll back to it.
- /history: replay recent commands and decisions.
- /agents: list the AGENTS.md context Codex is operating against.
How it differs from Claude Code
The honest comparison, from someone who uses both:
| Dimension | Codex CLI | Claude Code |
|---|---|---|
| Surface | Terminal-only as of May 2026; IDE extensions surface the cloud agent rather than the CLI | Terminal + VS Code + JetBrains + desktop app + web |
| Config file | AGENTS.md (cross-tool, also read by Cursor, Aider, others) | CLAUDE.md (Anthropic-only) |
| Plugin model | Skills (cloud); built-in tools in CLI | MCP servers, slash commands, subagents, hooks, plugins, skills |
| Sandbox model | Seatbelt / Windows Sandbox / WSL2 native, minimal config, very tight defaults | Permission prompts per-action; sandbox via OS native is opt-in via settings |
| Default model | GPT-5.5 | Sonnet 4.6, Opus 4.7 on opt-in extended thinking |
| Where it shines | Terminal-native tasks (scripts, system admin, DevOps); long sandboxed runs | Code review, complex refactors, multi-file architecture changes, hooks-based workflows |
| Benchmarks (May 2026) | Strong on Terminal-Bench 2.0 (~77%) | Leading on SWE-bench Verified with Opus 4.6 (~81%) |
Both tools are excellent. The differences are real but smaller than the marketing material suggests. Most engineers who use both end up keeping both: Claude Code for ambitious feature work and code review, Codex CLI for terminal automation and "I trust the sandbox enough to walk away" tasks.
Codex cloud (chatgpt.com/codex)
The cloud agent is the more transformative product, even if the CLI gets more attention. It's the first widely available "agent that opens PRs" that actually works for the everyday case.
The model
You connect a GitHub repository (or repositories) and give Codex a task (typically as an issue link or a prose description). Codex spins up a sandboxed cloud environment with your repo preloaded. It works through the task in the background. When it's done, you get a pull request with the changes, ready for review.
What's different from "let me autocomplete in my IDE":
- Tasks run in parallel. You can dispatch four tasks across four cloud sandboxes; you don't sit at your machine waiting for one to finish.
- Tasks have their own environment. Codex installs your dependencies, runs your tests, and iterates against them, all without touching your laptop.
- Tasks produce reviewable artifacts. The output is a PR, not a chat transcript. The diff is the deliverable.
- Tasks can be reviewed and re-run. If the PR isn't right, you leave a comment ("the test for X is wrong because…") and Codex iterates.
The dispatch loop
- From chatgpt.com/codex, the dashboard, an open GitHub issue, or your IDE extension, you create a task.
- Codex picks up the task, identifies the relevant files, and starts work in a fresh sandbox with your repo cloned.
- It iterates: write code, run tests, inspect failures, fix, retry. Most non-trivial tasks involve several internal iterations.
- When it converges (or gives up), it pushes a branch and opens a PR.
- The task card in your dashboard shows status badges: draft, open, merged, closed.
- You review the PR like any other PR. Comments → iteration. Merge → done.
Skills
One of Codex cloud's distinguishing features is Skills: reusable patterns the agent can apply across tasks like code understanding, prototyping, documentation, code review, and migrations. You can author Skills aligned with your team's standards (e.g. "follow our error-handling convention," "always add types in this style") and Codex will apply them across all tasks for that repo.
The mental model: Skills are to the cloud agent what AGENTS.md is to the CLI. Encoded team conventions that travel with the code, not with the engineer.
Where the cloud agent shines
- Routine PRs that aren't worth your attention: dependency bumps with non-trivial code changes, lint cleanups, test scaffolding, simple migrations.
- Bugs with a reproduction: file an issue with a failing test, dispatch to Codex, get a PR back.
- Documentation drift: "the README is out of date for these features" works well as a Codex task.
- Parallel exploration: dispatch four variations of a feature implementation; pick the one whose PR you like best.
- Things that need a long sandboxed run, like large refactors, codebase-wide rename operations, or multi-step investigations, where you don't want to babysit a CLI.
Where the cloud agent struggles
- Tasks with unclear acceptance criteria: if you can't write a failing test, the cloud agent guesses at what "done" means. The result is plausible but often wrong in subtle ways.
- Tasks requiring repository context the agent can't see: secrets, private services, custom build steps. The sandbox is isolated by design.
- UI work: the cloud sandbox has no browser to render your app. It writes the code; verifying it usually still requires a human running the change locally.
- Codebases with custom toolchains that aren't in the sandbox image. You can extend the image, but it's friction.
IDE extensions
Codex extensions exist for VS Code, JetBrains IDEs, and Cursor as of May 2026. The framing matters:
- The extensions are not "the CLI in an IDE." They're a UI for the cloud agent that happens to live in your editor.
- You see your task list, you can dispatch new tasks, you can review PRs in-IDE, and you can leave comments that Codex iterates on, all from your editor.
- Local edits in your editor still happen via the CLI (or via the cloud agent's PR review flow once a PR is up).
The mental model that works: extensions are a productivity layer for the cloud agent—not a replacement for the terminal CLI. If you want a tool that edits your local files in your editor, that's still Cursor or Claude Code or the underlying editor's own AI. Codex's IDE story is about coordinating cloud-agent work without leaving your editor.
Choosing between CLI and cloud (and Claude Code)
Most engineers using Codex seriously end up running both flavors plus Claude Code. The split:
Use the CLI when the task should happen on your machine.
Working with local services, reading files outside the repo, debugging something running on your laptop, generating code that needs to be tested with your environment, terminal automation. The CLI's tight sandbox plus Seatbelt/Windows Sandbox is the right primitive when "your machine" is the runtime.
Use the cloud agent when the task can run in a sandbox.
If your repo's tests run in CI, the cloud agent can run them too. Anything that's "fork the repo, make a change, get a PR" is a cloud task. Bonus when you can dispatch it and walk away. The cloud's serial latency doesn't bother you if you're not waiting.
Use Claude Code when you want a tightly collaborative loop.
For high-stakes refactors, code review, or complex architecture changes (anywhere you want to be in close dialogue with the agent), Claude Code's hooks, subagents, and IDE integrations have more breadth. Use Codex cloud when you want to delegate; use Claude Code when you want to collaborate.
For pure terminal automation, default to Codex CLI.
DevOps scripts, log analysis, "fix this Bash one-liner," shell-resident workflows. Both Claude Code and Codex CLI are excellent here, but Codex is the slightly better terminal-native operator on benchmarks and in practice.
Maintain one AGENTS.md and one CLAUDE.md per project.
If you're using multiple agents, write the cross-tool conventions in AGENTS.md (Codex, Cursor, Aider all read it) and Anthropic-specific things in CLAUDE.md. Don't duplicate. Link CLAUDE.md to read AGENTS.md too. Future-you will thank you when you switch tools.
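A CLAUDE.md that defers to AGENTS.md can be a few lines. The wording below is an example rather than a required syntax, and the subagent name is invented:

```shell
# Keep CLAUDE.md thin: point it at AGENTS.md and hold only
# Anthropic-specific notes. Contents are illustrative.
cat > CLAUDE.md <<'EOF'
Read AGENTS.md first; it holds the cross-tool conventions for this repo.

Claude-specific:
- Use the test-runner subagent for anything touching the test suite.
EOF
```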
Limits and pricing
Pricing
Codex is included with paid ChatGPT plans. As of May 2026:
- Plus ($20/mo): Codex CLI access, modest cloud-task quota, GPT-5.5 default.
- Pro ($100/mo): 5× cloud quota, priority on new features.
- Pro ($200/mo): ~unlimited cloud usage for individuals, 1M-context model on long tasks.
- Business / Enterprise: shared organizational quota, admin controls, audit logs, Skills shared across the org.
You can also pay per-token via the OpenAI API for the underlying models if you want to bypass the ChatGPT subscription model, but the agent loop is gated to the ChatGPT product, not raw API access.
Cloud task limits
- Concurrent tasks: most users get 4 simultaneous tasks. Pro $200 raises it.
- Per-task wall clock: 30 minutes default; longer on higher tiers.
- Repo size: ~2GB clone size; larger repos work but slower.
- Branch policy: Codex creates branches under codex/ by default; configurable.
- Network access: outbound HTTPS allowed; arbitrary inbound is not.
CLI limits
- Per-message token limits: model-dependent. The default GPT-5.5 has comfortable headroom for most tasks.
- Session memory: large but bounded; very long sessions degrade like any agent. Use /checkpoint + restart for long-running work.
- Rate limits: same 5-hour rolling window as ChatGPT proper. Voice mode in the CLI doesn't exist (voice is the consumer-app surface), so you're not double-spending.
Best practices
Commit before you run, every time.
The CLI's sandbox is good but not perfect, and a clean working tree before you start an agent run is the cheapest insurance you can buy. Treat git status being clean as the prerequisite to codex.
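One way to enforce this is a small wrapper you alias over codex. This is an illustration, not a built-in feature:

```shell
# Refuse to start an agent run on a dirty tree (illustrative wrapper,
# not part of the Codex CLI). Alias it: alias codex=codex_clean
codex_clean() {
  if [ -n "$(git status --porcelain)" ]; then
    echo "working tree not clean; commit or stash first" >&2
    return 1
  fi
  codex "$@"
}
```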
Write AGENTS.md as if a new engineer will read it.
Don't be terse. Don't be precious. Put the conventions, the test commands, the gotchas, the "if you ever need to do X, here's the path" notes. The agent reads it; you'll read it; the next person to use Codex on this repo will read it. AGENTS.md is the most underrated investment you can make in this tool.
For cloud tasks, write the issue like a spec.
Acceptance criteria. Test cases. Out-of-scope notes. Links to relevant code. The cloud agent reads the issue and produces a PR. Its quality is bounded by how clearly the issue was written. A 200-word issue with a failing test attached produces dramatically better PRs than a 30-word "fix the thing" prompt.
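For comparison, here's the shape of a well-specified task, written to a file for dispatch. The feature, file paths, and branch name are invented for illustration:

```shell
# An example issue body for a cloud task. Every specific below
# (feature, paths, branch) is made up to show the structure.
cat > issue.md <<'EOF'
## Fix: search box does not filter on backspace

Acceptance criteria:
- Deleting characters re-runs the filter (failing test attached below).
- No changes outside src/components/SearchBox.tsx and its test file.

Out of scope: styling, debounce tuning.

Failing test: test/searchbox.filter.test.tsx on branch repro/backspace.
EOF
```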
Review cloud PRs with elevated suspicion.
Codex's PRs are plausible. They compile. They pass the tests it ran. That doesn't mean they're correct in the way a careful human would be careful. Read the diff. Question changes that aren't in the issue's scope. If the PR refactors something nearby "while it was there," push back unless you asked for it.
Pin a model version when stability matters.
If you're scripting Codex CLI in CI or a shell pipeline, pin the model in the call. codex --model gpt-5.5 beats codex when the next OpenAI release flips the default and your output suddenly differs. Stability through pinning, not through hope.
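A minimal sketch of that pinning in a CI script. CODEX_MODEL is our own variable, not something the CLI defines, and the real invocation is left as a comment because it needs a live Codex login:

```shell
# Pin the model for scripted runs; override via env when you want
# to trial a newer model. CODEX_MODEL is our own convention.
CODEX_MODEL="${CODEX_MODEL:-gpt-5.5}"
echo "would run: codex --model $CODEX_MODEL"
# In the real pipeline:
#   codex --model "$CODEX_MODEL" "apply the lint autofixes"
```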
Don't escalate to full-access "just to make it work."
Almost every "I need full-access" instinct is solvable with a more specific permission grant or a manual command. Treat full-access as the last 1%, reserved for when you've ruled out the alternatives, not as the first thing you try when workspace mode fails.
Use checkpoints during long sessions.
/checkpoint is cheap. Use it before any operation that touches multiple files or runs a destructive-looking command. Rolling back to a checkpoint is two seconds; recovering from a session you didn't checkpoint is sometimes longer than starting over.
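If you want a belt-and-suspenders equivalent outside the session, a git tag before each risky run gives you the same rollback at the repo level. This helper is illustrative, not a Codex feature:

```shell
# A git-level companion to /checkpoint (illustrative; /checkpoint
# itself is internal to the Codex session). Tag the tree before a
# risky agent run so rollback is one command.
checkpoint_tag() {
  name="pre-codex-$(date +%Y%m%d-%H%M%S)"
  git tag "$name" && echo "$name"
}
# Roll back everything since the snapshot:
#   git reset --hard <tag-name>
```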
Treat cloud and CLI as different working modes, not different products.
A real workflow: dispatch a fix to the cloud agent at the start of a meeting; review the PR after the meeting; if the cloud got 80% there but missed an edge case, drop into the CLI locally on the branch and finish it. Both surfaces, same task, different stages.
Troubleshooting playbook
The patterns to recognize, with a short fix for each.
The CLI refuses to run a command or reach the network.
Always check the active permission profile first. Workspace mode forbids almost all network access; if your task needs to npm install a fresh dependency, you need at least the default profile. The error message tells you which profile to switch to.
A cloud task fails and the PR never appears.
Open the task in the dashboard and look at the run log. Every cloud task records its full transcript, including the commands it ran and the outputs. Most "mystery failures" are visible there; common causes:
- Tests requiring secrets the sandbox doesn't have.
- A custom build step (e.g. a Makefile target) that the agent didn't know to run.
- Network restrictions blocking a service the test suite needs.
The fix is usually to update AGENTS.md or the cloud task's runner config so the agent knows the right setup steps for the next run.
The PR is green but the tests changed.
This is the most common failure mode. Codex makes the test pass by changing the test, not the code under test. Read the test diffs explicitly. If a previously-passing test was modified, ask why. If the answer is "to make it green for this change," push back via a PR comment. Codex will iterate.
The installer fails or hangs.
On Windows, the most common cause is corporate antivirus or DLP blocking the installer's outbound HTTPS. Run from a different network or get the binary from your IT-managed package source. On macOS, Gatekeeper sometimes blocks the Homebrew tap on first install. Check that brew tap openai/tap succeeded.
You can't connect a repo to the cloud agent.
The GitHub App needs explicit per-repo grants from the repo's owner or an admin. Personal repos are easy; org repos with branch protection need an org admin to install the Codex GitHub App at the org level. Most "I can't connect" reports trace back to a missing org-level install permission.
The CLI hangs mid-command on macOS.
The sandbox can't see OS-level security prompts. If you're running a command that triggers Touch ID (e.g. some sudo-via-Touch-ID configurations), Codex pauses indefinitely waiting for input it can't see. Either complete the prompt manually, or run the command outside Codex.
The cloud agent and the CLI produce different results for the same prompt.
Different sandbox, different environment. The cloud's sandbox has a specific OS image, specific tool versions, specific package state. Your laptop has yours. The same prompt against the same model can produce different code if the agent senses different environments. This is a feature, not a bug. If you need parity, document the environment in AGENTS.md so both surfaces target the same shape.
Closing thought
Codex in 2026 isn't one tool. It's a pair, plus a third surface in your IDE that ties them together. The CLI is the one you reach for when you want to watch the agent work; the cloud agent is the one you reach for when you want to not watch. Both are useful, often in the same week, sometimes on the same task.
Use the CLI for tasks where presence is value. Use the cloud agent for tasks where absence is value. — TWD
The biggest mistake people make with Codex is using one mode for everything. CLI users dispatch a long-running cloud task and sit there waiting. Cloud users force a context-heavy debugging session through the PR-review loop. Both miss the point. The two surfaces exist because they map to two different relationships you have with a piece of work, and recognizing which relationship you're in is most of the practice.
For up-to-date model availability, sandbox profiles, and feature flags: developers.openai.com/codex/changelog. The pace is high; don't trust month-old documentation on specifics.