№ 05 · Learn With Darin · Field Guide
Codex: a practitioner’s field guide.
OpenAI's two coding agents under one name: the codex CLI in your terminal and the cloud agent at chatgpt.com/codex. What each one's good at, what it costs, and which tasks belong to which.
What "Codex" means in 2026
"Codex" has been three different products at OpenAI over five years. If you don't know which one someone is talking about, you'll guess wrong half the time. The current state:
- Codex CLI: a terminal coding agent, OpenAI's answer to Claude Code. You install it, you run codex, it edits files in your repo, it iterates on code. Local, sandboxed, runs against your laptop.
- Codex (cloud): a SaaS coding agent at chatgpt.com/codex. You connect a GitHub repo, give it a task, and it works in a cloud sandbox to produce a pull request. Distributed, parallel-capable, runs in OpenAI's environment.
- Codex IDE extensions: the same agent surfaced in VS Code, JetBrains, and Cursor, sharing your task list across surfaces. Same account, same models.
(The Codex API from 2021, the original code-completion model, was deprecated in 2023 and is gone. If you read an article that mentions "Codex" without a date, suspect the article.)
The thing to internalize: the CLI and the cloud agent are not competitors—they're collaborators. The CLI is for the work you want to watch happen on your machine. The cloud agent is for the work you want to outsource to a sandbox while you do something else. Knowing when to reach for which is the whole game.
The Codex CLI
If you've used Claude Code, the Codex CLI will feel familiar in shape: a terminal-resident agent that reads your repo, plans changes, edits files, and runs commands inside a sandbox. The vocabulary is different but the loop is the same. Prompt, plan, edit, verify, repeat.
Install & run
Codex CLI ships as a single binary on macOS, Windows, and Linux. Install via the OpenAI installer or via Homebrew on Mac. On Windows you have two paths: native PowerShell with the Windows sandbox, or WSL2 for a Linux-native environment. Both work; pick based on which one matches your existing dev setup.
```shell
# macOS / Linux
curl -sSL https://codex.openai.com/install.sh | sh

# macOS via Homebrew
brew install openai/tap/codex

# Windows (PowerShell)
iwr -useb https://codex.openai.com/install.ps1 | iex

# Verify
codex --version
codex login
```
The first codex invocation in a project does a one-time scan and prompts you to write an AGENTS.md. Just like Claude Code's CLAUDE.md, this is the project-level instructions file. Unlike CLAUDE.md, AGENTS.md is cross-tool. Codex, Cursor, Aider, and an increasing number of other agents all read the same file. If you've ever wanted one config file your tools agree on, this is the closest thing to it.
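If you'd rather seed the file yourself before that first run, a minimal AGENTS.md might look like the following. The contents are an illustration of the kind of instructions that belong there, not what codex generates verbatim:

```shell
# Seed a minimal AGENTS.md by hand (contents are an example, not
# generated output; adapt the conventions to your own repo).
cat > AGENTS.md <<'EOF'
# Agent instructions

- Run tests with `npm test` before declaring a task done.
- TypeScript strict mode is on; do not introduce `any`.
- Never edit files under `generated/`; they are build outputs.
EOF
```

The exact sections are up to you; the point is that every agent reading the repo gets the same ground rules from one file.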
The model picker
Codex CLI lets you switch models with /model. Defaults and options as of May 2026:
- GPT-5.5: the recommended frontier model for complex work. Default on most installs.
- GPT-5.4: slightly older, faster, cheaper per-token. Good for repetitive edits.
- GPT-5.3-Codex: code-tuned variant of GPT-5.3, lighter than 5.5 but still solid for everyday coding.
- Reasoning levels: for the frontier models you can set /reasoning low|medium|high; high tells the model to think longer before acting.
For most tasks, leave it on GPT-5.5 medium and don't worry about it. The reasoning-level lever matters when you're doing something genuinely novel: refactors that span the codebase, debugging that needs hypotheses, architectural changes. For tweaks, low or medium is faster.
Sandboxing
Codex CLI's sandbox is one of its strongest features. By default, every command Codex wants to run is constrained:
macOS
- Uses Apple's built-in Seatbelt framework, the same sandbox that powers App Sandbox.
- File access is scoped to the working directory by default; network is gated.
- Permission prompts surface inline in the terminal: "Codex wants to run X. Allow?"
- No additional install needed; Seatbelt is part of macOS.
Windows / Linux
- Windows: native Windows Sandbox when you're in PowerShell, or Linux sandbox when you're in WSL2.
- Linux: namespace + cgroup isolation similar to a container.
- The Windows Sandbox feature has to be enabled in Windows Features. Codex prompts you on first run if it's off.
- Network is gated by default in all profiles.
Permission profiles you can choose between (codex --profile):
- workspace: read/write within the working directory only, no network. The default and the safe choice.
- default: slightly broader access than workspace; can read system files but still gated.
- full-access: the escape hatch. Codex runs without the sandbox. Use only for tasks that genuinely need it (e.g. spinning up a subprocess that itself needs root). Codex prompts you every session before allowing this.
Treat full-access the way you'd treat sudo: minimum necessary, never as a default.
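The profile choice can be made mechanical. Here's a small hypothetical helper, not part of the CLI, that always picks the least-privileged profile for a task and deliberately never returns full-access:

```shell
# Hypothetical helper: map a task's needs to the least-privileged
# Codex profile. Profile names (workspace, default) come from the
# CLI's own options; the helper is an illustration, not a Codex feature.
profile_for_task() {
  local needs_network="$1" needs_system_read="$2"
  if [ "$needs_network" = "yes" ] || [ "$needs_system_read" = "yes" ]; then
    echo "default"
  else
    echo "workspace"
  fi
  # full-access is never returned: escalate by hand, per session.
}

profile_for_task no no    # prints "workspace"
profile_for_task yes no   # prints "default"
```

If a task fails under the profile this picks, that failure is information: either the task genuinely needs more, or the prompt asked for more than it should.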
The CLI loop in practice
Day-to-day, the loop looks like this:
- You're in a project. You run codex.
- You describe the task: "Add a search box to the header that filters the list view as I type."
- Codex plans the change, lists the files it intends to touch, and asks for confirmation.
- It edits files, runs your tests (if it found them), and reports back.
- You inspect the diff, accept or reject, iterate.
Useful slash commands inside the session:
- /model gpt-5.5: switch the active model.
- /reasoning high: bump the reasoning level for the next turn.
- /profile workspace: change the sandbox profile.
- /diff: show the working diff Codex has accumulated.
- /checkpoint: snapshot the current state; you can roll back to it.
- /history: replay recent commands and decisions.
- /agents: list the AGENTS.md context Codex is operating against.
How it differs from Claude Code
The honest comparison, from someone who uses both:
| Dimension | Codex CLI | Claude Code |
|---|---|---|
| Surface | Terminal-only as of May 2026; IDE extensions surface the cloud agent rather than the CLI | Terminal + VS Code + JetBrains + desktop app + web |
| Config file | AGENTS.md (cross-tool, also read by Cursor, Aider, others) | CLAUDE.md (Anthropic-only) |
| Plugin model | Skills (cloud); built-in tools in CLI | MCP servers, slash commands, subagents, hooks, plugins, skills |
| Sandbox model | Seatbelt / Windows Sandbox / WSL2 native, minimal config, very tight defaults | Permission prompts per-action; sandbox via OS native is opt-in via settings |
| Default model | GPT-5.5 | Sonnet 4.6, Opus 4.7 on opt-in extended thinking |
| Where it shines | Terminal-native tasks (scripts, system admin, DevOps); long sandboxed runs | Code review, complex refactors, multi-file architecture changes, hooks-based workflows |
| Benchmarks (May 2026) | Strong on Terminal-Bench 2.0 (~77%) | Leading on SWE-bench Verified with Opus 4.6 (~81%) |
Both tools are excellent. The differences are real but smaller than the marketing material suggests. Most engineers who use both end up keeping both: Claude Code for ambitious feature work and code review, Codex CLI for terminal automation and "I trust the sandbox enough to walk away" tasks.
Codex cloud (chatgpt.com/codex)
The cloud agent is the more transformative product, even if the CLI gets more attention. It's the first widely available "agent that opens PRs" that actually works for the everyday case.
The model
You connect a GitHub repository (or repositories) and give Codex a task (typically as an issue link or a prose description). Codex spins up a sandboxed cloud environment with your repo preloaded. It works through the task in the background. When it's done, you get a pull request with the changes, ready for review.
What's different from "let me autocomplete in my IDE":
- Tasks run in parallel. You can dispatch four tasks across four cloud sandboxes; you don't sit at your machine waiting for one to finish.
- Tasks have their own environment. Codex installs your dependencies, runs your tests, and iterates against them, all without touching your laptop.
- Tasks produce reviewable artifacts. The output is a PR, not a chat transcript. The diff is the deliverable.
- Tasks can be reviewed and re-run. If the PR isn't right, you leave a comment ("the test for X is wrong because…") and Codex iterates.
The dispatch loop
- From chatgpt.com/codex, the dashboard, an open GitHub issue, or your IDE extension, you create a task.
- Codex picks up the task, identifies the relevant files, and starts work in a fresh sandbox with your repo cloned.
- It iterates: write code, run tests, inspect failures, fix, retry. Most non-trivial tasks involve several internal iterations.
- When it converges (or gives up), it pushes a branch and opens a PR.
- The task card in your dashboard shows status badges: draft, open, merged, closed.
- You review the PR like any other PR. Comments → iteration. Merge → done.
Skills
One of Codex cloud's distinguishing features is Skills: reusable patterns the agent can apply across tasks like code understanding, prototyping, documentation, code review, and migrations. You can author Skills aligned with your team's standards (e.g. "follow our error-handling convention," "always add types in this style") and Codex will apply them across all tasks for that repo.
The mental model: Skills are to the cloud agent what AGENTS.md is to the CLI. Encoded team conventions that travel with the code, not with the engineer.
Where the cloud agent shines
- Routine PRs that aren't worth your attention: dependency bumps with non-trivial code changes, lint cleanups, test scaffolding, simple migrations.
- Bugs with a reproduction: file an issue with a failing test, dispatch to Codex, get a PR back.
- Documentation drift: "the README is out of date for these features" works well as a Codex task.
- Parallel exploration: dispatch four variations of a feature implementation; pick the one whose PR you like best.
- Things that need a long sandboxed run, like large refactors, codebase-wide rename operations, or multi-step investigations, where you don't want to babysit a CLI.
Where the cloud agent struggles
- Tasks with unclear acceptance criteria: if you can't write a failing test, the cloud agent guesses at what "done" means. The result is plausible but often wrong in subtle ways.
- Tasks requiring repository context the agent can't see: secrets, private services, custom build steps. The sandbox is isolated by design.
- UI work: the cloud sandbox has no browser to render your app. It writes the code; verifying it usually still requires a human running the change locally.
- Codebases with custom toolchains that aren't in the sandbox image. You can extend the image, but it's friction.
IDE extensions
Codex extensions exist for VS Code, JetBrains IDEs, and Cursor as of May 2026. The framing matters:
- The extensions are not "the CLI in an IDE." They're a UI for the cloud agent that happens to live in your editor.
- You see your task list, you can dispatch new tasks, you can review PRs in-IDE, and you can leave comments that Codex iterates on, all from your editor.
- Local edits in your editor still happen via the CLI (or via the cloud agent's PR review flow once a PR is up).
The mental model that works: extensions are a productivity layer for the cloud agent—not a replacement for the terminal CLI. If you want a tool that edits your local files in your editor, that's still Cursor or Claude Code or the underlying editor's own AI. Codex's IDE story is about coordinating cloud-agent work without leaving your editor.
Choosing between CLI and cloud (and Claude Code)
Most engineers using Codex seriously end up running both flavors plus Claude Code. The split:
Use the CLI when the task should happen on your machine.
Working with local services, reading files outside the repo, debugging something running on your laptop, generating code that needs to be tested with your environment, terminal automation. The CLI's tight sandbox plus Seatbelt/Windows Sandbox is the right primitive when "your machine" is the runtime.
Use the cloud agent when the task can run in a sandbox.
If your repo's tests run in CI, the cloud agent can run them too. Anything that's "fork the repo, make a change, get a PR" is a cloud task. Bonus when you can dispatch it and walk away. The cloud's serial latency doesn't bother you if you're not waiting.
Use Claude Code when you want a tightly collaborative loop.
For high-stakes refactors, code review, or complex architecture changes (anywhere you want to be in close dialogue with the agent), Claude Code's hooks, subagents, and IDE integrations have more breadth. Use Codex cloud when you want to delegate; use Claude Code when you want to collaborate.
For pure terminal automation, default to Codex CLI.
DevOps scripts, log analysis, "fix this Bash one-liner," shell-resident workflows. Both Claude Code and Codex CLI are excellent here, but Codex is the slightly better terminal-native operator on benchmarks and in practice.
Maintain one AGENTS.md and one CLAUDE.md per project.
If you're using multiple agents, write the cross-tool conventions in AGENTS.md (Codex, Cursor, Aider all read it) and Anthropic-specific things in CLAUDE.md. Don't duplicate. Link CLAUDE.md to read AGENTS.md too. Future-you will thank you when you switch tools.
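A CLAUDE.md that defers to AGENTS.md can be a few lines. The wording below is an example rather than a required syntax, and the subagent name is invented:

```shell
# Keep CLAUDE.md thin: point it at AGENTS.md and hold only
# Anthropic-specific notes. Contents are illustrative.
cat > CLAUDE.md <<'EOF'
Read AGENTS.md first; it holds the cross-tool conventions for this repo.

Claude-specific:
- Use the test-runner subagent for anything touching the test suite.
EOF
```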
Limits and pricing
Pricing
Codex is included with paid ChatGPT plans. As of May 2026:
- Plus ($20/mo): Codex CLI access, modest cloud-task quota, GPT-5.5 default.
- Pro ($100/mo): 5× cloud quota, priority on new features.
- Pro ($200/mo): ~unlimited cloud usage for individuals, 1M-context model on long tasks.
- Business / Enterprise: shared organizational quota, admin controls, audit logs, Skills shared across the org.
You can also pay per-token via the OpenAI API for the underlying models if you want to bypass the ChatGPT subscription model, but the agent loop is gated to the ChatGPT product, not raw API access.
Cloud task limits
- Concurrent tasks: most users get 4 simultaneous tasks. Pro $200 raises it.
- Per-task wall clock: 30 minutes default; longer on higher tiers.
- Repo size: ~2GB clone size; larger repos work but slower.
- Branch policy: Codex creates branches under codex/ by default; configurable.
- Network access: outbound HTTPS allowed; arbitrary inbound is not.
CLI limits
- Per-message token limits: model-dependent. The default GPT-5.5 has comfortable headroom for most tasks.
- Session memory: large but bounded; very long sessions degrade like any agent. Use /checkpoint + restart for long-running work.
- Rate limits: same 5-hour rolling window as ChatGPT proper. Voice mode in the CLI doesn't exist (voice is the consumer-app surface), so you're not double-spending.
Best practices
Commit before you run, every time.
The CLI's sandbox is good but not perfect, and a clean working tree before you start an agent run is the cheapest insurance you can buy. Treat git status being clean as the prerequisite to codex.
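One way to enforce this is a small wrapper you alias over codex. This is an illustration, not a built-in feature:

```shell
# Refuse to start an agent run on a dirty tree (illustrative wrapper,
# not part of the Codex CLI). Alias it: alias codex=codex_clean
codex_clean() {
  if [ -n "$(git status --porcelain)" ]; then
    echo "working tree not clean; commit or stash first" >&2
    return 1
  fi
  codex "$@"
}
```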
Write AGENTS.md as if a new engineer will read it.
Don't be terse. Don't be precious. Put the conventions, the test commands, the gotchas, the "if you ever need to do X, here's the path" notes. The agent reads it; you'll read it; the next person to use Codex on this repo will read it. AGENTS.md is the most underrated investment you can make in this tool.
For cloud tasks, write the issue like a spec.
Acceptance criteria. Test cases. Out-of-scope notes. Links to relevant code. The cloud agent reads the issue and produces a PR. Its quality is bounded by how clearly the issue was written. A 200-word issue with a failing test attached produces dramatically better PRs than a 30-word "fix the thing" prompt.
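For comparison, here's the shape of a well-specified task, written to a file for dispatch. The feature, file paths, and branch name are invented for illustration:

```shell
# An example issue body for a cloud task. Every specific below
# (feature, paths, branch) is made up to show the structure.
cat > issue.md <<'EOF'
## Fix: search box does not filter on backspace

Acceptance criteria:
- Deleting characters re-runs the filter (failing test attached below).
- No changes outside src/components/SearchBox.tsx and its test file.

Out of scope: styling, debounce tuning.

Failing test: test/searchbox.filter.test.tsx on branch repro/backspace.
EOF
```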
Review cloud PRs with elevated suspicion.
Codex's PRs are plausible. They compile. They pass the tests it ran. That doesn't mean they're correct in the way a careful human would be careful. Read the diff. Question changes that aren't in the issue's scope. If the PR refactors something nearby "while it was there," push back unless you asked for it.
Pin a model version when stability matters.
If you're scripting Codex CLI in CI or a shell pipeline, pin the model in the call. codex --model gpt-5.5 beats codex when the next OpenAI release flips the default and your output suddenly differs. Stability through pinning, not through hope.
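A minimal sketch of that pinning in a CI script. CODEX_MODEL is our own variable, not something the CLI defines, and the real invocation is left as a comment because it needs a live Codex login:

```shell
# Pin the model for scripted runs; override via env when you want
# to trial a newer model. CODEX_MODEL is our own convention.
CODEX_MODEL="${CODEX_MODEL:-gpt-5.5}"
echo "would run: codex --model $CODEX_MODEL"
# In the real pipeline:
#   codex --model "$CODEX_MODEL" "apply the lint autofixes"
```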
Don't escalate to full-access "just to make it work."
Almost every "I need full-access" instinct is solvable with a more specific permission grant or a manual command. Treat full-access as the last 1%, reserved for when you've ruled out the alternatives, not as the first thing you try when workspace mode fails.
Use checkpoints during long sessions.
/checkpoint is cheap. Use it before any operation that touches multiple files or runs a destructive-looking command. Rolling back to a checkpoint is two seconds; recovering from a session you didn't checkpoint is sometimes longer than starting over.
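If you want a belt-and-suspenders equivalent outside the session, a git tag before each risky run gives you the same rollback at the repo level. This helper is illustrative, not a Codex feature:

```shell
# A git-level companion to /checkpoint (illustrative; /checkpoint
# itself is internal to the Codex session). Tag the tree before a
# risky agent run so rollback is one command.
checkpoint_tag() {
  name="pre-codex-$(date +%Y%m%d-%H%M%S)"
  git tag "$name" && echo "$name"
}
# Roll back everything since the snapshot:
#   git reset --hard <tag-name>
```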
Treat cloud and CLI as different working modes, not different products.
A real workflow: dispatch a fix to the cloud agent at the start of a meeting; review the PR after the meeting; if the cloud got 80% there but missed an edge case, drop into the CLI locally on the branch and finish it. Both surfaces, same task, different stages.
Troubleshooting playbook
The patterns to recognize, with a short fix for each.
The CLI refuses to run a command or reach the network.
Always check the active permission profile first. Workspace mode forbids almost all network access; if your task needs to npm install a fresh dependency, you need at least the default profile. The error message tells you which profile to switch to.
A cloud task fails and the PR never appears.
Open the task in the dashboard and look at the run log. Every cloud task records its full transcript, including the commands it ran and the outputs. Most "mystery failures" are visible there; common causes:
- Tests requiring secrets the sandbox doesn't have.
- A custom build step (e.g. a Makefile target) that the agent didn't know to run.
- Network restrictions blocking a service the test suite needs.
The fix is usually to update AGENTS.md or the cloud task's runner config so the agent knows the right setup steps for the next run.
The PR is green but the tests changed.
This is the most common failure mode. Codex makes the test pass by changing the test, not the code under test. Read the test diffs explicitly. If a previously-passing test was modified, ask why. If the answer is "to make it green for this change," push back via a PR comment. Codex will iterate.
The installer fails or hangs.
On Windows, the most common cause is corporate antivirus or DLP blocking the installer's outbound HTTPS. Run from a different network or get the binary from your IT-managed package source. On macOS, Gatekeeper sometimes blocks the Homebrew tap on first install. Check that brew tap openai/tap succeeded.
You can't connect a repo to the cloud agent.
The GitHub App needs explicit per-repo grants from the repo's owner or an admin. Personal repos are easy; org repos with branch protection need an org admin to install the Codex GitHub App at the org level. Most "I can't connect" reports trace back to a missing org-level install permission.
The CLI hangs mid-command on macOS.
The sandbox can't see OS-level security prompts. If you're running a command that triggers Touch ID (e.g. some sudo-via-Touch-ID configurations), Codex pauses indefinitely waiting for input it can't see. Either complete the prompt manually, or run the command outside Codex.
The cloud agent and the CLI produce different results for the same prompt.
Different sandbox, different environment. The cloud's sandbox has a specific OS image, specific tool versions, specific package state. Your laptop has yours. The same prompt against the same model can produce different code if the agent senses different environments. This is a feature, not a bug. If you need parity, document the environment in AGENTS.md so both surfaces target the same shape.
Closing thought
Codex in 2026 isn't one tool. It's a pair, plus a third surface in your IDE that ties them together. The CLI is the one you reach for when you want to watch the agent work; the cloud agent is the one you reach for when you want to not watch. Both are useful, often in the same week, sometimes on the same task.
Use the CLI for tasks where presence is value. Use the cloud agent for tasks where absence is value. — TWD
The biggest mistake people make with Codex is using one mode for everything. CLI users dispatch a long-running cloud task and sit there waiting. Cloud users force a context-heavy debugging session through the PR-review loop. Both miss the point. The two surfaces exist because they map to two different relationships you have with a piece of work, and recognizing which relationship you're in is most of the practice.
For up-to-date model availability, sandbox profiles, and feature flags: developers.openai.com/codex/changelog. The pace is high; don't trust month-old documentation on specifics.