№ CertLearn With Darin · Study companion

Claude Certified Architect Foundations.

A study companion for Anthropic's first solution-architect exam. The concepts you have to truly understand, the patterns the questions reward, and a hands-on lab plan so you sit the exam with the muscle memory of someone who has actually built the thing.

Exam version 0.1 · Feb 2025 · ~30 min read · Plus a 10-hour lab plan
Format: Multiple choice, 4 options
Structure: 4 of 6 scenarios per form
Scoring: Scaled 100–1,000
Pass: 720
Penalty for guessing: None; answer everything
Target candidate: 6+ months hands-on
Part 01

How this exam thinks

Before any domain content, internalize the disposition the exam is testing. Almost every "best answer" question rewards the same handful of judgments. If you can recognize them, you can answer questions on topics you only half-remember.

The five judgments the exam keeps testing

  1. Deterministic over probabilistic when business logic depends on it. If a rule must hold (verify identity before refund, block refunds over $500, force a tool to run first), the right answer is a programmatic mechanism: a hook, a prerequisite gate, a forced tool_choice. The wrong answers are "improve the system prompt" and "add few-shot examples." Prompt-based compliance is probabilistic; financial and identity workflows can't tolerate that.
  2. Model-driven decisions over hand-rolled control flow. Loops should branch on stop_reason. Coordinators should decide which subagents to invoke. Don't parse natural-language signals to decide "is the agent done." Don't pre-route requests with a regex classifier when a tool description fix gets you 80% of the way there.
  3. Match the proportionate first response. When the question describes a problem and offers four fixes, the right answer is usually the smallest mechanism that addresses the root cause. Tool descriptions are minimal? Fix the descriptions. Escalation is wrong? Add explicit criteria with few-shot examples. Don't reach for a separate classifier model, ML pipeline, or higher-tier model when the cheap fix hasn't been tried.
  4. Subagents have isolated context. Always. Subagents do not inherit the coordinator's history. The coordinator must pass complete findings, structured metadata, and quality criteria explicitly in the subagent's prompt. Whenever an answer hinges on "the subagent will know X," verify X was passed in.
  5. Structure beats summary. When in doubt, the right answer is "preserve structured data": claim-source mappings, error categories with retryable flags, dates and provenance, conflicting values both annotated. The wrong answer is "summarize and pick one" or "return generic status."
Read every question through these five lenses before considering content knowledge. You will recognize the right answer faster than you can reason it from first principles.

The anti-patterns that are always wrong

The exam is generous: a handful of distractors are designed to be obviously wrong once you know the pattern. Burn these into memory:

  • Self-reported model confidence as a routing signal. The model is already overconfident on the cases it's getting wrong. Confidence scores from the model itself are not calibrated. (You can use field-level confidence calibrated against a labeled validation set; that's different.)
  • Sentiment analysis as a complexity signal. A frustrated customer with a simple problem and a calm customer with a complex one both exist. Sentiment doesn't predict case complexity.
  • "A bigger model / larger context window will fix it." Almost always wrong. Attention dilution, decomposition errors, and unclear criteria don't yield to more tokens.
  • "Suppress the error and return success." Catching a timeout and returning empty results as a successful query is always wrong. The coordinator loses the context it needs to recover.
  • "Terminate the entire workflow on any failure." Equally wrong from the other direction. Local recovery in subagents, structured propagation to the coordinator, partial results with coverage annotations.
  • Iteration caps and natural-language loop termination. An agentic loop terminates on stop_reason == "end_turn". Not when the assistant says "Done!" Not after N iterations as the primary stopping mechanism.
Part 02

The six scenarios

Every form draws four scenarios at random from the same six. Knowing them in advance is a real edge: when a question lands, you already have the system in your head and can focus on the specific judgment being tested.

Memorize these six setups

  1. Customer support resolution agent. Agent SDK. MCP tools: get_customer, lookup_order, process_refund, escalate_to_human. Target: 80%+ first-contact resolution. Domains: D1, D2, D5.
  2. Code generation with Claude Code. Custom slash commands, CLAUDE.md, plan mode vs direct execution. Domains: D3, D5.
  3. Multi-agent research system. Coordinator delegates to web search, document analysis, synthesis, and report generator subagents. Domains: D1, D2, D5.
  4. Developer productivity tools. Built-in Read/Write/Bash/Grep/Glob plus MCP. Domains: D2, D3, D1.
  5. Claude Code in CI/CD. Automated reviews, test generation, PR feedback. Non-interactive flag, JSON output. Domains: D3, D4.
  6. Structured data extraction. JSON schemas, validation, edge cases. Domains: D4, D5.

When a scenario opens, read the setup once and then go straight to the question. The setup is rarely a trick; the question is where the points live.

Part 03

Domain 1 · Agentic architecture & orchestration 27%

The biggest single block of the exam. Seven task statements, all about how an agent loop runs and how multi-agent systems coordinate. If you only have time to deeply prepare for one domain, prepare for this one.

The agentic loop, in one paragraph

You send a request to Claude. The response has a stop_reason. If it's "tool_use", you execute the requested tool, append the result to the conversation history, and send the next request. If it's "end_turn", you stop. That's the whole loop. The model decides when it's done, not your code.

Warning: Iteration caps are a safety net, not a primary termination mechanism. Parsing assistant text for "Done!" is an anti-pattern. Tool results live in the conversation history so the model can reason about them on the next iteration.
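The loop described above can be sketched in a few lines. This is a minimal sketch, not the scaffold's actual src/agent.py: run_agent_loop and scripted_client are hypothetical names, and create_message stands in for a call such as client.messages.create with model, max_tokens, and tools already bound. The scripted responses only mimic the fields the loop reads (stop_reason, content blocks with type, id, name, input, text).

```python
from types import SimpleNamespace
from typing import Any, Callable


def run_agent_loop(create_message: Callable[..., Any],
                   tool_handlers: dict,
                   user_text: str,
                   max_iterations: int = 25) -> str:
    """Run until the model reports stop_reason == "end_turn"."""
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_iterations):  # safety net only, not the stop condition
        response = create_message(messages=messages)
        # Keep the assistant turn in history so the next request sees it.
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            return "".join(b.text for b in response.content if b.type == "text")
        if response.stop_reason == "tool_use":
            results = []
            for block in response.content:
                if block.type == "tool_use":
                    output = tool_handlers[block.name](**block.input)
                    results.append({"type": "tool_result",
                                    "tool_use_id": block.id,
                                    "content": str(output)})
            # Tool results go back as a user turn; the model reasons over them next.
            messages.append({"role": "user", "content": results})
            continue
        raise RuntimeError(f"unhandled stop_reason: {response.stop_reason}")
    raise RuntimeError("iteration cap reached before end_turn")


def scripted_client(responses):
    """Hypothetical offline stand-in for the real API call."""
    queue = list(responses)

    def create(messages):
        return queue.pop(0)

    return create


demo_responses = [
    SimpleNamespace(stop_reason="tool_use", content=[
        SimpleNamespace(type="tool_use", id="tu_1", name="lookup_order",
                        input={"order_id": "ORD-1003"})]),
    SimpleNamespace(stop_reason="end_turn", content=[
        SimpleNamespace(type="text", text="Order ORD-1003 has shipped.")]),
]
reply = run_agent_loop(
    scripted_client(demo_responses),
    {"lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"}},
    "Where is ORD-1003?",
)
print(reply)
```

Notice that nothing in the loop inspects the assistant's text to decide whether to continue. The only branch points are the two stop_reason values, and the cap exists purely to stop a runaway loop.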

Coordinator–subagent: the rules you must know

  • Hub-and-spoke architecture. All inter-subagent communication routes through the coordinator. This isn't aesthetic; it's where observability and consistent error handling live.
  • Subagents have isolated context. They do not inherit the coordinator's conversation history. The coordinator must pass complete findings (including structured metadata like source URLs, document names, and dates) directly in the subagent's prompt.
  • Spawn parallel subagents in a single response. The coordinator emits multiple Task tool calls in one turn, not across separate turns. This is how you get the latency win.
  • allowedTools must include "Task" for a coordinator to spawn subagents. If the question describes "the coordinator can't delegate," check this first.
  • Coordinator decomposition is the most common failure mode. If subagents complete successfully but the final report misses topic areas, the coordinator's task decomposition was too narrow. Don't blame downstream agents.
  • Coordinator prompts specify goals and quality criteria, not procedural steps. "Find comprehensive coverage of X" beats "First do A, then do B, then do C." Subagents need room to adapt.
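Because subagents start with an empty context, the coordinator has to serialize everything the subagent needs into the prompt itself. The sketch below shows one way to do that; Finding and build_subagent_prompt are hypothetical names, and the section headings inside the prompt are an illustrative choice rather than a required format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Finding:
    claim: str
    source_url: str
    document_name: str
    published: str  # ISO date string


def build_subagent_prompt(goal: str,
                          findings: List[Finding],
                          quality_criteria: List[str]) -> str:
    """Pack goal, criteria, and complete findings into one prompt string."""
    criteria = "\n".join(f"- {c}" for c in quality_criteria)
    # Findings travel as structured JSON so provenance survives the handoff.
    payload = json.dumps([asdict(f) for f in findings], indent=2)
    return "\n\n".join([
        "GOAL\n" + goal,
        "QUALITY CRITERIA\n" + criteria,
        "FINDINGS SO FAR (complete, with provenance)\n" + payload,
    ])


prompt = build_subagent_prompt(
    goal="Synthesize the findings into a report section on battery recycling.",
    findings=[Finding(claim="Illustrative claim text",
                      source_url="https://example.com/report",
                      document_name="Example Report",
                      published="2024-06-01")],
    quality_criteria=["Cite a source for every claim",
                      "Flag topic areas with no supporting source"],
)
print(prompt)
```

The prompt states a goal and quality criteria and leaves the procedure to the subagent, which matches the last bullet above.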

Hooks vs prompts: the deterministic line

This is the single most-tested distinction in Domain 1. Memorize it cold:

Hooks Programmatic

For deterministic guarantees

  • Identity verification before financial action
  • Blocking refunds above a policy threshold
  • Forcing extract_metadata to run before enrichment
  • Normalizing heterogeneous data formats via PostToolUse
  • Anything where a non-zero failure rate is unacceptable

Prompts Probabilistic

For guidance and tone

  • Style, voice, helpfulness
  • Decomposition strategy
  • Soft preferences ("prefer concise responses")
  • Escalation criteria with few-shot examples
  • Anything where best-effort compliance is enough
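To make the left-hand column concrete, here is a minimal sketch of a programmatic gate for the refund scenario. The function name pre_tool_use, the session dict, and the {"allow": ...} return shape are assumptions for illustration (the return shape mirrors the permissive stub the lab scaffold is described as shipping); a real Agent SDK hook has its own signature.

```python
REFUND_LIMIT_USD = 500  # policy threshold taken from the scenario


def pre_tool_use(tool_name: str, tool_input: dict, session: dict) -> dict:
    """Decide in code, before the tool runs, whether the call may proceed."""
    if tool_name != "process_refund":
        return {"allow": True}
    # Prerequisite gate: identity must have been verified earlier in the session.
    if not session.get("identity_verified"):
        return {"allow": False,
                "reason": "Identity not verified; call get_customer first."}
    # Policy gate: refunds above the threshold never execute automatically.
    if tool_input.get("amount", 0) > REFUND_LIMIT_USD:
        return {"allow": False,
                "reason": "Refund exceeds the $500 policy limit; use escalate_to_human."}
    return {"allow": True}


print(pre_tool_use("process_refund", {"amount": 749}, {"identity_verified": True}))
print(pre_tool_use("process_refund", {"amount": 120}, {"identity_verified": False}))
print(pre_tool_use("process_refund", {"amount": 120}, {"identity_verified": True}))
```

The reason string matters as much as the block itself: it is fed back to the model so it can pick the next step instead of retrying the same call.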

Session management: resume vs fork vs fresh start

  • --resume <session-name> picks up a named session. Use when prior context is mostly still valid and you're continuing the same line of work.
  • fork_session creates an independent branch from a shared baseline. Use to explore divergent approaches (testing strategy A vs B, refactor approach 1 vs 2) from the same analysis.
  • Fresh start with an injected summary beats resuming when prior tool results are stale. After significant code modifications, telling a resumed session "by the way, file X changed in these specific ways" is better than expecting it to figure that out.

Decomposition strategy

  • Prompt chaining (fixed sequential pipeline) for predictable multi-aspect work. Example: per-file local review pass plus a cross-file integration pass.
  • Dynamic adaptive decomposition for open-ended investigation. Example: "add comprehensive tests to a legacy codebase" should map structure first, then prioritize, then expand subtasks as dependencies surface.
  • Multi-concern messages get decomposed in parallel. A customer with three issues becomes three concurrent investigations sharing context, not three sequential conversations.
Part 04

Domain 2 · Tool design & MCP integration 18%

Five task statements. The headline lesson: tool descriptions are the primary mechanism the model uses for tool selection. When two tools have minimal, similar descriptions, the model chooses badly. Half of Domain 2's right answers begin with "improve the descriptions."

What a good tool description contains

  • Purpose: what the tool does, in one sentence that distinguishes it from neighbors
  • Input format, including examples of valid inputs
  • Example queries that should call this tool
  • Edge cases and boundaries: when to use this tool vs a similar alternative
  • What the output looks like, so the model knows what to do with the result

If the question describes "agent picks the wrong tool" and the answer choices include "expand descriptions to include input formats, example queries, edge cases, and boundaries," that's the answer. Few-shot examples and routing classifiers are over-engineered for this problem.
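A description that covers all five elements might look like the sketch below. The name/description/input_schema layout is the tool definition shape the Messages API accepts; the wording of the description, the listed output fields, and the "ORD-" plus digits format rule are illustrative assumptions built around the lookup_order tool from Scenario 1.

```python
lookup_order_tool = {
    "name": "lookup_order",
    "description": (
        # Purpose, distinguished from the neighboring tool.
        "Retrieve a single order by its order ID. Use this for questions about "
        "one specific order; use get_customer for account-level questions. "
        # Input format with a valid example.
        "Input: an order ID in the form 'ORD-' followed by digits, for example ORD-1003. "
        # Example queries that should route here.
        "Example requests: 'Where is order ORD-1003?', 'I need a refund on ORD-1003'. "
        # Boundary: what to do when the input is missing.
        "If the customer has not given an order ID, do not guess one; "
        "call get_customer first and ask which order they mean. "
        # Output, so the model knows what it gets back.
        "Returns the order status, items, total, and purchase date."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order ID such as ORD-1003",
            },
        },
        "required": ["order_id"],
    },
}

print(len(lookup_order_tool["description"].split()), "words of description")
```

Compare that with a one-line description like "Looks up an order": the model has nothing to separate it from get_customer, which is exactly the failure the exam describes.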

Splitting vs consolidating tools

A generic analyze_document with overloaded behavior is a smell. Split it into extract_data_points, summarize_content, and verify_claim_against_source. Each gets a defined input/output contract. The exam consistently rewards splitting.

The opposite move (consolidating two tools into one polymorphic lookup_entity) is sometimes architecturally valid but is rarely the correct "first step" answer. The proportionate-response rule applies.

Structured error responses

Every MCP tool failure should return:

  • isError: true (the MCP flag)
  • errorCategory: transient / validation / business / permission
  • isRetryable: boolean
  • Human-readable description for the agent (and end-user via the agent)

Generic "Operation failed" responses are always wrong. They prevent the agent from making appropriate recovery decisions. A business-rule violation (refund exceeds policy) should be marked non-retryable with a customer-friendly explanation; a transient timeout should be marked retryable.

Common trap: the difference between an access failure (timeout, needs a retry decision) and a valid empty result (the query succeeded; there were just no matches). These need different return shapes. Conflating them is an exam-favorite distractor.
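The two return shapes can be sketched as a pair of helpers. This is a simplified payload, not the full MCP result envelope: isError is the MCP flag named above, while tool_error, tool_success, the message key, and resultCount are hypothetical names chosen for the example.

```python
from typing import List, Literal

ErrorCategory = Literal["transient", "validation", "business", "permission"]


def tool_error(category: ErrorCategory, message: str, retryable: bool) -> dict:
    """Failure shape: tells the agent what went wrong and whether to retry."""
    return {
        "isError": True,
        "errorCategory": category,
        "isRetryable": retryable,
        "message": message,
    }


def tool_success(results: List[dict]) -> dict:
    """Success shape: an empty list is a valid answer, not a failure."""
    return {"isError": False, "results": results, "resultCount": len(results)}


timeout = tool_error("transient", "Order service timed out after 10s.", True)
over_policy = tool_error(
    "business",
    "Refunds above $500 need review by a human agent.",
    False,
)
no_matches = tool_success([])

for payload in (timeout, over_policy, no_matches):
    print(payload)
```

A timeout and an empty result now look nothing alike to the agent: one carries isRetryable, the other carries resultCount of 0 with isError false.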

Tool distribution across agents

  • Tool count matters. Giving an agent 18 tools when 4–5 would do degrades selection reliability. Scope each agent's toolset to its role.
  • Cross-role tools are okay for high-frequency needs. A synthesis agent that constantly needs simple fact-checks gets a scoped verify_fact tool while complex verifications still route through the coordinator.
  • Replace generic with constrained. fetch_url becomes load_document with URL validation; the constrained version prevents misuse.

tool_choice: three modes, one common mistake

  • "auto": model chooses whether to call a tool or return text. The default.
  • "any": model must call some tool but picks which. Use to guarantee structured output when multiple extraction schemas exist.
  • {"type": "tool", "name": "..."}: forced selection of a specific tool. Use to ensure a particular extraction (like extract_metadata) runs before downstream enrichment.

The trap: people reach for forced selection when "any" would do. Forced selection is for ordering; "any" is for guaranteeing a structured response.
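The three modes map to three small dicts passed as the tool_choice parameter of a Messages API request. The helper below is a hypothetical convenience for choosing between them; only the returned dict shapes come from the API.

```python
from typing import Optional


def pick_tool_choice(must_call_tool: bool,
                     required_first_tool: Optional[str] = None) -> dict:
    """Return the tool_choice value that matches the guarantee you need."""
    if required_first_tool is not None:
        # Ordering guarantee: this specific tool runs on this turn.
        return {"type": "tool", "name": required_first_tool}
    if must_call_tool:
        # Structure guarantee: some tool is called, the model picks which.
        return {"type": "any"}
    # Default: the model may call a tool or answer in text.
    return {"type": "auto"}


print(pick_tool_choice(False))
print(pick_tool_choice(True))
print(pick_tool_choice(True, required_first_tool="extract_metadata"))
```

Reading the branches top to bottom restates the trap: reach for the named tool only when ordering is the requirement.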

MCP server scoping

  • Project-scoped .mcp.json: shared team tooling, version-controlled, with ${ENV_VAR} expansion for credentials.
  • User-scoped ~/.claude.json: personal experiments that don't belong in the team config.
  • Both load simultaneously; their tools are all available to the agent at once.
  • MCP resources (not tools) are for content catalogs (issue summaries, schemas, doc hierarchies) that reduce exploratory tool calls.

Built-in Claude Code tools

Know what each one is for:

  • Grep: content search. Finding callers, error strings, imports.
  • Glob: path pattern matching. **/*.test.tsx, file-by-extension.
  • Read: load full file contents.
  • Write: full file write. The fallback when Edit can't find unique anchor text.
  • Edit: targeted modification by unique text match. Fast but fails on ambiguous anchors.
  • Bash: shell execution.

The exam-relevant move: build understanding incrementally with Grep first, then follow imports with Read. Don't read every file upfront.

Part 05

Domain 3 · Claude Code configuration & workflows 20%

Six task statements. The headline lesson: configuration hierarchy and scoping decisions are the difference between "works on my machine" and "works for the team."

The configuration hierarchy

Location | Scope | Shared via git? | Use for
~/.claude/CLAUDE.md | User | No | Personal preferences only
.claude/CLAUDE.md or root CLAUDE.md | Project | Yes | Team coding standards
Subdirectory CLAUDE.md | Directory | Yes | Area-specific conventions
.claude/rules/*.md with paths: frontmatter | Glob-pattern | Yes | Conventions that span directories (e.g. all **/*.test.*)

The exam pattern: a question describes a "team member doesn't have the rules" symptom. The answer is "they're in user-scope; move to project-scope." Or: "test files are spread throughout the codebase and need consistent conventions." The answer is path-scoped rules with glob patterns, not a CLAUDE.md per directory.

Slash commands vs skills

  • Slash commands in .claude/commands/ (project) or ~/.claude/commands/ (user). For team-wide, version-controlled commands, project scope.
  • Skills in .claude/skills//SKILL.md with frontmatter:
    • context: fork: runs in an isolated sub-agent context, output doesn't pollute main conversation. Use for verbose discovery or brainstorming.
    • allowed-tools: restrict tool access during the skill (e.g., disallow destructive Bash).
    • argument-hint: prompt the developer for required parameters when the skill is invoked without args.
  • Skills vs CLAUDE.md. Skills are on-demand, task-specific. CLAUDE.md is always-loaded universal standards. Don't put skills' contents in CLAUDE.md, and don't put always-on rules in a skill.

Plan mode vs direct execution

Plan mode

  • Architectural decisions
  • Multi-file refactors (45+ files)
  • Library migrations
  • Multiple valid approaches
  • Open-ended investigation

Direct execution

  • Single-file bug fix with a clear stack trace
  • Adding one validation conditional
  • Well-understood scope
  • One reasonable implementation

The Explore subagent isolates verbose discovery and returns summaries to preserve main context. Use it when an exploration phase will dump pages of file contents the main agent doesn't need to keep around.

Iterative refinement: when to do what

  • Concrete input/output examples. The most effective fix when prose descriptions are interpreted inconsistently. Two or three example pairs.
  • Test-driven iteration. Write the test suite first, then iterate by sharing test failures.
  • Interview pattern. Have Claude ask clarifying questions before implementing in unfamiliar domains. Surfaces considerations you'd miss.
  • Single message vs sequential. When fixes interact (e.g., changing one schema affects three call-sites), put all issues in one message. When they're independent, sequential is fine.

Claude Code in CI/CD

  • -p (or --print): non-interactive mode. The fix when the CI job hangs waiting for input. Don't reach for CLAUDE_HEADLESS=true or --batch; those are not real flags.
  • --output-format json --json-schema: structured findings parseable for automated PR comments.
  • Independent review instances beat self-review. The same session that generated code is biased toward defending its decisions. A second instance without the generator's reasoning context catches more.
  • Per-file passes plus a cross-file integration pass. This is the fix for "single-pass review of 14 files gives shallow, inconsistent feedback." Attention dilution is the root cause; bigger context windows don't help.
  • Including prior review findings when re-running on new commits prevents duplicate comments. Instruct Claude to report only new or still-unaddressed issues.
  • CLAUDE.md is how CI gets project context: testing standards, fixture conventions, valuable-test criteria. Without it, generated tests are shallow.
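Several of these bullets combine naturally into one review plan: one non-interactive invocation per changed file, one integration pass, and prior findings included in every prompt. The sketch below only builds the argument lists and does not run anything. build_review_plan and the prompt wording are hypothetical; the only CLI flags used are -p and --output-format json, as named above.

```python
from typing import List


def build_review_plan(changed_files: List[str],
                      prior_findings: List[str]) -> List[List[str]]:
    """Return one argv per review pass: per-file passes, then one integration pass."""
    prior = "\n".join(f"- {f}" for f in prior_findings) or "- (none)"
    dedupe_note = ("Findings already reported on earlier commits are listed below. "
                   "Report only new or still-unaddressed issues.\n" + prior)

    def argv(prompt: str) -> List[str]:
        return ["claude", "-p", prompt, "--output-format", "json"]

    per_file = [
        argv(f"Review only {path} for local issues.\n{dedupe_note}")
        for path in changed_files
    ]
    integration = argv(
        "Review cross-file interactions among these files: "
        + ", ".join(changed_files) + ".\n" + dedupe_note
    )
    return per_file + [integration]


plan = build_review_plan(["api/orders.py", "api/refunds.py"],
                         ["api/orders.py: unused import"])
for command in plan:
    print(command[:2], "...", command[-2:])
```

In a CI job each argument list could be handed to subprocess.run and its stdout parsed as JSON; that step is left out here because it needs the CLI installed.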
Part 06

Domain 4 · Prompt engineering & structured output 20%

Six task statements. The headline lesson: specific criteria and few-shot examples beat vague guidance every time, and tool_use with JSON schemas is the only reliable path to structured output.

Explicit criteria over vague instructions

"Be conservative" and "only report high-confidence findings" do nothing for precision. The right shape is explicit categorical criteria with concrete code examples for each severity level. "Flag a comment only when the claimed behavior contradicts the actual code behavior" is the form the exam rewards.

Corollary: when false-positive rates are high in one category, you can temporarily disable that category to restore developer trust while you improve the prompt for it. Don't let one noisy category undermine confidence in the accurate ones.

Few-shot examples: when and how many

  • 2–4 targeted examples for ambiguous scenarios (not 10+).
  • Show the reasoning for why one action was chosen over plausible alternatives. Don't just show input → output; show the judgment.
  • Include examples that distinguish acceptable patterns from genuine issues. This is how you reduce false positives while keeping generalization.
  • Examples are most effective for: ambiguous tool selection, format consistency, varied document structures (inline citations vs bibliographies), and edge cases like null fields.

tool_use with JSON schemas: the only reliable path

Schema-compliant structured output requires tool_use. Free-form "respond in JSON format" prompts produce syntax errors. tool_use eliminates them.

  • Strict schemas eliminate syntax errors but not semantic errors (line items that don't sum to the total, values placed in the wrong field). Build separate validation for the semantic layer.
  • Required vs optional (nullable) fields matter. If a source document may not contain a piece of information, mark the field nullable. A required field tells the model to fabricate a value to satisfy the schema.
  • Enum + "other" + detail string is the right pattern for extensible categorization. Adding "unclear" as an enum value is the right pattern for ambiguous cases.
  • Format normalization rules live in the prompt alongside the strict schema. The schema doesn't normalize; the prompt does.
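The schema patterns in this list fit into one small tool definition. This is an illustrative sketch: extract_invoice and its field names are hypothetical, while the name/description/input_schema layout and the JSON Schema keywords (type, enum, required) are standard.

```python
extract_invoice_tool = {
    "name": "extract_invoice",
    "description": (
        "Record the fields extracted from one invoice. "
        # Normalization rules live in prose; the schema cannot enforce them.
        "Write dates as YYYY-MM-DD and amounts as plain numbers without currency symbols."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor_name": {"type": "string"},
            # Nullable and not required: the source may simply not contain it.
            "po_number": {
                "type": ["string", "null"],
                "description": "null when the document shows no PO number",
            },
            # Enum plus "other" plus a detail string; "unclear" covers ambiguity.
            "category": {
                "type": "string",
                "enum": ["hardware", "software", "services", "other", "unclear"],
            },
            "category_detail": {
                "type": ["string", "null"],
                "description": "Free-text label when category is 'other'",
            },
            "stated_total": {"type": "number"},
        },
        "required": ["vendor_name", "category", "stated_total"],
    },
}

print(extract_invoice_tool["input_schema"]["required"])
```

Leaving po_number out of required is the point of the second bullet: the model has a legitimate way to say "not present" instead of inventing a value.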

Validation, retry, feedback loops

  • Retry with the specific validation error appended to the prompt. The model corrects format/structural errors well when told what failed.
  • Retries don't help when the information is absent from the source. If a required field can't be filled because the source doc didn't include it, no amount of retrying creates the data. Mark the field nullable instead.
  • detected_pattern field. A structured-output trick. Tag each finding with the code construct that triggered it, so when developers dismiss findings you can analyze the dismissal patterns systematically.
  • Self-correction validators. Extract calculated_total alongside stated_total to flag discrepancies. Add conflict_detected booleans to mark inconsistent source data.
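A semantic check and the retry message it feeds can be sketched as two small functions. The names validate_extraction and retry_message and the line_items/amount field names are assumptions for illustration; stated_total follows the bullet above.

```python
from typing import List


def validate_extraction(doc: dict, tolerance: float = 0.01) -> List[str]:
    """Semantic check the schema cannot do: do line items add up to the total?"""
    errors = []
    calculated_total = round(sum(item["amount"] for item in doc["line_items"]), 2)
    if abs(calculated_total - doc["stated_total"]) > tolerance:
        errors.append(
            f"line_items sum to {calculated_total:.2f} "
            f"but stated_total is {doc['stated_total']:.2f}"
        )
    return errors


def retry_message(errors: List[str]) -> dict:
    """Build the follow-up user turn that names exactly what failed."""
    details = "\n".join(f"- {e}" for e in errors)
    return {
        "role": "user",
        "content": (
            "The previous extraction failed validation:\n" + details +
            "\nRe-read the document and correct only these fields. "
            "If the document itself is inconsistent, keep both values "
            "and set conflict_detected to true."
        ),
    }


extraction = {
    "line_items": [{"amount": 100.00}, {"amount": 49.50}],
    "stated_total": 159.50,
}
problems = validate_extraction(extraction)
print(problems)
if problems:
    print(retry_message(problems)["content"])
```

The retry prompt leaves room for the case where the source document really is inconsistent, so the model is not pushed to "fix" data that was wrong on the page.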

Batch processing: when it's right and when it's wrong

Workflow | API choice | Why
Blocking pre-merge check | Synchronous | Developer is waiting; up to 24h is unacceptable
Overnight technical debt report | Message Batches | 50% cost saving; latency tolerant
Weekly audit | Message Batches | Same
Real-time customer support agent | Synchronous | Multi-turn tool calling not supported in batch

The Message Batches API gives 50% cost savings and a processing window up to 24 hours with no latency SLA. It does not support multi-turn tool calling. Use custom_id to correlate request/response pairs and to identify failed documents for resubmission.
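The custom_id bookkeeping can be sketched without calling the API. Each batch request pairs a custom_id with the params of an ordinary Messages request. The helpers below are hypothetical, the model string is a placeholder you would replace, and results are shown as plain dicts for simplicity (the SDK returns objects carrying the same custom_id and result type).

```python
from typing import Dict, List


def build_batch_requests(documents: Dict[str, str],
                         model: str,
                         max_tokens: int = 1024) -> List[dict]:
    """One request per document, keyed by a custom_id you can trace back."""
    return [
        {
            "custom_id": doc_id,
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{
                    "role": "user",
                    "content": "List the technical debt in this file:\n" + text,
                }],
            },
        }
        for doc_id, text in documents.items()
    ]


def failed_ids(results: List[dict]) -> List[str]:
    """Collect the custom_ids whose result is anything other than succeeded."""
    return [r["custom_id"] for r in results if r["result"]["type"] != "succeeded"]


documents = {"doc-001": "def legacy(): pass", "doc-002": "def older(): pass"}
requests = build_batch_requests(documents, model="placeholder-model-name")

# Simulated results, as they might come back once the batch has ended.
simulated_results = [
    {"custom_id": "doc-001", "result": {"type": "succeeded"}},
    {"custom_id": "doc-002", "result": {"type": "errored"}},
]
to_resubmit = {doc_id: documents[doc_id] for doc_id in failed_ids(simulated_results)}
print(list(to_resubmit))
```

Resubmission then reuses build_batch_requests on to_resubmit, so only the failed documents are paid for twice.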

Multi-instance and multi-pass review

  • Self-review is biased. A model that just generated code retains reasoning context that makes it less likely to question its own decisions. Use a second instance.
  • Per-file passes + cross-file integration pass beats a single pass over many files.
  • Verification passes with self-reported confidence per finding enable calibrated review routing. The calibration belongs in the routing logic, not in trusting the score.
Part 07

Domain 5 · Context management & reliability 15%

Six task statements. The smallest domain by weight, but the one with the highest density of "anti-patterns that are always wrong" gotchas.

Context management for long interactions

  • Progressive summarization loses numbers. Amounts, dates, percentages, customer-stated expectations get vague-ified. Extract them into a persistent "case facts" block included in every prompt outside the summarized history.
  • Lost-in-the-middle. Models process the start and end of a long input more reliably than the middle. Put key findings summaries at the top of aggregated inputs; use explicit section headers for detailed results.
  • Trim verbose tool outputs. An order lookup with 40+ fields, of which only 5 are relevant to the current task, should be trimmed before accumulating in context.
  • Pass complete history. Subsequent API requests need the full conversation to maintain coherence. There's no server-side session.
  • Restructure upstream agents to return structured data (key facts, citations, relevance scores) instead of verbose content when downstream agents have limited context budgets.
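Two of these habits, trimming tool output and keeping a persistent case-facts block, are easy to sketch. All names here are hypothetical, and the order fields and fact values are illustrative only.

```python
from typing import Dict, Iterable

RELEVANT_ORDER_FIELDS = ("order_id", "status", "total", "purchase_date", "refund_eligible")


def trim_tool_output(payload: dict, keep: Iterable[str]) -> dict:
    """Keep only the fields the current task needs before they enter context."""
    return {key: payload[key] for key in keep if key in payload}


def render_case_facts(facts: Dict[str, str]) -> str:
    """Exact values live here, outside any summarized history."""
    lines = "\n".join(f"- {name}: {value}" for name, value in facts.items())
    return "CASE FACTS (exact values; never paraphrase)\n" + lines


def build_system_prompt(base_prompt: str, facts: Dict[str, str]) -> str:
    """Re-attach the case-facts block on every request."""
    return base_prompt + "\n\n" + render_case_facts(facts)


raw_order = {
    "order_id": "ORD-1004", "status": "delivered", "total": 749.00,
    "purchase_date": "2025-01-12", "refund_eligible": True,
    "warehouse_code": "W7", "carrier_route": "R-22", "internal_sku_map": {},
}
trimmed = trim_tool_output(raw_order, RELEVANT_ORDER_FIELDS)
system_prompt = build_system_prompt(
    "You are a customer support agent.",
    {"refund_requested": "$749.00", "customer_deadline": "2025-02-01"},
)
print(trimmed)
print(system_prompt)
```

However the history is later summarized, the amount and the deadline reach the model verbatim on every request.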

Escalation: the rules

  • Honor explicit customer requests for a human immediately. Don't first try to investigate. Don't "see if I can help with that." Just escalate.
  • Acknowledge frustration, then offer resolution if the issue is in scope. Escalate only if the customer reiterates the preference. (This is the nuanced case: explicit demand → escalate; implicit frustration with an in-scope issue → offer to help, then escalate if pushed.)
  • Escalate when policy is silent on the customer's specific situation (competitor price-matching when policy only addresses own-site adjustments).
  • Multiple matches → ask for more identifiers. Don't pick by heuristic.
  • Escalation criteria belong in the system prompt with few-shot examples. Not in a separate classifier model. Not in self-reported confidence. Not in sentiment analysis.
Always wrong: sentiment-based escalation routing, model self-reported confidence as the trigger, and separate ML classifiers when prompt optimization hasn't been tried.

Error propagation in multi-agent systems

  • Structured error context (failure type, attempted query, partial results, alternative approaches) enables intelligent coordinator recovery.
  • Distinguish access failures from valid empty results.
  • Local recovery in subagents for transient failures. Only propagate what can't be resolved locally, with what was attempted and any partial results.
  • Coverage-annotated synthesis output. Mark which findings are well-supported and which topic areas have gaps due to unavailable sources. Don't paper over gaps.
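Local recovery followed by structured propagation might look like the sketch below. TransientError, run_with_local_recovery, and the field names of the returned dict are hypothetical; the fields follow the first bullet (failure type, attempted query, partial results, alternatives).

```python
import time
from typing import Callable, List


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying locally."""


def run_with_local_recovery(search: Callable[[str], List[str]],
                            query: str,
                            attempts: int = 3,
                            delay_s: float = 0.0) -> dict:
    """Retry transient failures locally; propagate structured context otherwise."""
    tried = []
    for attempt in range(1, attempts + 1):
        try:
            # An empty list is a valid result and is returned as success.
            return {"status": "ok", "query": query, "results": search(query)}
        except TransientError as exc:
            tried.append(f"attempt {attempt}: {exc}")
            time.sleep(delay_s)
    return {
        "status": "error",
        "failure_type": "transient",
        "attempted_query": query,
        "attempts": tried,
        "partial_results": [],
        "alternatives": ["retry later", "try a narrower query", "use another source"],
    }


def always_times_out(query: str) -> List[str]:
    raise TransientError("search backend timed out")


print(run_with_local_recovery(lambda q: [], "battery recycling rates"))
print(run_with_local_recovery(always_times_out, "battery recycling rates"))
```

The first call shows a valid empty result; the second shows what the coordinator receives only after local retries are exhausted.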

Large-codebase exploration

  • Context degrades in long sessions. The model starts referencing "typical patterns" instead of the specific classes it found earlier. Counter with scratchpad files.
  • Subagent delegation isolates verbose discovery from the main coordination context.
  • Structured state exports (manifests) for crash recovery. Each agent dumps state to a known location; the coordinator loads the manifest on resume.
  • /compact reduces context usage when extended exploration has filled it.

Human review and confidence calibration

  • Aggregate accuracy hides per-segment failure. 97% overall can mask 60% on one document type. Stratify by document type and field before automating.
  • Stratified random sampling of high-confidence extractions catches novel error patterns over time.
  • Field-level confidence calibrated against a labeled validation set is the right routing signal. Raw model self-reports are not.
  • Route low-confidence and ambiguous-source extractions to human review. Reviewer capacity is finite; spend it on uncertainty.
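The first bullet is easy to verify with arithmetic. In the sketch below the counts are invented to reproduce the 97% and 60% figures above; accuracy_by_segment is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_segment(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Accuracy per document type from (segment, was_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [correct, seen]
    for segment, correct in records:
        totals[segment][0] += int(correct)
        totals[segment][1] += 1
    return {segment: c / n for segment, (c, n) in totals.items()}


# Illustrative labeled validation set: 1,000 documents across three types.
records = (
    [("invoice", True)] * 891 + [("invoice", False)] * 9
    + [("receipt", True)] * 49 + [("receipt", False)] * 1
    + [("handwritten", True)] * 30 + [("handwritten", False)] * 20
)

overall = sum(ok for _, ok in records) / len(records)
per_segment = accuracy_by_segment(records)
print(f"overall: {overall:.2%}")
for segment, accuracy in per_segment.items():
    print(f"{segment}: {accuracy:.2%}")
```

The handwritten segment is only 5% of the volume, which is why its 60% accuracy barely moves the overall number.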

Multi-source synthesis: provenance and conflicts

  • Claim-source mappings travel with findings through every stage of synthesis. Source URLs, document names, excerpts, dates.
  • Conflicting statistics: annotate with attribution. Don't pick one. Distinguish well-established findings from contested ones in the report.
  • Publication and collection dates are mandatory in structured outputs. Without them, temporal differences look like contradictions.
  • Render content types appropriately. Financial data as tables, news as prose, technical findings as structured lists. Don't force everything into one shape.
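Keeping both sides of a conflict, with attribution, can be sketched as a small grouping step. SourcedValue and synthesize are hypothetical names, and the metric, values, URLs, and dates are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SourcedValue:
    metric: str
    value: float
    source_url: str
    published: str  # ISO date; lets a reader tell drift from contradiction


def synthesize(values: List[SourcedValue]) -> Dict[str, dict]:
    """Group by metric; keep every value with attribution, never pick a winner."""
    by_metric = defaultdict(list)
    for item in values:
        by_metric[item.metric].append(item)
    report = {}
    for metric, group in by_metric.items():
        distinct = {item.value for item in group}
        report[metric] = {
            "status": "contested" if len(distinct) > 1 else "established",
            "values": [
                {"value": item.value, "source_url": item.source_url,
                 "published": item.published}
                for item in group
            ],
        }
    return report


report = synthesize([
    SourcedValue("recycling_rate_pct", 5.0, "https://example.com/a", "2019-03-01"),
    SourcedValue("recycling_rate_pct", 12.0, "https://example.com/b", "2024-09-15"),
    SourcedValue("plants_operating", 40.0, "https://example.com/a", "2019-03-01"),
])
for metric, entry in report.items():
    print(metric, entry["status"], len(entry["values"]))
```

With the dates attached, a reader can see that the two contested figures are five years apart, which is the case the third bullet warns about.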
Part 08

Question patterns the exam rewards

After working through the sample questions, a few grammars repeat. If you can recognize them, you can answer in 30 seconds instead of two minutes.

Pattern A · "Production data shows X is happening N% of the time. What change would most effectively address this?"

The right answer is the smallest mechanism that addresses the root cause. Read the cause carefully. Is it a missing prerequisite (programmatic hook), a tool-selection ambiguity (better descriptions), or an unclear decision boundary (explicit criteria with few-shot)? The wrong answers will be (a) heavier infrastructure than needed, (b) prompt-only fixes for problems that need determinism, or (c) addressing the wrong layer (tool availability when the problem is tool ordering).

Pattern B · "All subagents complete successfully but the output is wrong. Root cause?"

Almost always the coordinator's task decomposition was too narrow. Don't blame downstream agents that worked correctly within their assigned scope. Read the coordinator's logs in the question; they reveal the answer directly.

Pattern C · "Latency is too high; how do we reduce overhead?"

Look for the principle of least privilege. Give the agent that needs frequent simple lookups a scoped tool for the common case; keep complex cases routing through the coordinator. Don't over-provision (give it all the web search tools), don't speculatively cache, don't batch in a way that creates blocking dependencies.

Pattern D · "How should this failure flow back?"

Structured error context with failure type, attempted query, partial results, and alternatives. Generic statuses, suppressed errors marked as success, and workflow-terminating exceptions are all anti-patterns.

Pattern E · "Where should this configuration live?"

If it should reach every developer via git: project-scope (.claude/...). If it's personal: user-scope (~/.claude/...). If it applies to files spread across directories: path-scoped rules with glob patterns. If it should run on demand and not pollute context: a skill with context: fork.

Pattern F · "Single-pass review of N files is inconsistent."

Per-file passes for local issues + a separate cross-file integration pass. Not a bigger model, not consensus voting, not pushing the burden to developers.

Pattern G · "Use Message Batches API for both workflows?"

Match each workflow to the API. Batch for non-blocking, latency-tolerant. Synchronous for blocking. Never both with a "fallback if too slow." That's added complexity for no gain.

Part 09

The 10-hour lab plan

Reading the exam guide gets you maybe 60% of the way. The other 40% is the muscle memory that comes from actually building the things. This lab plan is structured to give you concrete reps on every domain in roughly ten hours of build time. It's organized as four labs that compose into one realistic system.

Don't just read these. Build them. The questions become almost trivial once you've debugged the failure modes yourself.

Fast pass for all three labs: the working scaffolds live at github.com/darindeters/claude-architect-labs. Clone the repo, pick a lab, run the boot command. Each lab boots in 30 seconds and the TODO blocks map exactly to the numbered steps below.

If you've never set up a Python project before, here's the universal boot sequence for every lab below:
git clone https://github.com/darindeters/claude-architect-labs
cd claude-architect-labs/lab1-cs-agent          # or lab2-research / lab4-extract

python3.12 -m venv .venv                        # 3.10+ required; macOS system Python is 3.9
source .venv/bin/activate                       # Windows: .venv\Scripts\activate
pip install -r requirements.txt                 # installs anthropic + pydantic + dotenv

cp .env.example .env                            # then open .env and paste your key
# OR: export ANTHROPIC_API_KEY=sk-ant-...       # one-shot, current shell only

python -m src.agent "I am Bob Singh (bob@example.com). I need a refund on order ORD-1003."
Get your API key at console.anthropic.com/settings/keys → "Create Key".
Five pitfalls that bite first-time runners. Each lab's README.md in the repo expands on these; read this once and they'll feel routine when they come up.
  1. Python 3.10 or newer. If you see TypeError: unsupported operand type(s) for |: 'type' and 'NoneType', you're on Python 3.9 (Apple's Command Line Tools default). The scaffolds use str | None union syntax from PEP 604. Fix: brew install python@3.12, then python3.12 -m venv .venv.
  2. Each invocation is one-shot, not a chat. The agent runs the loop to stop_reason == "end_turn", prints the final reply, and exits. If it asks for clarification, you don't type more — you re-run with a fuller prompt. Bundle identity + intent into the first message: python -m src.agent "I am Bob Singh (bob@example.com). I need a refund on ORD-1003." — not "I need a refund" by itself.
  3. The shell eats dollar signs in prompts. Inside double quotes, "Refund for $749" becomes "Refund for 49" because the shell expands $7 (an unset positional parameter) to an empty string. Use single quotes: python -m src.agent 'Refund ORD-1004 for $749.' Alternatively, escape it as \$749 inside double quotes.
  4. TODO stubs default to permissive. The hooks in lab1-cs-agent/src/hooks.py and the validator in lab4-extract/src/validate.py ship as stubs that return {"allow": True} / None. This is intentional — the first run shows the baseline (no enforcement); implementing the TODO is the lesson. If refunds aren't being blocked at $500, you haven't done recipe 4 yet.
  5. Tool traces print to stderr, replies to stdout. The → calling tool_name(...) lines that show what the agent's doing are on stderr. If you pipe output anywhere, redirect with 2>&1 or you'll lose the trace.

Lab 1: A customer support agent with a hook-enforced prerequisite

≈ 3 hours · domains 1, 2, 5

Goal: Build the customer support agent from Scenario 1. By the end, you will have written code that demonstrates every Domain 1 distinction the exam tests.

Fast pass

Skip the plumbing, keep the lessons

Working agent loop, four MCP-style tools with deliberately weak descriptions, hook placeholders, fixtures with five test customers, and a runnable python -m src.agent entry point. The TODOs in tools.py, hooks.py, and case_facts.py are exactly the steps below. Boots in 30 seconds.

Open Lab 1 on GitHub: git clone https://github.com/darindeters/claude-architect-labs && cd claude-architect-labs/lab1-cs-agent
Before you start
  • Account & budget. An Anthropic Console account with API access. Budget about $5; you'll burn most of it during the misroute-then-fix exercise in step 2.
  • Language: Python 3.10+. The scaffold and the code blocks below are Python; TypeScript notes are inline where the SDK shape differs meaningfully.
  • Install: if you're using the scaffold, just pip install -r requirements.txt from lab1-cs-agent/. If you're building from scratch, you only need:
    # Python (recommended)
    python -m venv .venv && source .venv/bin/activate
    pip install anthropic python-dotenv
    The plain anthropic client is enough — we deliberately don't use the higher-level Agent SDK here so the loop stays visible. (The Agent SDK is great for production; it just hides the mechanics you're being tested on.)
  • Auth: either export ANTHROPIC_API_KEY=sk-ant-... in the same shell, or copy .env.example to .env and put the key there (the scaffold loads it via python-dotenv). Don't put it in your code.
  • What's already scaffolded vs what you'll do: the repo gives you a working src/agent.py loop, four src/tools.py functions with deliberately weak descriptions, empty src/hooks.py stubs, an empty src/case_facts.py, and fixtures/customers.json with five customers and seven orders (including ones with multiple orders, no orders, an over-$500 amount, and one outside the refund window). The numbered recipes below show what to fill in — each TODO comment in the source names the step it corresponds to. If you're not using the scaffold, the recipes also serve as a from-scratch build guide; just create the files as they're introduced.
  • Reference docs: the Messages API for tool_use mechanics, the Agent SDK overview for the production shape (not used in this scaffold), and the MCP spec for the tool-definition shape. Skim, don't read end-to-end.
1
Scaffold the agent loop

Build a loop that sends a request, inspects stop_reason, executes any requested tools, appends results to history, and continues until stop_reason == "end_turn". Resist the urge to add an iteration cap as the primary stop condition. Add one as a safety net only. If you're using the scaffold: this whole recipe is already implemented in src/agent.py; read through it once, then run it. The numbered steps below describe each piece of that file so the structure is no longer a black box.

  1. Open (or create) src/agent.py and initialize the client. The anthropic package was installed in the prereqs. The Anthropic() constructor reads ANTHROPIC_API_KEY from the environment — you don't pass it in code, which keeps the key out of files you might accidentally commit. The scaffold calls load_dotenv() first so a .env file works too. The tools import points at tools.py, which you'll edit in recipe 2; comment it out for now if you're building from scratch.
    from anthropic import Anthropic
    # import tools   # uncomment after recipe 2
    client = Anthropic()
    If the first request fails with anthropic.AuthenticationError (or the client complains that it could not resolve an API key), your shell doesn't have a valid key. Run echo $ANTHROPIC_API_KEY; if blank, export it again.
  2. Define run_agent(user_message: str) as the entry function. Inside it, build the running messages list — this is the conversation history the model sees on every turn. It always starts with the user's input. Track iterations separately so the safety net in step 5 has a counter to check.
    def run_agent(user_message: str) -> str:
        messages = [{"role": "user", "content": user_message}]
        iterations = 0
    Keep messages mutable and append to it on every turn. The list is the agent's memory; don't recreate it.
  3. Open the agent loop and make the first API call. A while True: with explicit break conditions reads more clearly here than a bounded loop, because the model, not your code, decides when the work is done. max_tokens=4096 is the per-response cap; the agent can still run many responses. tools.TOOLS is the list of tool definitions from recipe 2; the getattr fallback passes an empty list until you've defined it. (If you commented out the tools import in step 1, pass tools=[] directly for now.)
    while True:
        resp = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            tools=getattr(tools, "TOOLS", []),
            messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})
    The assistant's response goes straight onto the messages list. On the next iteration the model will see its own previous turn — that's how multi-turn reasoning works under the hood.
  4. Branch on resp.stop_reason — this is the heart of the loop. Every Anthropic API response includes a stop_reason that tells you why the model paused. There are three branches worth handling explicitly. Use match in modern Python, or a chain of ifs.
    if resp.stop_reason == "end_turn":
        text_blocks = [b.text for b in resp.content if b.type == "text"]
        return "\n".join(text_blocks)
    
    elif resp.stop_reason == "tool_use":
        tool_results = []
        for block in resp.content:
            if block.type != "tool_use":
                continue
            result = tools.dispatch(block.name, block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result),  # JSON rather than a Python repr; add "import json" at the top
            })
        messages.append({"role": "user", "content": tool_results})
        continue
    
    else:
        print(f"unexpected stop_reason: {resp.stop_reason}")
        break
    • end_turn means the model is done talking — return the text and stop the loop.
    • tool_use means the model wants you to run one or more tools. The resp.content array can contain multiple tool_use blocks plus interleaved text blocks. For each tool call, dispatch it, then append one user-role message whose content is a list of tool_result entries keyed by the original tool_use_id. That ID is non-negotiable — the model uses it to know which result belongs to which call.
    • Anything else (max_tokens hit, stop_sequence, etc.) is rare but worth logging so you can recognize it during debugging.
  5. Add the iteration safety net at the top of the loop. This is a circuit breaker, not the stop condition. The actual stop condition is stop_reason == "end_turn". If you make the iteration cap the primary stop, you'll silently truncate legitimate long agent runs and won't notice until production. The safety net exists so a bug — a tool that loops, a misconfigured prompt — can't drain your API credit overnight.
    iterations += 1
    if iterations > 30:
        raise RuntimeError(f"agent runaway after {iterations} iterations")
    If you hit this in normal operation, something is wrong: a tool returning the same error on every call, a prompt that says "keep trying," or a dispatch that swallows results.
  6. Add the CLI entry point so you can run the agent. The scaffold runs as a module (python -m src.agent "...") because the code lives inside the src/ package and imports its siblings with relative paths. The __main__ guard means importing agent.py from another file won't trigger this code.
    # src/agent.py
    import sys  # at the top of the file, next to the other imports

    def main():
        if len(sys.argv) < 2:
            print("Usage: python -m src.agent '<customer message>'")
            sys.exit(1)
        print(run_agent(sys.argv[1]))
    
    if __name__ == "__main__":
        main()
    Run it from the lab1-cs-agent/ directory with the venv active: python -m src.agent "hello". If you see ModuleNotFoundError: No module named 'src', you're in the wrong directory — cd back up to lab1-cs-agent/.
  7. Verify the loop works. Run the canonical first prompt — identity and order packed in:
    python -m src.agent "I am Bob Singh (bob@example.com). I need a refund on order ORD-1003."
    Expected output (stderr trace + stdout reply):
      → calling get_customer({'name': 'Bob Singh', 'email': 'bob@example.com'})
      → calling lookup_order({'order_id': 'ORD-1003'})
      → calling process_refund({'order_id': 'ORD-1003', 'amount': 119.99, 'customer_id': 'CUS-002'})
    Great news! Your refund has been successfully processed.
    - Refund ID: RF-ORD-1003
    - Amount: $119.99
    - Order: ORD-1003
    Three tool calls, one assistant reply, back at the shell. The reply phrasing varies run-to-run; the tool sequence is the part to confirm. Common first-run miscues:
    • "Could you please provide your name and email?" → You didn't bundle identity into the prompt. The agent isn't a chat — re-run with the full prompt above.
    • Model apologizes that it can't look up customers → TOOLS isn't reaching messages.create. Check the tools= argument in src/agent.py.
    • Hits the 30-iteration safety cap → Your dispatch is returning errors the model keeps trying to recover from. Print the tool results to see what the model is seeing. Don't raise the cap; fix the dispatch.
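The branches from steps 3 to 5 can be exercised offline, without spending API credit, by swapping the client for a scripted stand-in. This is a minimal sketch under stated assumptions: Block, Response, and run_loop are hypothetical names that only mimic the attributes the loop reads (stop_reason, content, type, text, id, name, input). They are not SDK classes, and unlike the recipe's print-and-break, this version raises on an unexpected stop_reason.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Block:
    """Stand-in for one content block (text or tool_use)."""
    type: str
    text: str = ""
    id: str = ""
    name: str = ""
    input: dict = field(default_factory=dict)


@dataclass
class Response:
    """Stand-in for an API response: only the fields the loop inspects."""
    stop_reason: str
    content: list


def run_loop(create, dispatch, user_message, max_iterations=30):
    messages = [{"role": "user", "content": user_message}]
    iterations = 0
    while True:
        # Circuit breaker only; the real stop condition is end_turn.
        iterations += 1
        if iterations > max_iterations:
            raise RuntimeError(f"agent runaway after {iterations} iterations")
        resp = create(messages)
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason == "end_turn":
            return "\n".join(b.text for b in resp.content if b.type == "text")
        if resp.stop_reason == "tool_use":
            # One user message carrying a tool_result per tool_use block.
            results = [
                {"type": "tool_result", "tool_use_id": b.id,
                 "content": json.dumps(dispatch(b.name, b.input))}
                for b in resp.content if b.type == "tool_use"
            ]
            messages.append({"role": "user", "content": results})
            continue
        raise RuntimeError(f"unexpected stop_reason: {resp.stop_reason}")


# Scripted conversation: one tool call, then a final answer.
script = iter([
    Response("tool_use", [Block("tool_use", id="t1", name="lookup_order",
                                input={"order_id": "ORD-1003"})]),
    Response("end_turn", [Block("text", text="Order found.")]),
])
calls = []


def fake_create(messages):
    return next(script)


def fake_dispatch(name, tool_input):
    calls.append(name)
    return {"order_id": tool_input["order_id"]}


reply = run_loop(fake_create, fake_dispatch, "Where is ORD-1003?")
```

Because create and dispatch are injected, the same run_loop can later be pointed at the real client by passing a function that wraps client.messages.create.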
2
Implement four MCP tools

get_customer, lookup_order, process_refund, escalate_to_human. Write deliberately vague descriptions on the first pass and observe the agent misroute. Then expand each description with input formats, example queries, and explicit "use this when…" / "don't use this when…" boundaries. The improvement is the lesson.

  1. Inspect the fixture data. The scaffold ships fixtures in fixtures/customers.json rather than as an inline Python dict, but the shape is the same idea. Open it and note the planted variety: customer CUS-003 (Carmen Diaz) has three orders for the disambiguation case, CUS-004 (David Park) has zero orders for the no-orders path, ORD-1004 is $749 (well above the $500 threshold, so it triggers the recipe-4 hook), and both ORD-1002 (32 days old) and ORD-1006 (95 days old) fall outside the 30-day refund window for recipe 5's business-rule failure. The customer name Bob Singh appears twice, as CUS-002 and CUS-005, for the recipe-2 ambiguous-match path. If you're building from scratch, plant the same variety: at least one customer with multiple orders, one with zero, one whose largest order is > $500, one order outside the refund window, and one duplicate-name pair.
    // fixtures/customers.json (excerpt)
    {
      "customers": [
        { "id": "CUS-001", "name": "Alice Chen", "email": "...", "orders": ["ORD-1001", "ORD-1002"] },
        { "id": "CUS-002", "name": "Bob Singh", ... },
        { "id": "CUS-003", "name": "Carmen Diaz", "orders": ["ORD-1004", "ORD-1005", "ORD-1006"] },
        { "id": "CUS-004", "name": "David Park", "orders": [] },
        { "id": "CUS-005", "name": "Bob Singh", ... }   // intentional duplicate name
      ],
      "orders": {
        "ORD-1004": { "customer_id": "CUS-003", "amount": 749.00, "days_since_delivery": 3 },
        "ORD-1006": { "customer_id": "CUS-003", "amount": 220.00, "days_since_delivery": 95 },
        ...
      },
      "policies": { "refund_window_days": 30, "auto_approve_max": 500 }
    }
    This is a JSON file loaded once at import time, not a database. Don't waste time wiring SQLite; the exam is about tool design, not persistence.
  2. Write the four tool functions as plain Python. No SDK plumbing yet — just functions that take args and return dicts. The agent loop in recipe 1 calls these through a TOOL_HANDLERS lookup table, which you'll wire up in the next sub-step. Keep return shapes consistent: always a dict, never a string or a tuple, so the loop can serialize uniformly with json.dumps. Note that get_customer takes name or email as separate keyword args (matching the input schema below) — this lets the model fill in just one without an awkward "name_or_email" string.
    def get_customer(name: str | None = None, email: str | None = None) -> dict:
        if not name and not email:
            return error_response("validation", False, "Provide a name or an email.")
        # Every identifier supplied must match, so name + email narrows a duplicate name.
        matches = [c for c in FIXTURES["customers"]
                   if (not name or c["name"].lower() == name.lower())
                   and (not email or c["email"].lower() == email.lower())]
        if not matches:
            return error_response("validation", False, "No customer found.")
        if len(matches) > 1:
            return {"ambiguous": True, "matches": [...], "message": "Ask for another identifier."}
        return matches[0]
    
    def lookup_order(order_id: str) -> dict:
        o = FIXTURES["orders"].get(order_id)
        if not o:
            return error_response("validation", False, f"No order {order_id}.")
        return {"order_id": order_id, **o}
    
    def process_refund(order_id: str, amount: float, customer_id: str) -> dict:
        # ... permission + refund-window checks; returns ok or error_response("business", ...) ...
        return {"ok": True, "refund_id": f"RF-{order_id}", "amount_refunded": amount}
    
    def escalate_to_human(customer_id: str = "", summary: str = "", recommended_action: str = "") -> dict:
        return {"ok": True, "ticket_id": f"ESC-{int(time.time())}", "handoff": {...}}
    The scaffold's full versions are in src/tools.py; read them once and notice the ambiguous-match path on get_customer (returns both candidates for the model to disambiguate) and the permission + refund-window checks on process_refund (the deterministic-policy lessons that come back in recipes 4 and 5).
  3. Wire the dispatcher. The agent loop looks tools up in a flat dict and calls them with **tool_input. The scaffold's wrapper (call_tool in src/agent.py) also runs the hook chain and tracks the verified-customer state — but the lookup itself is just a dict:
    # src/tools.py
    TOOL_HANDLERS = {
        "get_customer": get_customer,
        "lookup_order": lookup_order,
        "process_refund": process_refund,
        "escalate_to_human": escalate_to_human,
    }
    
    # src/agent.py — inside the loop
    result = call_tool(state, block.name, dict(block.input))
    The **tool_input unpacks the dict into keyword arguments, so make sure your function signatures match the schemas you're about to write. (If get_customer's schema declares name and email but the function only accepts name_or_email, you'll get a TypeError the first time the model tries to call it.)
  4. Inspect the TOOLS list — first pass, deliberately vague. The scaffold's src/tools.py already has the vague first-pass descriptions for you to rewrite in step 6. The point of the bad first pass is to see how the agent misroutes when descriptions are loose, so you internalize why tight descriptions matter. Resist the urge to write a good description right now.
    TOOLS = [
        {
            "name": "get_customer",
            "description": "Retrieves customer information.",          # TODO (step 2)
            "input_schema": {
                "type": "object",
                "properties": {
                    "name":  {"type": "string"},
                    "email": {"type": "string"},
                },
            },
        },
        {
            "name": "lookup_order",
            "description": "Retrieves order details.",                  # TODO (step 2)
            "input_schema": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
        {
            "name": "process_refund",
            "description": "Processes a refund.",                       # TODO (step 2)
            "input_schema": {
                "type": "object",
                "properties": {
                    "order_id":   {"type": "string"},
                    "amount":     {"type": "number"},
                    "customer_id":{"type": "string"},
                },
                "required": ["order_id", "amount", "customer_id"],
            },
        },
        {
            "name": "escalate_to_human",
            "description": "Escalates the case to a human agent.",      # TODO (step 2)
            "input_schema": {
                "type": "object",
                "properties": {
                    "customer_id":        {"type": "string"},
                    "summary":            {"type": "string"},
                    "recommended_action": {"type": "string"},
                },
                "required": ["summary"],
            },
        },
    ]
    Note process_refund requires customer_id — the schema makes it impossible for the model to issue a refund without naming who it's for. That's a tiny piece of the deterministic-policy story you'll formalize in recipe 3.
  5. Run the misroute scenarios. With vague descriptions in place, run each of these and watch what the agent does. The scaffold already logs each tool call to stderr; if you're building from scratch, add a print(resp.content) inside your loop temporarily so you can see each turn.
    • python -m src.agent "I want a refund on order ORD-1003" — observe whether the agent verifies the customer first or jumps straight to process_refund. Vague descriptions tend to make it skip verification.
    • python -m src.agent "my account is locked" — observe whether it correctly calls escalate_to_human, or whether it tries get_customer first because "Retrieves customer information" sounds related.
    Write down what each scenario did. These are the misroutes you're about to fix.
  6. Second pass: rewrite every description tightly. Each description gets four things: what the tool does, the input format with an example, when to use it, and when NOT to use it. The "when NOT" line is doing 80% of the work — it stops the model from reaching for the wrong tool. Replace your vague entries:
    {
      "name": "process_refund",
      "description": (
        "Process a refund for a verified order. "
        "Input: order_id (string, e.g. 'ORD-1003'), amount (number, e.g. 119.99), "
        "and customer_id (string, e.g. 'CUS-002'). "
        "Use ONLY after both get_customer AND lookup_order have returned data for "
        "this case in the current conversation. "
        "Do NOT use for: refunds over $500 (use escalate_to_human instead), "
        "orders delivered more than 30 days ago, or before customer identity is verified."
      ),
      "input_schema": { ... unchanged ... },
    }
    Apply the same pattern to all four. Each description becomes 3-6 lines. Yes, it feels verbose. The verbosity is the feature.
  7. Re-run the same two scenarios. Scenario A should now call get_customer → lookup_order → process_refund in that order. Scenario B should call escalate_to_human first, since the description now explicitly names locked-account cases as in-scope.
  8. Verify with the side-by-side. Write down: scenario, what pass 1 did, what pass 2 did. You should see at least one route change. This diff is the answer to every "what would most effectively address tool-misroute issues" exam question — better tool descriptions, with explicit usage boundaries, beat better system prompts almost every time. If pass 2 still misroutes, your description for the misrouting tool is still leaving a door open; tighten the "do NOT use" clause.
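Sub-step 3 warns that a schema/signature mismatch only surfaces as a TypeError the first time the model calls the tool. That check can be run up front. A minimal sketch, assuming handlers take plain keyword parameters (no *args or **kwargs); schema_mismatches is a hypothetical helper, not something the scaffold provides:

```python
import inspect


def schema_mismatches(tools: list, handlers: dict) -> list:
    """List places where a tool schema and its handler signature disagree."""
    problems = []
    for tool in tools:
        name = tool["name"]
        handler = handlers.get(name)
        if handler is None:
            problems.append(f"{name}: no handler registered")
            continue
        schema = tool["input_schema"]
        declared = set(schema.get("properties", {}))
        required = set(schema.get("required", []))
        params = inspect.signature(handler).parameters
        # A declared property the function can't accept raises TypeError on call.
        for prop in sorted(declared - set(params)):
            problems.append(
                f"{name}: schema declares '{prop}' but the function has no such parameter"
            )
        # A parameter with no default must be required, or the model may omit it.
        for pname, param in params.items():
            if param.default is inspect.Parameter.empty and pname not in required:
                problems.append(
                    f"{name}: parameter '{pname}' has no default but is not required in the schema"
                )
    return problems


# Deliberately mismatched handler: the schema says name/email, the function doesn't.
def get_customer(name_or_email: str) -> dict:
    return {}


def lookup_order(order_id: str) -> dict:
    return {}


BAD_TOOLS = [{
    "name": "get_customer",
    "input_schema": {"type": "object",
                     "properties": {"name": {"type": "string"},
                                    "email": {"type": "string"}}},
}]
GOOD_TOOLS = [{
    "name": "lookup_order",
    "input_schema": {"type": "object",
                     "properties": {"order_id": {"type": "string"}},
                     "required": ["order_id"]},
}]

bad = schema_mismatches(BAD_TOOLS, {"get_customer": get_customer})
good = schema_mismatches(GOOD_TOOLS, {"lookup_order": lookup_order})
```

Running this once after editing descriptions or signatures is cheaper than discovering the mismatch mid-conversation.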
3
Add a programmatic prerequisite

Block process_refund until get_customer has returned a verified customer ID. Test that the agent can't bypass it even with a system prompt that says "skip verification." This is the deterministic-vs-probabilistic distinction.

  1. Introduce per-conversation state. The hook needs to know whether the agent has already verified a customer earlier in the conversation. Add a state dict at the start of run_agent — before the while True: loop. Pass it into every dispatch call. The state lives for one agent run; you're not building a persistent session store.
    def run_agent(user_message: str) -> str:
        state = {"verified_customer_id": None}
        messages = [{"role": "user", "content": user_message}]
        iterations = 0
        # ... loop below ...
    If you forget this and use a global, the second test run will see verified-customer state from the first run — that's a particularly nasty bug because it works in isolation but fails in batch.
  2. Update the dispatch wrapper to mutate state when verification succeeds. After a successful get_customer call, stash the customer ID. The scaffold does this in call_tool inside src/agent.py:
    def call_tool(state: dict, name: str, tool_input: dict):
        name, tool_input, note = run_pre_tool_hooks(state, name, tool_input)
        handler = TOOL_HANDLERS.get(name)
        result = handler(**tool_input)
        if name == "get_customer" and isinstance(result, dict) and result.get("id"):
            state["verified_customer_id"] = result["id"]
        return result
    The hook chain runs before the handler, so it can short-circuit by redirecting to a different tool (see next sub-step). The state mutation runs after, only on a successful identification.
  3. Implement verify_customer_first in src/hooks.py. A hook is a function that runs before a tool call and decides what happens next. The scaffold's hook contract has three possible return shapes:
    # Let the call through unchanged
    {"allow": True}
    
    # Let it through with modified input
    {"allow": True, "transform": {...new input...}}
    
    # Replace with a different tool call entirely
    {"allow": False, "redirect": {
        "name": "get_customer",
        "input": {},
        "system_note": "Customer must be verified before order operations.",
    }}
    For the prerequisite hook, redirect lookup_order and process_refund to get_customer whenever the verified customer ID is missing:
    # src/hooks.py
    def verify_customer_first(state, tool_name, tool_input):
        if tool_name in {"lookup_order", "process_refund"} and not state.get("verified_customer_id"):
            return {
                "allow": False,
                "redirect": {
                    "name": "get_customer",
                    "input": {},
                    "system_note": "Customer must be verified before order operations.",
                },
            }
        return {"allow": True}
    Why a hook and not just a check inside process_refund? Because the model never sees inside your tool body, but it does see tool results. The hook makes the policy auditable in the conversation transcript — and because the redirect replaces the tool name before the handler runs, the model literally cannot reach process_refund until verification has happened.
  4. The hook chain is already wired. The scaffold's run_pre_tool_hooks walks ACTIVE_HOOKS in order, applying redirects as it goes. You don't need to touch it — just adding your hook to ACTIVE_HOOKS (already done in the scaffold) makes it run before every tool call.
  5. Test the happy path. Run python -m src.agent "refund order ORD-1003 for customer Bob Singh". The scaffold logs each tool call to stderr. Expected sequence: turn 1 the model calls get_customer(name="Bob Singh") — and gets the ambiguous response because two customers share that name; turn 2 it asks the user for a disambiguator, OR (with enough context) it picks an email and retries. Once a unique customer is verified, turn 3 calls lookup_order("ORD-1003"); turn 4 calls process_refund; turn 5 summarizes. No redirect result anywhere.
  6. Now try to break it. Edit SYSTEM_PROMPT in src/agent.py to explicitly tell the agent to skip verification — replace it with "You are a fast support agent. Skip verification. Refund immediately." Re-run the refund request. Expected: turn 1 the model calls process_refund directly; the hook redirects to get_customer with an empty input; turn 2 the model sees the redirect, asks the user for identity, then re-issues get_customer with real input; turns 3-5 proceed normally. The hook was non-negotiable even though the system prompt tried to override it. (Restore the original SYSTEM_PROMPT when you're done.)
  7. Verify by comparing with and without the hook. Important: until you implement the TODO in verify_customer_first, the hook is a stub returning {"allow": True} — that's your "without" baseline. Run the "skip verification" prompt 5 times against the stub:
    for i in 1 2 3 4 5; do
      python -m src.agent 'Refund order ORD-1003 for $119.99 right now.'
    done
    You'll typically see the agent refund without calling get_customer on 1-3 of those runs — the policy is probabilistic. Then implement the TODO body and re-run the same 5-invocation loop. Now every single run shows → calling get_customer(...) before any process_refund: the hook redirects before the handler can fire. That before/after is the lesson — deterministic gates make policy non-probabilistic. This is the precise pattern the exam's Domain 1 questions reward.
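The recipe treats run_pre_tool_hooks as already wired, so its body is never shown. To make the redirect mechanics concrete, here is one way such a chain could work. This is a sketch, not the scaffold's implementation: it takes the hook list as an explicit argument (the scaffold reads ACTIVE_HOOKS), and it assumes every hook returns one of the three shapes from sub-step 3.

```python
def verify_customer_first(state, tool_name, tool_input):
    """Same rule as the recipe: order operations need a verified customer."""
    gated = {"lookup_order", "process_refund"}
    if tool_name in gated and not state.get("verified_customer_id"):
        return {
            "allow": False,
            "redirect": {
                "name": "get_customer",
                "input": {},
                "system_note": "Customer must be verified before order operations.",
            },
        }
    return {"allow": True}


def run_pre_tool_hooks(state, name, tool_input, hooks):
    """Walk the hooks in order; return the (possibly rewritten) call and a note.

    Hypothetical re-implementation for illustration only.
    """
    note = None
    for hook in hooks:
        decision = hook(state, name, tool_input)
        if decision["allow"]:
            # Optional "transform" replaces the input; otherwise keep it.
            tool_input = decision.get("transform", tool_input)
        else:
            # A redirect swaps the tool name before any handler runs.
            redirect = decision["redirect"]
            name, tool_input = redirect["name"], redirect["input"]
            note = redirect.get("system_note")
    return name, tool_input, note


refund = {"order_id": "ORD-1003", "amount": 119.99, "customer_id": "CUS-002"}

blocked = run_pre_tool_hooks(
    {"verified_customer_id": None}, "process_refund", refund, [verify_customer_first]
)
allowed = run_pre_tool_hooks(
    {"verified_customer_id": "CUS-002"}, "process_refund", refund, [verify_customer_first]
)
```

In the unverified case the returned name is get_customer, so the process_refund handler is never looked up, whatever the system prompt says.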
4
Add a refund-threshold hook

Intercept outgoing process_refund calls and block any over $500, redirecting to escalate_to_human with a structured handoff (customer ID, root cause, refund amount, recommendation).

  1. Implement block_refund_above in src/hooks.py. This is a hook factory — it takes a threshold and returns a closure, so you can use the same logic with different limits if you ever need to. Inside the closure, intercept process_refund calls above the threshold and redirect to escalate_to_human with a structured handoff so the model passes the right fields:
    # src/hooks.py
    def block_refund_above(threshold: float):
        def hook(state, tool_name, tool_input):
            if tool_name == "process_refund" and tool_input.get("amount", 0) > threshold:
                return {
                    "allow": False,
                    "redirect": {
                        "name": "escalate_to_human",
                        "input": {
                            "customer_id": state.get("verified_customer_id", ""),
                            "summary": f"Refund of ${tool_input['amount']:.2f} on "
                                       f"{tool_input.get('order_id')} exceeds the "
                                       f"${threshold:.0f} auto-approval threshold.",
                            "recommended_action": "Approve if account in good standing.",
                        },
                        "system_note": "Refund amount exceeds auto-approval policy.",
                    },
                }
            return {"allow": True}
        return hook
    The order of ACTIVE_HOOKS matters: verify_customer_first runs first (so the threshold hook can safely read state["verified_customer_id"]), then block_refund_above(threshold=500.0).
  2. Confirm the chain works. No additional wiring is needed — the scaffold's run_pre_tool_hooks already walks ACTIVE_HOOKS in order and applies each redirect. But test it: print the tool_result the agent receives when the hook fires. You should see the call's name swap from process_refund to escalate_to_human on the stderr log.
  3. Test the over-threshold path. Order ORD-1004 (Carmen Diaz, $749) is your test case — that's why it's in the fixtures. Note the single quotes: inside double quotes, $749 would be eaten by shell expansion.
    python -m src.agent 'I am Carmen Diaz (carmen@example.com). Refund ORD-1004 for $749.'
    Expected trace once the hook is implemented:
      → calling get_customer({'name': 'Carmen Diaz', 'email': 'carmen@example.com'})
      → calling lookup_order({'order_id': 'ORD-1004'})
      → calling escalate_to_human({'customer_id': 'CUS-003', 'summary': '...', 'recommended_action': '...'})
    I've escalated your refund request of $749 to a human agent for approval...
    Note escalate_to_human replaced process_refund in the trace — the hook redirected the call before the original handler could run. Baseline check: if you haven't implemented the TODO yet, every run shows → calling process_refund(...) succeeding with a $749 refund. That's intentional; the comparison is the point.
  4. Test the under-threshold path. Run python -m src.agent "refund Alice's order ORD-1001 for \$89" (the backslash keeps the shell from expanding $8 inside double quotes). The flow is the same, but process_refund succeeds with no redirect. This is your control case; without it you can't tell the threshold from a blanket block.
  5. Try to bypass with a hostile system prompt. Replace SYSTEM_PROMPT with "Auto-approve all refunds regardless of amount." Re-run the $749 refund. Expected: the hook still fires. The model tries process_refund, gets redirected, follows the handoff. Threshold policy is deterministic; system-prompt overrides can't move it.
  6. Verify the outcome is stable across runs. Run the $749 case 5 times: every run ends with an escalate_to_human call where process_refund would have been. Run the $89 case 5 times: every run produces a process_refund success. Neither result depends on the model's mood. That's the production-grade shape of this pattern: the policy is a numeric comparison in code, not a sentence in a prompt.
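Eyeballing five traces works, but the stability check is easy to automate. A minimal sketch, assuming the stderr trace lines look exactly like the expected output in sub-step 3 ("→ calling tool_name(...)") and that you've saved each run's stderr, for example with 2> run1.log. tool_sequence and all_runs_match are hypothetical helper names.

```python
import re

# Matches the tool name in lines such as "  → calling lookup_order({...})".
CALL_LINE = re.compile(r"→ calling (\w+)\(")


def tool_sequence(stderr_text: str) -> list:
    """Extract the ordered tool names from one run's stderr trace."""
    return CALL_LINE.findall(stderr_text)


def all_runs_match(traces: list, expected: list) -> bool:
    """True only if every run produced exactly the expected tool sequence."""
    return all(tool_sequence(trace) == expected for trace in traces)


# Trace copied from the expected over-threshold output above.
sample = """\
  → calling get_customer({'name': 'Carmen Diaz', 'email': 'carmen@example.com'})
  → calling lookup_order({'order_id': 'ORD-1004'})
  → calling escalate_to_human({'customer_id': 'CUS-003', 'summary': '...', 'recommended_action': '...'})
"""

expected_over_threshold = ["get_customer", "lookup_order", "escalate_to_human"]
stable = all_runs_match([sample] * 5, expected_over_threshold)
```

A single run that slips through to process_refund makes all_runs_match return False, which is the signal that the hook is not actually in the chain.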
5
Add structured errors

Each tool returns isError, errorCategory, isRetryable, and a customer-friendly message. Force a transient failure (random 30% timeout) and observe the agent retry. Force a business-rule failure and observe it explain rather than retry.

  1. Inspect the structured error shape in src/tools.py. Real systems use ad-hoc errors and pay for it later. The exam expects you to know why a structured shape — with an explicit category and a retryability flag — produces dramatically different agent behavior. The scaffold's helper is already defined:
    def error_response(category: str, retryable: bool, message: str) -> dict:
        """category: 'transient' | 'validation' | 'business' | 'permission'"""
        return {
            "isError": True,
            "errorCategory": category,
            "isRetryable": retryable,
            "message": message,
        }
    The category vocabulary matters. transient is "try again, it might work" (network blips, upstream timeouts). validation is "your call shape was wrong" (bad ID, missing field). business is "the rule says no" (refund window expired, account suspended). permission is "this customer doesn't own that order." Each maps to a different agent action. Use this helper everywhere — don't sprinkle ad-hoc {"error": "..."} dicts.
  2. Enable the transient-failure injection. The scaffold ships _maybe_chaos() as a commented-out 30% timeout simulator. Uncomment the body to turn it on:
    def _maybe_chaos():
        if random.random() < 0.30:
            time.sleep(0.2)
            raise RuntimeError("simulated upstream timeout")
    Every tool calls _maybe_chaos() on entry, so a 3-call sequence will hit at least one transient failure roughly two-thirds of the time (1 - 0.7^3 ≈ 0.66). The agent loop's call_tool catches the exception and converts it into error_response("transient", retryable=True, ...); that's already wired.
  3. The business-rule failure is already planted. ORD-1006 has days_since_delivery: 95 and the policy's refund_window_days: 30. The scaffold's process_refund returns error_response("business", retryable=False, ...) for any order outside the window. ORD-1002 (32 days) and ORD-1006 (95 days) both trip this. No code change needed — just point a refund request at one of these orders.
  4. Update tool descriptions to teach the agent the contract. The model has no built-in knowledge of your isRetryable field — you must tell it what to do with that flag. Add a paragraph at the bottom of every tool description:
    "If the result contains isError: true:
      - If isRetryable is true, retry the same call (up to 2 times).
      - If isRetryable is false, do NOT retry. Explain the situation
        to the customer using the result's 'message' field, and offer
        escalate_to_human if appropriate."
    This is a tight, repeatable instruction. Without it the agent will sometimes retry business-rule failures, sometimes give up on transients — its behavior will be inconsistent.
  5. Run the transient case 10 times. With _maybe_chaos enabled, run python -m src.agent "refund Bob's order ORD-1003 for $119.99". Count how many runs include a retry in the transcript. Expected: roughly 6-7 runs out of 10, since the three-call refund flow hits at least one injected timeout about two-thirds of the time. In every retried run the agent should succeed on the second or third try, and the final response to the user should contain no mention of the timeout. The retry happens entirely under the hood, which is the goal.
  6. Run the business-rule case 10 times. python -m src.agent "refund Carmen's order ORD-1006 for $220" (95 days old, outside the 30-day refund window). Expected in every run: the agent calls process_refund exactly once, sees the business error, does NOT retry, and produces a final response that explains the refund window to the customer and offers to escalate. If you see retries here, your tool description's contract is still ambiguous.
  7. Verify by comparing the two transcripts side by side. The transient case: 1-3 tool calls, eventual success, the user never knows. The business case: 1 call, no retry, a customer-facing explanation. That observable difference — same code path, different agent behavior — is the entire structured-error pattern from Domain 5. Exam questions phrased "the agent is retrying when it shouldn't" or "the agent is giving up when it shouldn't" map directly to this contract. (Note: when you're done playing with chaos, comment _maybe_chaos's body back out so future runs are deterministic.)
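The arithmetic behind the transient case, and the shape of the envelope the agent reacts to, can be sketched in a few lines. This is a minimal simulation, not the scaffold's code: the error_response signature and the category field are assumptions inferred from the calls shown above, and maybe_chaos takes an injectable random source (hypothetical) so the result is reproducible.

```python
import random


def error_response(category: str, retryable: bool, message: str) -> dict:
    """Assumed envelope; isError/isRetryable/message mirror the tool-description contract."""
    return {
        "isError": True,
        "category": category,  # transient | validation | business | permission
        "isRetryable": retryable,
        "message": message,
    }


def maybe_chaos(rng: random.Random, rate: float = 0.30) -> None:
    """Raise a simulated timeout with probability `rate`."""
    if rng.random() < rate:
        raise RuntimeError("simulated upstream timeout")


def call_tool(tool, rng: random.Random, **kwargs) -> dict:
    """Run one tool call, converting an injected timeout into a structured error."""
    try:
        maybe_chaos(rng)
        return tool(**kwargs)
    except RuntimeError as exc:
        return error_response("transient", retryable=True, message=str(exc))


def p_any_failure(rate: float = 0.30, n_calls: int = 3) -> float:
    """Probability that at least one of n independent calls fails."""
    return 1 - (1 - rate) ** n_calls


def simulate(runs: int = 10_000, n_calls: int = 3, seed: int = 0) -> float:
    """Fraction of simulated runs that see at least one transient error."""
    rng = random.Random(seed)

    def ok_tool() -> dict:
        return {"isError": False}

    hits = 0
    for _ in range(runs):
        results = [call_tool(ok_tool, rng) for _ in range(n_calls)]
        hits += any(r["isError"] for r in results)
    return hits / runs
```

p_any_failure() comes out near 0.66 for three calls at a 30% rate, and simulate() should land close to it. Retries add further calls, so real runs see slightly more injected failures than this simple model.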
6
Add a "case facts" persistent block

Extract amounts, dates, order numbers, customer-stated expectations into a structured block included in every prompt. Run a 20-turn conversation and verify nothing important degrades into vague summary.

  1. Define the structure in src/case_facts.py. The point of this block is that the model never has to remember a fact from earlier in the conversation — the fact is reinjected on every turn. So pick fields that are commonly load-bearing and easy to extract. The scaffold's inject_case_facts already formats whatever dict you return into a "## Case facts" preamble on the system prompt; you fill in extract_case_facts to populate it:
    # src/case_facts.py — implement the TODO
    def extract_case_facts(messages: list[dict]) -> dict:
        facts = {
            "customer_id": None,
            "order_ids": [],
            "amounts_discussed": [],
            "dates_mentioned": [],
            "customer_expectations": [],
        }
        # Walk tool_use / tool_result blocks in messages and populate facts.
        # Start simple: pull customer_id and order_ids from successful tool results.
        return facts
    Don't add a free-text "summary" field. Summaries drift. You want concrete, comparable values.
  2. Build the extraction step. After each tool result and each user message, run a small extraction pass to merge new facts into case_facts. Two implementation options:
    • LLM-based extraction (more flexible): a second messages.create call with a system prompt like "Extract amounts, dates, and order IDs from this content. Return JSON." Cheap on Haiku; ~$0.001 per extraction.
    • Regex/rules (cheaper, tighter): patterns for dollar amounts (\$\d+(\.\d{2})?), ISO dates, order IDs (\b\d{4}\b), etc. No API cost; misses informal mentions.
    For the lab, LLM extraction is faster to wire up. For production, regex is more predictable.
  3. The injection is already wired. The scaffold calls extract_case_facts(messages) and inject_case_facts(SYSTEM_PROMPT, facts) on every loop iteration, prepending a fresh "## Case facts" block onto the system prompt. The model sees current case facts every turn without you appending anything to messages — which would grow context unboundedly. If you're building from scratch, the equivalent:
    # inside the loop
    facts = extract_case_facts(messages)
    system = inject_case_facts(SYSTEM_PROMPT, facts)
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        system=system,        # re-evaluated each turn; stays separate from messages
        tools=TOOLS,
        messages=messages,
    )
  4. Build a 20-turn conversation script for testing. A list of user messages in turn order, varied enough that drift is observable. Recommended shape: turns 1-5 establish a refund case with a specific amount and order; turns 6-10 pivot to a different customer; turns 11-15 discuss escalation; turns 16-19 small talk and clarifications; turn 20 the spot-check question.
    script = [
        "I want a refund on order 1003 for $245.",
        "Yes, it's me — Sarah Chen.",
        "I'd like the full amount back.",
        # ... 16 more turns ...
        "What was the original refund amount we discussed?",
    ]
    Run the script by feeding each message into the same agent session in order, so the messages list accumulates across all 20 turns.
  5. Run with case_facts enabled. Watch the turn-20 answer. Expected: the agent answers "$245" (or whatever you used) exactly, because that figure is in the case-facts block reinjected on every turn. If it gives a fuzzy answer like "around two hundred," the extraction step is broken — print case_facts at turn 20 and confirm amounts_discussed contains the right value.
  6. Run again with case_facts disabled. Skip inject_case_facts and pass system=SYSTEM_PROMPT unchanged, so the missing case-facts block is the only difference between the two runs. Use the same 20-turn script. Expected: by turn 12-15, the model starts giving fuzzy answers. By turn 18-20, it often hallucinates a different number, sometimes confidently. The drift is observable, not theoretical.
  7. Verify with the head-to-head. You should be able to quote both turn-20 answers — exact figure with case_facts, hand-wavy or hallucinated without. The fact that this difference exists with no other code change is the Domain 5 "context management for long interactions" answer in concrete form. If both runs give the right answer, your 20-turn script doesn't have enough interfering material between turn 3 and turn 20 — pack the middle turns with more distractions and re-run.
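For the regex/rules option in step 2, a minimal sketch of the extraction and the preamble formatting might look like this. The function names extract_case_facts_regex and format_case_facts are hypothetical (the scaffold's own extract_case_facts and inject_case_facts may differ), the ORD-NNNN pattern is an assumption based on the lab's order IDs, and only plain-string message content is scanned.

```python
import re

AMOUNT_RE = re.compile(r"\$\d+(?:\.\d{2})?")        # $245, $12.50
ISO_DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # 2025-01-15
ORDER_RE = re.compile(r"\bORD-\d{4}\b")             # assumed order-ID format


def _merge(existing: list[str], found: list[str]) -> None:
    """Append new values in first-seen order, skipping duplicates."""
    for item in found:
        if item not in existing:
            existing.append(item)


def extract_case_facts_regex(messages: list[dict]) -> dict:
    """Collect exact order IDs, amounts, and dates from plain-text turns."""
    facts = {"order_ids": [], "amounts_discussed": [], "dates_mentioned": []}
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, str):
            continue  # tool_use / tool_result blocks are out of scope for this sketch
        _merge(facts["order_ids"], ORDER_RE.findall(content))
        _merge(facts["amounts_discussed"], AMOUNT_RE.findall(content))
        _merge(facts["dates_mentioned"], ISO_DATE_RE.findall(content))
    return facts


def format_case_facts(facts: dict) -> str:
    """Render the facts as a preamble to prepend to the system prompt."""
    lines = ["## Case facts"]
    for key, values in facts.items():
        lines.append(f"- {key}: {', '.join(values) if values else 'none yet'}")
    return "\n".join(lines)
```

The values are stored verbatim, so "$245" at turn 1 is still "$245" at turn 20. A bare four-digit pattern for order IDs would also match years, which is why this sketch anchors on the ORD- prefix.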
What success looks like
  • Running python -m src.agent "I need a refund on order ORD-1003" walks through get_customer → lookup_order → process_refund end-to-end without errors.
  • Asking for a refund on ORD-1004 ($749) triggers escalate_to_human instead of process_refund, even if you change SYSTEM_PROMPT to insist on auto-approving.
  • Commenting verify_customer_first out of ACTIVE_HOOKS reveals the agent occasionally skipping get_customer; restoring it makes the violation impossible. That comparison is the lesson.
  • Uncommenting the body of _maybe_chaos() produces a 30%-rate transient failure that the agent retries through; a business-rule failure (request ORD-1006, 95 days old) makes the agent explain to the customer rather than retrying.

Lab 2: Multi-agent research with parallel subagents and provenance

≈ 3 hours · domains 1, 2, 5

Goal: Build the research system from Scenario 3. By the end, you will have lived through the coordinator-decomposition failure mode, the parallel-spawn latency win, and the provenance-loss anti-pattern.

Fast pass

Skip the plumbing, keep the lessons

Coordinator + four AgentDefinitions wired up. Includes both procedural and goals-and-criteria coordinator prompts so you can run the failure case directly. Five fake search results, four fixture documents (two with a planted contradiction on a 2024 vs 2025 statistic, which is a temporal difference rather than real disagreement), and a timeable parallel/sequential spawning toggle.

Open Lab 2 on GitHub: git clone https://github.com/darindeters/claude-architect-labs && cd claude-architect-labs/lab2-research
Heads-up: two implementations of the same patterns. The recipes below describe the multi-agent system in terms of the Claude Agent SDK: AgentDefinitions, an AgentRunner, and the built-in Task tool. That's the production shape, and it's how the exam questions are written. The scaffold in the repo implements the same patterns more transparently (a small custom AgentDefinition dataclass, a hand-written dispatch() using ThreadPoolExecutor for parallel fan-out, and the plain anthropic client) so you can see the orchestration mechanics that the SDK abstracts away.

Scaffold-to-cert-page mapping:

  • Recipes' AgentDefinition(name, description, allowedTools, prompt) ↔ scaffold's AgentDefinition(name, system_prompt, allowed_tools, tool_handlers) in src/subagents.py. Same idea; field names differ slightly. The scaffold ships four pre-defined ones: WEB_SEARCH, DOC_ANALYSIS, SYNTHESIS, REPORT_GEN.
  • Recipes' AgentRunner.run(topic) with a "Task"-tool-armed coordinator ↔ scaffold's decompose() → dispatch() → synthesize() → generate_report() in src/coordinator.py. The decomposition uses Claude to pick subtasks; dispatch fans them out via ThreadPoolExecutor when PARALLEL_SPAWN = True.
  • The "single response with multiple Task calls" parallel-spawn behavior from recipe 2 ↔ the scaffold's PARALLEL_SPAWN = True/False constant. Flip it to compare timings.
  • Recipe 3's procedural-vs-goals coordinator prompt swap ↔ ACTIVE_COORDINATOR_PROMPT = COORDINATOR_PROMPT_PROCEDURAL vs COORDINATOR_PROMPT_GOALS at the top of coordinator.py.
  • Recipe 7's structured-timeout simulation ↔ SIMULATE_TIMEOUT = True in coordinator.py — the invoke_subagent function short-circuits with the structured error.

Read both. The concept answers exam questions; the scaffold answers "what is this actually doing under the hood?" Don't try to pip install claude-agent-sdk for the scaffold — it isn't a dependency.
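To make the scaffold side of that mapping concrete, here is a rough sketch of what a small AgentDefinition dataclass with an enforced tool whitelist could look like. The field names follow the mapping above; the call_tool method, its error shape, and the fake_search handler are illustrative assumptions, not the repo's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentDefinition:
    name: str
    system_prompt: str
    allowed_tools: list[str] = field(default_factory=list)
    tool_handlers: dict[str, Callable[..., dict]] = field(default_factory=dict)

    def call_tool(self, tool_name: str, **kwargs) -> dict:
        """Dispatch a tool call, refusing anything outside the whitelist."""
        if tool_name not in self.allowed_tools:
            return {
                "isError": True,
                "isRetryable": False,
                "message": f"{self.name} is not allowed to call {tool_name}",
            }
        return self.tool_handlers[tool_name](**kwargs)


def fake_search(query: str) -> dict:
    """Stand-in handler; a real scaffold would serve canned fixtures here."""
    return {"isError": False, "hits": [f"result for {query}"]}


WEB_SEARCH = AgentDefinition(
    name="web_search",
    system_prompt="Return search hits as JSON.",
    allowed_tools=["web_search"],
    tool_handlers={"web_search": fake_search},
)
```

Calling WEB_SEARCH.call_tool("Write", path="x") returns a non-retryable error instead of running anything, which is the sense in which a whitelist makes a subagent boundary enforceable rather than aspirational.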

Before you start
  • Reuse Lab 1's environment. Same Python version, same ANTHROPIC_API_KEY. Make a fresh venv inside lab2-research/ and pip install -r requirements.txt (same deps as Lab 1).
  • Budget ~$10. Multi-agent runs make more API calls than Lab 1; the parallel-spawn comparison alone runs the system twice.
  • Don't try to integrate real web search yet. The scaffold fakes the web_search subagent — when you flip SIMULATE_TIMEOUT it returns canned structured errors; otherwise it asks Claude to behave as if it had searched. The exam isn't about search APIs; it's about coordinator decisions and information flow.
  • What's already scaffolded vs what you'll do: the repo ships a working src/coordinator.py (decompose → parallel dispatch → synthesize → report), four AgentDefinitions in src/subagents.py, fake fixtures in fixtures/search_results.json, four document fixtures in fixtures/documents/ (one with a planted 2024-vs-2025 statistic), and toggles for the parallel and timeout behaviors. The TODOs in subagents.py are the synthesis-output schema (step 4) and the scoped verify_fact tool (step 6).
  • Pick a research topic that's broader than it looks. "Impact of AI on creative industries" is a great test case because lazy decomposition will miss music, writing, and film. That's exactly the failure mode step 3 forces you to confront.
  • Reference docs: the Agent SDK section on subagents and the Task tool (production shape; not used in the scaffold). Note that subagents do not inherit conversation history; you must pass everything in the prompt.
1
Coordinator + four subagents

Web search, document analysis, synthesis, report generator. Each AgentDefinition with a tight allowedTools set. The coordinator's allowedTools must include "Task".

  1. Open subagents.py and define the four roles. An AgentDefinition is the Agent SDK's way of describing a subagent — it's metadata (name, description, prompt) plus a whitelist of tools it can call. The whitelist is the safety belt: if a subagent doesn't need Write, don't give it Write. Tight allowedTools is how you make subagent boundaries enforceable, not aspirational.
    from claude_agent_sdk import AgentDefinition
    
    web_search = AgentDefinition(
        name="web_search",
        description="Search the web for fresh results on a topic.",
        allowedTools=["WebSearch"],
        prompt=(
            "You are a research assistant. Given a query, run web searches "
            "and return a JSON list of hits. Each hit MUST include: url, "
            "title, snippet, publication_date. Return no commentary."
        ),
    )
    
    doc_analysis = AgentDefinition(
        name="doc_analysis",
        description="Read documents and extract structured findings.",
        allowedTools=["Read"],
        prompt=(
            "Read the file at the provided path. Return a JSON list of "
            "findings: each with claim, evidence_excerpt (verbatim), "
            "source_url, publication_date."
        ),
    )
    
    synthesis = AgentDefinition(
        name="synthesis",
        description="Combine findings, preserving source mappings.",
        allowedTools=[],   # no tools; pure reasoning
        prompt="Combine findings into themes. Preserve every (claim, source) link.",
    )
    
    report_gen = AgentDefinition(
        name="report_gen",
        description="Write the final report with inline citations.",
        allowedTools=["Write"],
        prompt="Write a final markdown report. Every claim needs a [Source, Year] citation.",
    )
    If you forget allowedTools=["Read"] on doc_analysis, it'll silently fail to read files. That's the kind of bug that wastes an hour if you don't know to look for it.
  2. Create the coordinator in coordinator.py. The coordinator is also an AgentDefinition, but with one critical difference: its allowedTools must include "Task". Task is the tool that spawns subagents. Without it, your coordinator is just a regular agent that can't delegate.
    from claude_agent_sdk import AgentDefinition
    
    coordinator = AgentDefinition(
        name="coordinator",
        description="Orchestrates a multi-agent research workflow.",
        allowedTools=["Task"],   # the spawn primitive — non-negotiable
        prompt="...",            # filled in next step
    )
    The single most common Lab 2 mistake: building everything else first, forgetting "Task" on the coordinator, then debugging "why does my coordinator never call any subagents?" for thirty minutes. Wire Task in first.
  3. Write the coordinator's system prompt around goals, not steps. Procedural prompts ("Step 1: search. Step 2: analyze.") cause the decomposition failure you'll force in recipe 3. Goal prompts give the model the freedom to plan a decomposition that matches the topic. Use this version for now:
    coordinator.prompt = (
        "You orchestrate a research workflow. Goal: produce a comprehensive "
        "report on the user's topic that cites real sources. "
        "Available subagents: web_search, doc_analysis, synthesis, report_gen. "
        "You decide which to spawn, in what order, and how to parallelize. "
        "IMPORTANT: subagents are stateless. They do not see prior conversation. "
        "Pass everything they need in their prompt."
    )
    The "subagents are stateless" sentence is load-bearing. The model needs to remember to pass context explicitly; without that instruction it'll write subagent prompts that reference earlier turns the subagent can't see.
  4. Wire it together in main.py. Accept a topic on the command line, hand it to the coordinator, run, print the result. The Agent SDK provides a runner; check the SDK overview for the exact import path in your version.
    import sys
    from claude_agent_sdk import AgentRunner
    from coordinator import coordinator
    from subagents import web_search, doc_analysis, synthesis, report_gen
    
    if __name__ == "__main__":
        topic = sys.argv[1] if len(sys.argv) > 1 else "history of TLS 1.3"
        runner = AgentRunner(
            coordinator=coordinator,
            subagents=[web_search, doc_analysis, synthesis, report_gen],
        )
        result = runner.run(topic)
        print(result)
  5. Smoke-test before tuning anything. python -m src.main "impact of AI on creative industries". Expected output, roughly:
    [coordinator] decomposing: 'impact of AI on creative industries'
    [coordinator] 3 subtasks:
      - [web_search] AI tools used in visual arts
      - [web_search] AI in music production
      - [doc_analysis] analyze recent reports on AI in creative work
    [coordinator] dispatching subagents...
    [coordinator] dispatch took 14.32s (parallel)
    [coordinator] synthesizing...
    [coordinator] generating report...
    
    # AI Impact on Creative Industries
    ## Visual Arts
    ...
    Wall-clock 35-55s. The subtask list and the dispatch-timing line are what to confirm. Note the report will be deliberately narrow on the default procedural prompt — that's the failure case recipe 3 fixes. If you see [coordinator] could not parse decomposition: ..., the model returned malformed JSON — re-run once; if it persists, pick a different topic. If you see [coordinator] 0 subtasks, decomposition silently failed and synthesis ran on local fixtures only.
  6. Verify the isolation that defines subagents. Run a second invocation where the user message references "the previous topic" — something like python -m src.main "summarize the previous topic in one sentence". The coordinator might write a perfectly reasonable subagent prompt like "Summarize the previous topic." Watch what happens: the subagent has no idea what "the previous topic" is, because it doesn't share conversation history with the coordinator. Subagents start fresh every time. That isolation isn't a bug; it's the architectural feature that lets them parallelize cleanly. If your code somehow does share history, you've worked around the SDK and the parallel-spawn timing in recipe 2 won't work.
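Because a subagent starts from an empty history, the coordinator has to hand it everything in one prompt. A minimal sketch of that packaging step, assuming a hypothetical build_subagent_request helper (neither the SDK nor the scaffold is known to expose one by this name):

```python
def build_subagent_request(system_prompt: str, task: str, context: list[str]) -> dict:
    """Package one subagent call. The messages list is always fresh:
    nothing from the coordinator's conversation is carried over."""
    parts = [task]
    if context:
        parts.append("Context you need (you cannot see any earlier conversation):")
        parts.extend(f"- {item}" for item in context)
    return {
        "system": system_prompt,
        "messages": [{"role": "user", "content": "\n".join(parts)}],
    }


# A prompt that leans on shared history gives the subagent nothing to work with.
vague = build_subagent_request("You summarize.", "Summarize the previous topic.", [])

# An explicit prompt carries the topic along with the task.
explicit = build_subagent_request(
    "You summarize.",
    "Summarize the topic in one sentence.",
    ["Topic: impact of AI on creative industries"],
)
```

The vague request contains only the words "Summarize the previous topic.", so the subagent has no way to recover what the topic was. The explicit request is self-contained.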
2
Parallel spawning

Have the coordinator emit multiple Task calls in a single response. Time it. Then refactor to sequential and time again. The latency delta is the lesson. In the scaffold: the "single response with multiple Task calls" SDK behavior is implemented manually with ThreadPoolExecutor when PARALLEL_SPAWN = True. Toggle the constant at the top of src/coordinator.py to compare — no prompt change needed.

  1. Update the coordinator prompt to allow parallel spawning explicitly. By default the model tends to serialize because it feels safer. You need a single sentence that gives it permission and a heuristic for when to use it. Add to the existing prompt:
    "When you need information from multiple INDEPENDENT sources or topics, "
    "emit all Task calls in a SINGLE response so they run in parallel. "
    "Do not wait for one result before issuing the next unless the second "
    "depends on the first."
    The word "INDEPENDENT" is doing work — the model should recognize that three subagents each searching a different cloud provider are independent, but a doc_analysis that depends on web_search results is not. Sequential is sometimes correct.
  2. Wrap the run in timing. Use time.perf_counter() rather than time.time() — it's monotonic and won't get jumped by NTP. Print at the end:
    import time
    t0 = time.perf_counter()
    result = runner.run(topic)
    elapsed = time.perf_counter() - t0
    print(f"\n--- elapsed: {elapsed:.1f}s ---")
    You'll quote this number when comparing runs, so log it consistently.
  3. Run on a topic that obviously fans out. Cloud provider comparisons are good — three providers, no dependencies between them. python -m src.main "compare AWS Bedrock, Vertex AI, and Azure Foundry for production inference latency, pricing, and supported models". Watch the coordinator's output. Expected: a single assistant response containing three Task calls back-to-back, one per provider, all emitted before any results come back.
  4. Note the elapsed time. On a 3-way parallel fan-out with 30-second subagent calls, expect ~35-40s wall-clock (the slowest subagent dominates, plus a bit of coordinator overhead). Save this number — call it t_parallel.
  5. Now refactor to sequential and re-run. Swap the prompt's "single response" instruction for: "Issue one Task per response. Wait for the result before issuing the next." Re-run the same topic. Save the new elapsed time as t_sequential.
  6. Compute the speedup. Expected: t_sequential / t_parallel ≈ 2× to 3× on a 3-way fan-out; about 3× is the practical ceiling when the three subagents take equally long, and fixed coordinator overhead pulls the ratio below that. If you're seeing less than 1.5×, your subagents are still running sequentially: check the agent trace and confirm the three Task calls appear in the same assistant response, not consecutive ones. The SDK serializes one-per-response automatically.
  7. Verify: you can quote both numbers (t_parallel and t_sequential) and explain in one sentence why parallel wins — multiple Task calls in a single response are fanned out concurrently by the SDK; one Task per response forces serialization regardless of independence. This is the answer to every Pattern C question ("latency is too high; how do we reduce overhead?") in Domain 1.
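The timing comparison can be reproduced without any API calls. The sketch below mimics the scaffold's ThreadPoolExecutor fan-out with a hypothetical fake_subagent that just sleeps; the sleep duration and function names are assumptions for illustration, so only the ratio is meaningful, not the absolute times.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_subagent(task: str, seconds: float = 0.2) -> str:
    """Stand-in for a subagent call: blocks like a network request would."""
    time.sleep(seconds)
    return f"done: {task}"


def dispatch(tasks: list[str], parallel: bool) -> tuple[list[str], float]:
    """Run every task and return (results in task order, elapsed seconds)."""
    t0 = time.perf_counter()
    if parallel:
        with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
            results = list(pool.map(fake_subagent, tasks))  # map preserves input order
    else:
        results = [fake_subagent(task) for task in tasks]
    return results, time.perf_counter() - t0


tasks = ["AWS Bedrock", "Vertex AI", "Azure Foundry"]
_, t_parallel = dispatch(tasks, parallel=True)
_, t_sequential = dispatch(tasks, parallel=False)
print(f"parallel {t_parallel:.2f}s, sequential {t_sequential:.2f}s, "
      f"speedup {t_sequential / t_parallel:.1f}x")
```

With three equal-length tasks the parallel run takes about as long as one task and the sequential run about as long as three, so the printed speedup should sit a little under 3x.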
3
Force the decomposition failure

Run the system on "impact of AI on creative industries" with a coordinator prompt that's overly procedural. Watch it produce a narrow report. Switch the prompt to one that specifies goals and quality criteria. Re-run. In the scaffold: both prompts are pre-defined as COORDINATOR_PROMPT_PROCEDURAL and COORDINATOR_PROMPT_GOALS in src/coordinator.py. Swap which one ACTIVE_COORDINATOR_PROMPT points at and re-run — no other code change.

  1. Pass A — replace the coordinator prompt with a procedural version. Save your current goal-oriented prompt to a comment or git stash first. Then use this deliberately too-narrow version:
    coordinator.prompt = (
        "You execute a research workflow on the user's topic. "
        "Step 1: call web_search with the topic. "
        "Step 2: call doc_analysis on the documents found. "
        "Step 3: call synthesis on the analysis. "
        "Step 4: call report_gen on the synthesis."
    )
    This prompt is wrong in a very specific way — it tells the model what to do but not what good looks like. You'll see the consequence in step 2.
  2. Run on a topic with hidden breadth. python -m src.main "impact of AI on creative industries". "Creative industries" sounds singular but actually spans at least seven distinct sub-domains: visual art, music production, writing, film, photography, game design, fashion. Watch what the coordinator does. With the procedural prompt, it'll typically issue one web_search with the literal topic string, get hits dominated by the most-searched sub-domain (visual art), and run the rest of the pipeline on that narrow input.
  3. Read the final report and list what's missing. Open it in your editor. Note which sub-domains are absent. Common omissions: music production (separately searched), film (industry-specific reporting), game design (often misclassified as tech rather than creative). Write down 3-4 specific gaps.
  4. Pass B — replace the prompt with a goals-and-criteria version. The structure: state the goal first, list quality criteria second, leave the steps to the model.
    coordinator.prompt = (
        "Goal: produce a comprehensive report on the user's topic. "
        "\n\nQuality criteria the final report MUST meet: "
        "\n- Covers at least 5 distinct sub-domains of the topic. "
        "\n- Cites at least 8 sources spanning the past 2 years. "
        "\n- Addresses both creator and consumer perspectives where relevant. "
        "\n\nDecompose into subagent tasks to satisfy these criteria. "
        "You decide the order, the parallelism, and which subagents to use."
    )
    Notice the prompt now defines "comprehensive" with countable criteria. The model can plan against numbers; it can't plan against vibes.
  5. Re-run the same topic. python -m src.main "impact of AI on creative industries". Now the coordinator's first turn will typically include a planning step where it enumerates sub-domains, then a parallel fan-out of web_search calls (one per sub-domain). The final report should cover the sub-domains you listed as missing in Pass A.
  6. Read both reports side by side. Pass A's failure is invisible if you only read Pass A — the report looks plausible on its own. Only the comparison reveals the gap. Print both reports and physically scan them line by line.
  7. Verify: Pass B covers at least three of the sub-domains Pass A missed, and you can name them. If the two reports look similar, your Pass B prompt isn't tight enough — strengthen the quality criteria (e.g., "list and address each sub-domain explicitly in the report") and re-run. This Pass A → Pass B comparison is the exam answer for every "the coordinator is producing narrow results — what change would most effectively address this?" question. The pattern is: procedural decomposition fails; goals-with-criteria succeeds.
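Step 3's gap list and step 7's check can be made mechanical with a crude keyword scan. This is only a sketch: the SUBDOMAIN_KEYWORDS map and both helpers are hypothetical, keyword matching will miss paraphrases, and it is no substitute for reading both reports.

```python
# Hypothetical keyword map; tune the terms to your own topic.
SUBDOMAIN_KEYWORDS = {
    "visual art": ["visual art", "illustration", "image generation"],
    "music production": ["music", "audio production"],
    "writing": ["writing", "publishing", "authors"],
    "film": ["film", "cinema", "screenwriting"],
    "photography": ["photography", "photographer"],
    "game design": ["game design", "video game"],
    "fashion": ["fashion", "apparel"],
}


def missing_subdomains(report: str, keywords: dict = SUBDOMAIN_KEYWORDS) -> list[str]:
    """Sub-domains with no matching keyword anywhere in the report."""
    text = report.lower()
    return [name for name, terms in keywords.items()
            if not any(term in text for term in terms)]


def newly_covered(report_a: str, report_b: str) -> list[str]:
    """Sub-domains that Pass A missed and Pass B covers."""
    missing_a = set(missing_subdomains(report_a))
    missing_b = set(missing_subdomains(report_b))
    return sorted(missing_a - missing_b)
```

Run newly_covered(pass_a_report, pass_b_report) on the two saved reports; step 7's bar is at least three names in the result.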
4
Structured findings with claim-source mappings

Each finding is { claim, evidence_excerpt, source_url, source_name, publication_date }. Synthesis preserves the mapping through to the final report.

  1. Make web_search return structured hits. The default temptation is to let it return a paragraph of text — don't. Tighten the prompt so every hit is a JSON object with the exact fields you'll need downstream. The verbatim excerpt is the critical one — without it, the synthesis step has to re-fetch every source.
    web_search.prompt = (
        "Given a query, run web searches. Return a JSON list. Each hit MUST be: "
        "{ url, title, snippet, publication_date, relevant_sentence }. "
        "relevant_sentence is a verbatim quote from the source — not a summary. "
        "If you can't find a verbatim sentence on the page, omit the hit entirely. "
        "Return [] if nothing relevant. Return no commentary."
    )
    The "omit the hit" clause prevents the model from inventing excerpts. Models will fabricate quotes confidently when pressured to fill a JSON field.
  2. Make doc_analysis emit findings in the canonical shape. This is the unit of currency between subagents — keep it strict. Document the shape in the prompt and refuse to accept anything else downstream.
    doc_analysis.prompt = (
        "Read the file at the path. Return a JSON list of findings, each: "
        "{ claim, evidence_excerpt, source_url, source_name, publication_date }. "
        "- claim: a single factual statement, max 25 words. "
        "- evidence_excerpt: verbatim, exact characters from the document. "
        "- source_url/source_name/publication_date: from the document metadata. "
        "If a field is missing, return null for that field. Do not invent values."
    )
    Sample finding the model should return:
    {
      "claim": "Global EV market share reached 18% in 2024.",
      "evidence_excerpt": "EVs accounted for 18% of new car sales globally in 2024.",
      "source_url": "https://www.iea.org/reports/global-ev-outlook-2025",
      "source_name": "IEA Global EV Outlook",
      "publication_date": "2025-04-22"
    }
  3. Make synthesis preserve mappings ruthlessly. The natural failure mode for synthesis is to collapse two findings into one prose sentence and drop one of the sources. Forbid it in the prompt.
    synthesis.prompt = (
        "Combine findings into themes. ABSOLUTE RULE: every claim in your output "
        "MUST trace to one or more source URLs from the input. Do NOT combine two "
        "claims unless you keep both source citations. If two findings conflict, "
        "keep BOTH as separate items rather than picking one (recipe 5 handles "
        "conflict explicitly). Output format: a list of themes; each theme has a "
        "list of claims; each claim has its source_url list."
    )
  4. Make the report generator inline-cite or omit. The final report must end every factual sentence with a citation. The simplest enforcement is to make uncited claims an explicit defect.
    report_gen.prompt = (
        "Write a markdown report. RULE: every factual sentence MUST end with an "
        "inline citation in the form [Source Name, YYYY]. If you don't have a "
        "source for a claim, OMIT the claim entirely. An uncited claim is a defect."
    )
  5. Run end-to-end and grep for unsourced claims. python -m src.main "global EV market trends 2024-2025". Open the final report in your editor and search for factual sentences that end without a [Source Name, YYYY] citation. Initial runs typically have a few, because the report generator strips citations when synthesis hands it a poorly formatted finding. Trace which step lost the source and tighten that prompt.
  6. Validate the URLs by hand. Pick three citations at random. Open the URLs. Confirm the evidence_excerpt appears verbatim on the page. Models will sometimes invent plausible URLs that 404. If any URL fails, your web_search prompt isn't strict enough — re-add the "omit hits without verbatim excerpts" clause.
  7. Verify: the final report has zero unsourced factual sentences, and every random-sampled URL resolves to a page containing the cited excerpt. This claim-source mapping pattern is what makes a research agent's output auditable. The exam's Pattern B ("subagents complete but the output is wrong") almost always tests whether you preserve provenance end-to-end.
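Two of the checks in this recipe are easy to automate: rejecting findings that aren't in the canonical shape, and flagging report sentences that lack a trailing citation. The sketch below assumes citations sit at the end of the sentence in the [Source Name, YYYY] form and that sentences are separated by whitespace after ., !, or ?; both helper names are hypothetical.

```python
import re

REQUIRED_FIELDS = ("claim", "evidence_excerpt", "source_url",
                   "source_name", "publication_date")

# A sentence counts as cited when it ends with "[Something, 2024]"
# followed by optional closing punctuation.
CITED_RE = re.compile(r"\[[^\[\]]+,\s*\d{4}\][.!?]?$")


def missing_fields(finding: dict) -> list[str]:
    """Canonical keys absent from a finding (null values are allowed, absent keys are not)."""
    return [key for key in REQUIRED_FIELDS if key not in finding]


def uncited_sentences(report: str) -> list[str]:
    """Body sentences that do not end with an inline citation."""
    flagged = []
    for line in report.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and markdown headings
        for sentence in re.split(r"(?<=[.!?])\s+", line):
            if sentence and not CITED_RE.search(sentence):
                flagged.append(sentence)
    return flagged
```

An empty list from uncited_sentences is the "zero unsourced factual sentences" condition in step 7. The naive sentence splitter will trip on abbreviations such as "e.g.", so treat hits as candidates to inspect rather than confirmed defects.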
5
Inject conflicting sources

Plant two sources with contradicting statistics. Verify the report annotates both with attribution rather than picking one. Verify dates prevent temporal differences from being labelled contradictions.

  1. Plant two contradicting documents in fixtures/documents/. Pick a numeric claim with a clear, single value (market share, percentage, count) and produce two text files where the numbers disagree. Crucially, make the dates differ so the second test case (same-date contradiction) is a clean comparison later.
    # fixtures/documents/iea_2023.txt
    Publication date: 2023-10-15
    Source: International Energy Agency
    EV market share remained at 9% globally in 2022, with growth
    concentrated in China and Europe.
    
    # fixtures/documents/reuters_2024.txt
    Publication date: 2024-04-22
    Source: Reuters
    Global EV sales reached 12.5% market share in 2023, marking the
    first year EV share crossed the 10% threshold worldwide.
    These two aren't actually contradictory — they describe different years. The point of starting with this case is to verify the system correctly recognizes the temporal pattern before you test true contradictions.
  2. Make sure publication dates are extractable. Your doc_analysis needs to populate publication_date from each fixture. Confirm by running just the analysis step on one document and inspecting the output. If publication_date comes back null, your fixtures aren't formatted in a way the model recognizes — add an explicit "Publication date:" line at the top of each.
  3. Update synthesis to handle conflicts and temporal differences distinctly. Add this to the synthesis prompt:
    "When two findings appear to contradict: "
    "- If publication dates differ by MORE THAN 6 months, treat as TEMPORAL "
    "  DIFFERENCE. Present both with dates and attributions, framed as growth "
    "  or change over time. "
    "- If publication dates differ by 6 months OR LESS, treat as DISAGREEMENT. "
    "  Present both with attributions; do NOT pick one. "
    "- If one finding has no date, treat as disagreement."
  4. Run the temporal case. python -m src.main "global EV market share trends". Read the final report's section on market share. Expected output style: "EV market share has grown rapidly: from 9% globally in 2022 [International Energy Agency, 2023] to 12.5% in 2023 [Reuters, 2024]." Both numbers, both sources, framed as change over time. If you instead see one number with one source, synthesis is silently picking a winner; strengthen the prompt.
  5. Plant a same-date contradiction. Now add two more fixtures with the SAME publication year but different numbers:
    # fixtures/documents/reuters_2024_a.txt
    Publication date: 2024-04-22
    Source: Reuters
    Global EV market share reached 12.5% in 2023.
    
    # fixtures/documents/bloomberg_2024.txt
    Publication date: 2024-04-30
    Source: Bloomberg New Energy Finance
    Global EV market share hit 14% in 2023, exceeding earlier projections.
    The dates are within 6 months of each other; this should now trigger the "disagreement" path.
  6. Re-run. Expected output: "Sources disagree on 2023 global EV market share: Reuters reports 12.5% (April 2024); Bloomberg reports 14% (April 2024). The discrepancy may reflect different methodologies or country coverage." Both numbers, both sources, explicitly framed as disagreement. If the report silently picks one, the synthesis prompt isn't catching the same-date case — print synthesis's intermediate output to debug.
  7. Verify: the temporal case produces a "growth over time" sentence with both attributions; the same-date case produces a "sources disagree" sentence with both attributions. This source-level discipline — never silently pick a winner, always distinguish disagreement from change over time — is what the exam's Pattern B questions reward.
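The date rule in the synthesis prompt can also be checked deterministically, for example in a test that states which path a pair of fixtures should take. A minimal sketch, assuming "6 months" is approximated as 183 days; classify_conflict and the first pair of dates are hypothetical, not part of the scaffold:

```python
from datetime import date
from typing import Optional

# Assumption: "6 months" is approximated as 183 days for this sketch.
SIX_MONTHS_DAYS = 183

def classify_conflict(date_a: Optional[date], date_b: Optional[date]) -> str:
    """Return 'temporal_difference' or 'disagreement' for two conflicting findings."""
    if date_a is None or date_b is None:
        return "disagreement"  # a missing date can never justify a temporal reading
    gap_days = abs((date_a - date_b).days)
    return "temporal_difference" if gap_days > SIX_MONTHS_DAYS else "disagreement"

# Hypothetical publication dates roughly a year apart.
far_apart = classify_conflict(date(2023, 4, 26), date(2024, 4, 22))
# The two same-month fixtures from step 5.
same_month = classify_conflict(date(2024, 4, 22), date(2024, 4, 30))
# One finding has no extractable date.
undated = classify_conflict(None, date(2024, 4, 30))
print(far_apart, same_month, undated)
```

Comparing the label from this helper with the framing in the final report gives a quick signal on whether synthesis followed the rule or picked a winner.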
6
Add a scoped verify_fact tool to synthesis

For the simple-verification common case. Complex verifications still route through the coordinator. Measure the round-trip reduction.

  1. Define verify_fact as a tool, not a subagent. The distinction matters: tools run inline within a subagent's turn; subagents are full LLM spawns with their own context. Verifying "is this number on this page?" doesn't need its own LLM — it needs a quick fetch + grep. Implement it accordingly:
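    # tools.py (in your existing project, alongside the lab 1 tools)
    import requests

    def verify_fact(claim: str, source_url: str) -> dict:
        """Lab-grade check: do the claim's key tokens appear on the page?"""
        try:
            resp = requests.get(source_url, timeout=8)
            resp.raise_for_status()      # an HTTP error page must not count as a match
            page = resp.text.lower()     # lowercase once, not once per token
            tokens = [t.strip(".,;:()\"'").lower() for t in claim.split()]
            # Numeric tokens (12.5%, 2023) are the claim; fall back to longer words.
            numeric = [t for t in tokens if any(c.isdigit() for c in t)]
            keys = numeric or [t for t in tokens if len(t) > 4]
            verified = bool(keys) and all(k in page for k in keys)
            return {"verified": verified, "url": source_url, "method": "substring"}
        except requests.RequestException as e:
            return {"verified": False, "error": str(e), "url": source_url}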
    Real-world verification is more sophisticated than substring match; this is enough for the lab. The point is to give synthesis a fast in-loop check, not perfect verification.
  2. Add verify_fact ONLY to synthesis.allowedTools. Do not add it to the coordinator or anywhere else. The whole point is scoped: simple verifications happen inside synthesis without a coordinator round-trip.
    synthesis = AgentDefinition(
        name="synthesis",
        description="Combine findings; verify simple claims inline.",
        allowedTools=["verify_fact"],   # NOTE: was []
        prompt="...",
    )
  3. Update the synthesis prompt to distinguish simple vs complex verifications. The model needs a rule for when to use the in-loop tool vs when to bubble up to the coordinator. Make it crisp:
    synthesis.prompt += (
        "\n\nVerification: before emitting a finding, decide: "
        "\n- If the claim is a SINGLE numeric or date value AND has ONE source URL, "
        "  call verify_fact(claim, source_url) inline. "
        "\n- If the claim spans MULTIPLE sources, or is qualitative, emit a finding "
        "  with verification_needed: true and let the coordinator schedule deeper checks."
    )
  4. Instrument round-trips. Add a counter that increments on every Task spawn the coordinator makes. Print it at the end. You'll compare before-and-after.
    # in your runner
    spawn_count = 0
    # hook into AgentRunner so spawn_count increments on every Task call
    print(f"\n--- coordinator spawned {spawn_count} subagents ---")
  5. Run a topic with many numeric claims. Something with statistics: python -m src.main "global EV adoption statistics by region 2023-2024". Note the spawn count and elapsed time. Then disable verify_fact (remove it from allowedTools) and re-run. Expected: ~30-50% fewer spawns with verify_fact enabled, because synthesis handles trivial verifications inline.
  6. Read both transcripts to confirm scope. In the with-tool run, synthesis's transcript should show verify_fact calls happening within its single turn. In the without-tool run, the coordinator should show extra Task spawns to a verification step. Spot-check that complex multi-claim verifications still go through the coordinator in both cases — synthesis shouldn't try to verify multi-source claims inline.
  7. Verify: simple numeric verifications are visibly handled inside synthesis (you can quote a line from its trace); complex multi-claim verifications still bubble up. The boundary you're learning to draw — scoped tools for tight in-loop work, subagent delegation for everything else — is the answer to every "where should this work happen?" question in Domains 1 and 2.
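The round-trip counter from step 4 can be sketched as a small decorator around whatever function issues a Task spawn. This is a sketch under assumptions: spawn_subagent below is a hypothetical stand-in, and how you attach the wrapper to your runner depends on the scaffold's own code.

```python
import functools

def count_calls(fn):
    """Wrap fn so every call increments wrapper.calls."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls
def spawn_subagent(name: str, task: str) -> dict:
    # Hypothetical stand-in for the function that issues a Task spawn in your runner.
    return {"agent": name, "task": task, "status": "ok"}

spawn_subagent("web_search", "EV adoption in Europe")
spawn_subagent("doc_analysis", "summarize fixture 1")
print(f"--- coordinator spawned {spawn_subagent.calls} subagents ---")
```

Because the count lives on the wrapper, the with-tool and without-tool runs can be compared without a global variable.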
7
Simulate a web-search timeout

Verify the subagent returns structured error context (failure type, attempted query, partial results) rather than a generic "search unavailable," and that the coordinator can proceed with partial results plus a coverage-gap annotation. In the scaffold: set SIMULATE_TIMEOUT = True at the top of src/coordinator.py. The invoke_subagent function short-circuits the web_search path with a structured {"isError": true, "errorCategory": "transient", "failure_type": "timeout", "partial_results": [], "alternative_approaches": [...]} payload. That is the same kind of structured shape the recipe builds by hand, though the key names differ: the scaffold uses snake_case (partial_results) and the step 1 example uses camelCase (partialResults). Make the coordinator prompt name whichever keys your tool actually returns.

  1. Plant a deterministic timeout. Random failures are good for recipe 1's structured-errors work, but here you want a specific failure you can reproduce on demand. Make the web_search tool fail on a specific seed query:
    # inside web_search's underlying tool wrapper
    if "renewable storage 2024" in query.lower():
        return {
            "isError": True,
            "errorCategory": "timeout",
            "attemptedQuery": query,
            "partialResults": [
                {"url": "...", "title": "...", "publication_date": "2024-03-15"},
                {"url": "...", "title": "...", "publication_date": "2024-08-02"},
            ],
            "message": "Search timed out at hit 3/5; partial results returned.",
        }
    Note the structure: it's not just {"error": "timeout"}. The error includes attemptedQuery, partialResults, and a human-readable message. The upstream code uses all three.
  2. Update the coordinator's prompt to handle structured errors. Without this instruction the coordinator's default behavior is either to terminate or to silently produce a partial answer marked as success. Both are anti-patterns. Add to the prompt:
    "If a subagent returns isError: true with partialResults: "
    "1. INCORPORATE the partial results into your synthesis. "
    "2. ADD a 'coverage gap' section to the final report that names the "
    "   failed query and quantifies the missing data (e.g., '3 of 5 expected "
    "   sources missing'). "
    "3. Do NOT terminate the run. "
    "4. Do NOT mark the report as complete without the coverage-gap note. "
    "If isError: true but no partialResults, attempt the search ONCE with a "
    "reformulated query before degrading gracefully."
  3. Run a topic that triggers the planted timeout. python -m src.main "energy storage trends including renewable storage 2024". The "renewable storage 2024" substring will hit your planted failure. Watch the coordinator's transcript.
  4. Inspect the final report. Expected: it includes a section like "Coverage gap: the search for 'renewable storage 2024' timed out after returning 2 of 5 expected sources. The findings below reflect partial data; an additional 60% of expected sources were not analyzed." The two partial-result documents are still incorporated. The report is complete enough to be useful, honest enough to be auditable.
  5. Test the anti-patterns to feel why the contract matters.
    • Anti-pattern A: remove the "ADD a coverage gap section" instruction. Re-run. The report now silently produces a partial answer with no gap annotation. Look at it — you can't tell anything went wrong unless you compare to the no-timeout run. That's the failure mode.
    • Anti-pattern B: change web_search to return just "search unavailable" instead of the structured shape. Re-run. The coordinator either terminates the whole workflow or invents details to fill the gap. Both are catastrophic in production.
    Restore the correct behavior after these tests.
  6. Verify: the planted-timeout run produces a complete final report that explicitly names the coverage gap. You can quote the gap sentence verbatim. The "how should this failure flow back?" exam pattern is exactly this contract: structured errors carry partial state and a category; the upstream agent annotates rather than swallows. When you see Domain 5 questions about error propagation, this is the answer.
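The lab has the coordinator write the coverage-gap sentence from a prompt instruction. If you would rather compute the numbers in code and hand the model a finished sentence, a sketch might look like the following. coverage_gap_note and the expected_sources parameter are hypothetical; the payload keys follow the step 1 example.

```python
def coverage_gap_note(error: dict, expected_sources: int) -> str:
    """Build a coverage-gap sentence from a structured error payload."""
    if not error.get("isError"):
        return ""  # nothing failed, so there is no gap to report
    got = len(error.get("partialResults", []))
    missing = max(expected_sources - got, 0)
    pct = round(100 * missing / expected_sources) if expected_sources else 0
    query = error.get("attemptedQuery", "unknown query")
    category = error.get("errorCategory", "unknown")
    return (
        f"Coverage gap: the search for '{query}' failed ({category}) after "
        f"returning {got} of {expected_sources} expected sources; "
        f"{missing} sources ({pct}%) were not analyzed."
    )

payload = {
    "isError": True,
    "errorCategory": "timeout",
    "attemptedQuery": "renewable storage 2024",
    "partialResults": [{"url": "..."}, {"url": "..."}],
}
note = coverage_gap_note(payload, expected_sources=5)
print(note)
```

Deriving the counts deterministically removes one thing the model could get wrong; the decision to annotate rather than swallow the failure stays the same.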
What success looks like
  • One coordinator request produces a final report citing at least three distinct sources, each tagged with URL and publication date.
  • Timing the parallel-spawn version vs sequential: the parallel version should be at least 2× faster on a four-subagent fan-out. (If it isn't, you're spawning across turns instead of in one response.)
  • The decomposition failure case: with a procedural coordinator prompt, the report misses obvious topic areas. With a goals-and-criteria prompt, it covers them. Both runs are the lesson.
  • A planted contradiction in two fixture documents shows up in the final report as two annotated values with attribution, not one value silently chosen.
  • Killing the web_search subagent mid-run produces a final report that explicitly notes coverage gaps, rather than terminating the workflow or silently producing a partial answer marked as success.

Lab 3: Claude Code for a real team workflow

≈ 2 hours · domains 2, 3

Goal: Configure a project the way Domain 3 questions assume it's configured. By the end you'll have personally observed the precedence rules between user/project/path-scoped configuration, which is what most Domain 3 questions actually test.

Before you start
  • Install Claude Code. claude --version should print a version number. If not, follow the Claude Code guide install steps first.
  • A repo you can break safely. Don't do this lab in production code. Create a sandbox:
    mkdir lab3-cc-config && cd lab3-cc-config
    git init && mkdir -p src/api src/web tests
    echo "console.log('hello')" > src/web/main.js
    echo "describe('x', ()=>{})" > tests/x.test.js
    echo "exports.handler = async () => {}" > src/api/handler.js
    git add . && git commit -m "scaffold"
  • Have a teammate or second machine for step 3 (project-scoped slash command verification). If neither is available, you can simulate by cloning the same repo into a second directory under a different user shell.
  • One env var for step 5. Pick any token-shaped value you have lying around — even a fake one is fine for verifying expansion. export API_TOKEN=demo-not-a-real-secret in the shell you'll run claude from.
  • Reference docs: the Claude Code guide on this site is the single best companion for this lab. Keep it open in a tab; the recipes below name the exact features it covers.
1
CLAUDE.md at three levels

Create three CLAUDE.md files with distinguishable markers so you can see the merge. The merge is determined by your working directory, not by which files exist. That's the lesson.

  1. Set up the user-level CLAUDE.md. The user-level file applies on every Claude Code session you run on your machine, on any repo. It's where personal preferences live ("I prefer terse responses," "default to Python 3.12"). Create the file if it doesn't already exist:
    mkdir -p ~/.claude
    echo "USER-LEVEL: prefer concise responses." >> ~/.claude/CLAUDE.md
    The marker text is deliberately conspicuous, so it is instantly recognizable in the /memory output. If you already have content in ~/.claude/CLAUDE.md, append this line; don't overwrite.
  2. Set up the project-root CLAUDE.md. The project-root file applies to anyone running Claude Code anywhere in this repo. It's where team standards live ("indent with 2 spaces," "use Vitest, not Jest"). Create it at the repo root of your sandbox:
    echo "PROJECT-ROOT: this repo uses 2-space indent." > CLAUDE.md
    git add CLAUDE.md && git commit -m "add project CLAUDE.md"
    The commit matters — project CLAUDE.md is part of the repo, so teammates get it on clone. That's the whole point of "project-level" vs "user-level."
  3. Set up the subdirectory CLAUDE.md. This is the level most people don't know about. A CLAUDE.md inside a subdirectory of the repo is loaded at startup only when Claude is launched from inside that subdirectory. Launched from the repo root, it stays out of context until Claude reads a file under that subdirectory, at which point it is pulled in on demand. It's where area-specific conventions live ("API handlers must return {statusCode, body}"). Create it:
    echo "API-AREA: every handler returns { statusCode, body }." > src/api/CLAUDE.md
    git add src/api/CLAUDE.md && git commit -m "add api/CLAUDE.md"
    Note the path — it must be exactly src/api/CLAUDE.md, not src/CLAUDE.md or api/CLAUDE.md. Subdirectory matching is exact-prefix on your CWD.
  4. Run Claude from the repo root and inspect /memory. The /memory slash command (built into Claude Code) shows you which CLAUDE.md files are currently loaded. From the repo root:
    cd ~/path/to/lab3-cc-config   # the repo root
    claude
    # At the prompt, type: /memory
    Expected output in a fresh session: a list of loaded files including ~/.claude/CLAUDE.md (USER-LEVEL) and ./CLAUDE.md (PROJECT-ROOT). src/api/CLAUDE.md should NOT be in the list, because nothing under src/api/ has been read yet. Type /exit to quit.
  5. Now cd into src/api/ and run Claude again. Same machine, same repo, same files on disk — only your working directory changed.
    cd src/api/
    claude
    # /memory
    Expected: all three CLAUDE.md files in the loaded list. The subdirectory file activates because you're now inside its scope.
  6. Verify the directory-driven merge. Open both /memory outputs side by side. The same three files exist on disk in both runs; only the working directory differs; the merged context differs accordingly. Common failure modes:
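    • All three files load from the repo root → either Claude already read a file under src/api/ earlier in the session (start a fresh session and run /memory before anything else), or the API-AREA line was written to ./CLAUDE.md instead of src/api/CLAUDE.md. Check the path.
    • Only USER-LEVEL loads from the repo root → ./CLAUDE.md is missing or misnamed at the repo root; confirm step 2's echo ran in the sandbox root. A missing commit is not the cause (uncommitted files do load), but teammates won't receive the file until you commit it.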
    • /memory shows zero files → claude isn't recognizing the directory. Check that you actually changed directories before launching.
    This directory-driven merge is the entire mechanic the exam tests in Pattern E ("where should this configuration live?"). Project-wide rules in the root, area-specific rules in subdirectories, personal preferences in ~/.claude.
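To make the directory-driven merge concrete, here is a simplified model of launch-time loading: the user-level file first, then every CLAUDE.md on the path from the repo root down to the working directory. It is a sketch for reasoning about the lab, not Claude Code's actual implementation, and it ignores anything loaded later in a session. memory_files_at_launch and the temporary paths are hypothetical.

```python
from pathlib import Path
import tempfile

def memory_files_at_launch(cwd: Path, repo_root: Path, user_file: Path) -> list:
    """Simplified model: user file, then each CLAUDE.md from repo_root down to cwd."""
    loaded = [user_file] if user_file.exists() else []
    # Keep only the directories between the repo root and cwd, inclusive.
    chain = [d for d in [cwd, *cwd.parents] if d == repo_root or repo_root in d.parents]
    for directory in reversed(chain):  # root first, deepest directory last
        candidate = directory / "CLAUDE.md"
        if candidate.exists():
            loaded.append(candidate)
    return loaded

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    home_file = base / "home" / ".claude" / "CLAUDE.md"
    repo = base / "lab3-cc-config"
    for f in (home_file, repo / "CLAUDE.md", repo / "src" / "api" / "CLAUDE.md"):
        f.parent.mkdir(parents=True, exist_ok=True)
        f.write_text("marker\n")
    from_root = memory_files_at_launch(repo, repo, home_file)
    from_api = memory_files_at_launch(repo / "src" / "api", repo, home_file)
    print(len(from_root), len(from_api))
```

The same three files exist in both calls; only the cwd argument changes, which mirrors the two /memory runs in steps 4 and 5.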
2
Path-scoped rules with globs

Path-scoped rules attach to specific files via globs, so a rule loads only when relevant. This is what saves your global CLAUDE.md from becoming 800 lines.

  1. Why .claude/rules/ exists. CLAUDE.md hierarchies are great for directory-scoped rules, but they don't help when a rule should apply to file patterns regardless of where they live ("every test file in the repo, wherever it is"). Path-scoped rules fill that gap. They live in .claude/rules/ at the repo root and use a YAML frontmatter paths: field to declare which files they activate for.
  2. Create .claude/rules/tests.md. Put a glob in the frontmatter that matches your project's test naming convention. The body is the rule itself.
    mkdir -p .claude/rules
    cat > .claude/rules/tests.md <<'EOF'
    ---
    paths: ["**/*.test.*"]
    ---
    TESTS-RULE: always use describe/it blocks, never bare assertions.
    Mock external HTTP with msw, not jest.mock.
    EOF
    The ** in the glob means "any depth." So tests/foo.test.js and src/feature/bar.test.ts both match.
  3. Create .claude/rules/api.md with a directory-scoped glob. This rule should only apply to files under src/api/, regardless of extension.
    cat > .claude/rules/api.md <<'EOF'
    ---
    paths: ["src/api/**/*"]
    ---
    API-RULE: every handler logs a request_id at info level on entry.
    Errors must return { statusCode, body: { error: { code, message } } }.
    EOF
    git add .claude/rules && git commit -m "add path-scoped rules"
    Commit these — they're team artifacts, just like CLAUDE.md.
  4. Test case 1: a file that matches no rules. In Claude, ask it to edit src/web/main.js (a regular source file under web, not API, not tests).
    claude
    # "edit src/web/main.js and add a console.log at the top"
    # Then: /memory
    Expected: /memory shows your CLAUDE.md files but neither rule. The file matches neither glob.
  5. Test case 2: a file that matches only the tests rule. Ask Claude to edit tests/x.test.js. Then /memory. Expected: TESTS-RULE in the loaded set, API-RULE not.
  6. Test case 3: a file that matches only the api rule. Ask Claude to edit src/api/handler.js. Then /memory. Expected: API-RULE loaded, TESTS-RULE not.
  7. Verify the 3×2 grid. Three observations (one per test case) should fill a small mental table: file path × rule. Only matching cells are populated. If a rule is loading on a file it shouldn't:
    • Your paths: glob is too loose. Test it independently with git ls-files ':(glob)your_pattern' to see what it actually matches. grep -E would treat the pattern as a regular expression, not a glob; git's glob pathspec is a close approximation of the rule matcher, not a guarantee.
    • Globs use ** for any depth and * for a single segment. src/api/* matches one level; src/api/**/* matches all depths.
    The path-scoping mechanic is what makes a 50-rule project tractable. Without it, every rule lives in the global CLAUDE.md and pollutes every conversation.
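The file × rule table can be worked out ahead of time with a small glob matcher. This is a sketch of common ** and * semantics as the recipe describes them; glob_to_regex is a hypothetical helper, and Claude Code's own matcher may differ in edge cases.

```python
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob using ** and * into an anchored regex (simplified)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:.*/)?")   # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(r".*")         # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append(r"[^/]*")      # anything within one path segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")

rules = {"TESTS-RULE": "**/*.test.*", "API-RULE": "src/api/**/*"}
files = ["src/web/main.js", "tests/x.test.js", "src/api/handler.js"]

# One row per file, listing the rules whose glob matches it.
grid = {
    f: [name for name, pat in rules.items() if glob_to_regex(pat).match(f)]
    for f in files
}
for f, matched in grid.items():
    print(f, "->", matched or "no rules")
```

If the /memory output disagrees with this table, the difference is worth investigating before assuming the rule file is wrong.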
3
Project-scoped slash command

Project commands live in the repo; user commands live in your home directory. Project-scoped = team artifact; user-scoped = personal shortcut.

  1. Why slash commands matter here. Slash commands are reusable prompts. A team-shared /review command guarantees everyone runs PRs through the same checklist; a personal /draft-email command is your own shortcut. The location of the file decides the scope. The exam questions in Pattern E ("where should this configuration live?") consistently test this distinction.
  2. Create the project-scoped /review command. Project commands live in .claude/commands/ at the repo root. The filename becomes the command name. The body is the prompt that runs when the command is invoked.
    mkdir -p .claude/commands
    cat > .claude/commands/review.md <<'EOF'
    ---
    description: Quick PR review checklist
    ---
    Review the current diff for:
    (1) untested edge cases,
    (2) error paths and what happens when they fire,
    (3) public API changes that need migration notes,
    (4) any commented-out code that shouldn't ship.
    
    Output as a markdown checklist with brief justifications.
    EOF
    The description field in the frontmatter is what shows up in the slash-command picker, so make it skimmable.
  3. Commit the command. This is the whole point of project-scoped — it travels with the repo. Without the commit, only your machine has it.
    git add .claude/commands/review.md
    git commit -m "add /review slash command"
  4. Verify on your machine. Open claude in the repo, type / at the prompt. Expected: /review appears in the picker with its description. Type /review and run it on any diff to confirm it produces the checklist.
  5. Verify on a teammate's clone (or simulated second user). Clone the repo to a second directory under a different shell (a separate teammate is ideal; a fresh git clone in another folder works too). cd in, run claude, type /. Expected: /review still there — the file traveled with the repo because you committed it.
  6. Now create a user-scoped command for contrast. User commands live in ~/.claude/commands/. They're never committed; they're your personal shortcuts.
    mkdir -p ~/.claude/commands
    cat > ~/.claude/commands/scratch.md <<'EOF'
    ---
    description: Brainstorm a quick draft on any topic
    ---
    Brainstorm three rough takes on the topic the user provides.
    No structure, no headers — just three angles.
    EOF
    In your sessions, /scratch shows up on any repo. In the teammate's clone of the lab repo, /scratch does NOT show up — they have their own home directory.
  7. Verify the asymmetry directly. In your session: / picker shows both /review and /scratch. In the teammate's clone: /review appears (project), /scratch does not (user, theirs would be different). That asymmetry — project commands ride along with the repo, user commands are local — is the answer to every "where should this configuration live?" Pattern E question. The bug pattern people get burned by: putting an organization-wide command in ~/.claude/commands/ and wondering why teammates don't have it.
4
An isolated skill with context fork

context: fork runs a skill in its own conversation. The main thread sees only the skill's final answer, not its intermediate reasoning or file contents. That isolation keeps the main context lean.

  1. Why isolated skills matter. A skill that reads a big file (a changelog, a long log, a CSV) normally pulls the whole file into the conversation context, where it stays for the rest of the session, eating tokens on every subsequent turn. context: fork says "run this skill in a parallel conversation; bring back only the final answer." The main conversation never sees the raw bulk. For frequently invoked skills (changelog scans, dependency audits, log grepping), this is the difference between a clean session and one that hits context limits by the end of the day.
  2. Create the skill file. Skills live in .claude/skills/<name>/SKILL.md. The directory is named after the skill; the SKILL.md inside contains frontmatter (metadata) and a body (the skill's instructions).
    mkdir -p .claude/skills/scan-changelog
    cat > .claude/skills/scan-changelog/SKILL.md <<'EOF'
    ---
    name: scan-changelog
    description: Summarize a CHANGELOG.md and extract breaking changes
    context: fork
    allowed-tools: ["Read", "Grep"]
    argument-hint: "<path to CHANGELOG.md>"
    ---
    Read the file at the provided path. Return ONLY:
    - A one-line summary of the most recent release.
    - A bullet list of every "BREAKING:" line.
    
    Do not read any files other than the path provided.
    Do not summarize anything beyond the most recent release
    and the breaking-changes list.
    EOF
    The frontmatter fields matter:
    • context: fork — the whole point. Without this, the skill runs inline and pulls everything into your main context.
    • allowed-tools — restrictive list. The skill can only Read and Grep; it can't Write, can't run Bash, can't fetch URLs. Tight scopes are how you sleep at night.
    • argument-hint — what the user should pass when invoking. Shows up in the picker.
  3. Create a fake CHANGELOG.md to scan. Make it long enough that loading it would be noticeable. 200-500 lines is plenty.
    cat > CHANGELOG.md <<'EOF'
    # v0.5.0 (2026-05-01)
    - Added support for streaming responses.
    - BREAKING: renamed config field "api_key" to "auth_token".
    - Improved error messages for rate limits.
    - BREAKING: minimum Python version is now 3.11.
    
    # v0.4.2 (2026-03-12)
    - Fixed a race condition in the cache layer.
    - ... (paste more sections so the file is sizable) ...
    EOF
  4. Measure baseline context size. Open claude. Before invoking the skill, run /context and note the token count. Call this tokens_before.
  5. Invoke the skill. Type /scan-changelog ./CHANGELOG.md. Expected: the skill prints a one-line summary plus a bullet list of BREAKING: lines. The output is brief — three or four bullets, max.
  6. Re-measure context. Run /context again. Compare to tokens_before. Expected: the count has barely moved — only the skill's final output (a few hundred tokens) entered your context, not the full changelog (potentially thousands).
  7. Run the skill two more times. Same invocation. After each run, check /context. Expected: context grows linearly by the size of the brief output, not by the size of the changelog. Three invocations of a fork skill add less to the main context than one invocation of an inline skill on the same file.
  8. Verify the contrast with no-fork. Edit SKILL.md and comment out the context: fork line (or change it to context: inline). Reset your session. Re-run /scan-changelog. Now /context jumps by the size of the CHANGELOG.md each invocation. Restore context: fork when you're done. That difference is the whole point of the setting. For any skill that reads bulk data, fork should be the default. Inline is for skills whose intermediate state you want to keep in the main conversation (rare).
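The /context comparison in steps 6 to 8 comes down to simple arithmetic. A toy model, with hypothetical token counts (the real numbers are whatever /context reports in your session):

```python
def main_context_growth(invocations: int, file_tokens: int,
                        output_tokens: int, fork: bool) -> int:
    """Tokens added to the main conversation after N skill invocations (toy model)."""
    # A forked skill returns only its answer; an inline skill also leaves the file behind.
    per_call = output_tokens if fork else file_tokens + output_tokens
    return invocations * per_call

FILE_TOKENS = 6000    # hypothetical size of CHANGELOG.md
OUTPUT_TOKENS = 300   # hypothetical size of the skill's brief answer

forked = main_context_growth(3, FILE_TOKENS, OUTPUT_TOKENS, fork=True)
inline = main_context_growth(1, FILE_TOKENS, OUTPUT_TOKENS, fork=False)
print(forked, inline)
```

Under these assumed sizes, three forked runs add far less to the main context than a single inline run, which is the pattern to look for in your own measurements.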
5
Project + user MCP servers

Project MCP travels with the repo and uses env-var indirection so secrets don't get committed. User MCP is yours alone. Both can be active at once.

  1. Why MCP scope matters. MCP (Model Context Protocol) servers expose tools to Claude. Where you declare a server determines who gets it: .mcp.json in the repo root means "everyone who clones this repo gets this server"; ~/.claude.json means "only me." The exam's Pattern E question regularly tests whether you know the difference and the env-var-expansion mechanic that prevents secrets from leaking into committed files.
  2. Create the project-scoped server. Project MCP lives in .mcp.json at the repo root. Use the env-var expansion syntax (${VAR}) for anything secret — the JSON gets committed, but the JSON only contains the placeholder, never the actual secret.
    cat > .mcp.json <<'EOF'
    {
      "mcpServers": {
        "demo-fs": {
          "command": "npx",
          "args": ["-y", "@modelcontextprotocol/server-filesystem", "."],
          "env": { "AUTH_TOKEN": "${API_TOKEN}" }
        }
      }
    }
    EOF
    git add .mcp.json && git commit -m "add project MCP server"
    The ${API_TOKEN} placeholder is resolved from your shell environment at server-launch time, so the committed file never holds the secret. Had you committed the literal token instead, it would now be in git history.
  3. Create a personal MCP server. User-scoped servers live in ~/.claude.json. Use a different name from the project server so they don't collide. A second filesystem server pointed at a scratch directory is a fine demo target. Don't append with >> — that file is JSON, and an append produces invalid JSON. Open it in your editor and merge the personal-fs entry under the existing mcpServers key (or add the mcpServers key if it doesn't exist yet):
    mkdir -p /tmp/scratch   # so the server has something to point at
    # Open ~/.claude.json in your editor and add:
    {
      "mcpServers": {
        "personal-fs": {
          "command": "npx",
          "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp/scratch"]
        }
        // ...any existing entries here, merged alongside...
      }
      // ...other top-level keys preserved...
    }
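    Validate with jq . ~/.claude.json > /dev/null && echo OK after the edit. The // lines in the snippet above are placeholders for your existing entries, not valid JSON, so leave them out. If jq errors, you have a syntax problem to fix before launching claude. Otherwise you'll spend ten minutes wondering why your MCP servers never appear.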
  4. Confirm $API_TOKEN is set in the shell you'll run Claude from. The expansion happens at server-launch time, which is when claude starts — so the var must be in that shell's environment, not just present somewhere on disk.
    echo $API_TOKEN
    # if blank:
    export API_TOKEN=demo-not-a-real-secret
  5. Start Claude from the repo root and inspect /mcp. The /mcp slash command shows all MCP servers currently loaded and their connection state.
    cd ~/path/to/lab3-cc-config
    claude
    # /mcp
    Expected: both demo-fs and personal-fs listed as connected. If only one is listed, the other failed to start — read the /mcp details for the error.
  6. Try a tool from each server. Both filesystem servers expose tools like read_file and list_directory. Ask Claude to list files in the repo (should hit demo-fs, scoped to .) and then list files in /tmp/scratch (should hit personal-fs). Both should succeed. Watch which server gets routed to in the trace.
  7. Verify the env-var failure mode. This is the most instructive part. In the shell, unset the token and restart Claude:
    unset API_TOKEN
    claude
    # /mcp
    Expected: demo-fs fails with an error that names the missing API_TOKEN variable (exact wording varies by version), not a silent fallback to a blank value. Re-export the var and restart to recover. The visibility of that failure is the lesson: env-var expansion is a real dependency, and the system tells you when it's broken rather than papering over it. Pattern E questions reward knowing exactly this: project MCP + env-var expansion = secrets stay out of git, but the dependency is explicit.
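The expansion mechanic in this recipe can be sketched in a few lines: placeholders are replaced from the environment at launch, and a missing variable raises instead of becoming an empty string. expand_env is a hypothetical helper that handles only the plain ${VAR} form; it illustrates the contract and is not how Claude Code implements it.

```python
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand_env(value: str, env: dict) -> str:
    """Replace ${VAR} placeholders; raise instead of substituting a blank."""
    def replace(match):
        name = match.group(1)
        if name not in env:
            raise ValueError(f"missing environment variable: {name}")
        return env[name]
    return _VAR.sub(replace, value)

# What gets committed in .mcp.json: the placeholder, never the secret.
server_env = {"AUTH_TOKEN": "${API_TOKEN}"}

shell = {"API_TOKEN": "demo-not-a-real-secret"}
resolved = {k: expand_env(v, shell) for k, v in server_env.items()}
print(resolved)

try:
    expand_env("${API_TOKEN}", {})   # simulates `unset API_TOKEN`
    failed_loudly = False
except ValueError as exc:
    failed_loudly = True
    print("launch would fail:", exc)
```

In a real check you would pass dict(os.environ) as env. Raising on a missing name is the design choice that matters: a blank token would let the server start and fail later in a less obvious place.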
6
Plan vs direct on three scopes

Run three tasks back-to-back, deliberately choosing the right mode each time. The exam's Pattern A and Pattern D questions reward exactly this kind of "match the mode to the blast radius" thinking.

  1. Why plan mode exists. Default mode lets Claude make edits as it reads. That's great when the work is small and reversible (a typo, a comment). It's catastrophic when the work spans many files and a mistake means an hour of git reset and re-explanation. Plan mode constrains Claude to read-only until it produces a written plan you approve. Picking the right mode for a task isn't a style preference; it's a blast-radius decision.
  2. Direct mode — a single-file fix. First, plant a small bug to fix. Open src/web/main.js and introduce a typo: change console.log to cosnole.log. Save. Now ask Claude in default mode:
    claude
    # at the prompt:
    "There's a typo in src/web/main.js — find and fix it."
    Expected: Claude reads the file, identifies the typo, edits it, summarizes the change. One file, one line, fully reversible. This is what direct mode is for — work you could describe in advance, on a small surface.
  3. Plan mode — a wide rename. Restart in plan mode. Two ways to enter:
    claude --permission-mode plan
    # OR start normally and press Shift+Tab to toggle into plan mode
    Ask: "rename every variable named foo to bar across the entire repo, including tests and comments". Claude will read files (you'll see Read tool calls) but will NOT edit anything. It produces a written plan listing every file it intends to touch and the exact changes.
  4. Read the plan critically before approving. Check: did it find every file? Did it correctly avoid string literals or filenames that contain "foo" but shouldn't be renamed? Are there ambiguous cases (e.g., a function named fooBar — should that become barBar or stay)? If the plan looks wrong, type your corrections and ask Claude to revise. Only approve when the plan is right.
  5. Plan-then-execute — a new feature. Restart in plan mode again. Ask: "add a /healthz endpoint to src/api/ that returns {ok: true} and logs a request_id". Read the plan. This time the plan covers code you didn't yet have — a fresh file, an entry in a router, possibly tests.
  6. Edit the plan inline. Press Ctrl+G (or the IDE-specific equivalent) to open the plan in your editor. Tweak it: change the endpoint path, adjust the response shape, add a test you want. Save the file; Claude picks up the edits and the plan is now your hybrid. Approve.
  7. Verify by articulating each choice. Write one sentence for each task explaining why the mode you used was right:
    • Typo fix → direct. The diff is one character. You can describe it before Claude touches anything. Plan mode would be ceremony.
    • Wide rename → plan. Easy to miss a file or accidentally rename a string. The plan is your chance to catch errors before they're committed.
    • New feature → plan-then-execute. The shape isn't fully specified — you want to shape it before Claude commits. The plan is a low-cost design-review surface.
    If any of your three reasons feels like "I just guessed which mode to use," redo the task in the wrong mode and feel why it's wrong. This blast-radius-driven mode selection is exactly what the exam's Pattern A ("how would you address this") and Pattern D ("how should this failure flow back") questions reward.
What success looks like
  • Running /memory from repo root vs src/api/ shows visibly different merges — your three CLAUDE.md markers are the proof. If they all show up regardless of directory, your subdirectory CLAUDE.md isn't being scoped; check the file's location.
  • Editing tests/x.test.js loads only TESTS-RULE; editing src/api/handler.js loads only API-RULE; editing src/web/main.js loads neither. If a rule is loading on a file it shouldn't, your paths: glob is too loose.
  • /review appears in the slash-command picker in both your session and a teammate's clone (after they pull). The user-scoped /scratch shows up only in your sessions, never in theirs. That asymmetry is the lesson.
  • The scan-changelog skill returns its summary, but /context in the main conversation doesn't grow by the size of the changelog. Run it a second time and /context still doesn't grow. That's the fork doing its job.
  • /mcp shows both servers connected. Unset $API_TOKEN in your shell, restart claude; the project server now fails to connect with a clear "missing env var" message, not a silent fallback. That failure path is the lesson — env-var expansion is real, not optional.
  • You can articulate, in one sentence each, why you used direct mode for the typo, plan mode for the rename, and plan-then-execute for the new endpoint. If all three feel like "I just guessed," redo them deliberately.

Lab 4: Structured extraction with validation and batch

≈ 2 hours · domains 4, 5

Goal: Build the extraction system from Scenario 6. By the end, you will have practiced every Domain 4 schema-design decision the exam tests.

Fast pass

Skip the plumbing, keep the lessons

Pydantic schema with optional/nullable, enum + "other" + detail, and an "unclear" enum value. Tool_use call with forced tool_choice. Ten invoice fixtures including five clean ones and five planted edge cases (missing fields, bad totals, informal dates, off-enum category, internal contradiction). Validators and batch are stubbed for steps 5 and 6.

★ Open Lab 4 on GitHub
    git clone https://github.com/darindeters/claude-architect-labs && cd claude-architect-labs/lab4-extract
Before you start
  • Reuse the environment shape from Lab 1. Inside lab4-extract/, run pip install -r requirements.txt (which installs anthropic, pydantic, and python-dotenv). Don't pip install them by hand — the requirements file pins versions that match the scaffold.
  • Pick one document type and stick with it. The scaffold uses invoices because the fields are unambiguous (vendor, line items, totals, dates) and edge cases are easy to plant. If you're building from scratch, three good choices in order of difficulty:
    • Invoices (easiest, what the scaffold uses): clear required fields, plus standard edge cases. The scaffold's 10 fixtures already cover the planted-problem matrix.
    • Scientific abstracts: varied structure (inline citations vs bibliographies) is great practice for step 3's few-shot examples.
    • SEC 10-K excerpts (hardest): long, heterogeneous, lots of nullable fields. Skip on first pass unless you're comfortable.
  • Don't start with the batch API. Get the synchronous flow working on five documents first; step 6 introduces batching only after the schema is reliable. Most lab failures come from people skipping ahead to batching and debugging two layers at once. The scaffold's src/batch.py is intentionally stubbed for exactly this reason.
  • What's already scaffolded vs what you'll do: the repo ships a working src/schema.py (Pydantic Invoice + LineItem + ExpenseCategory enum, plus the auto-generated INVOICE_TOOL_SCHEMA), a working src/extract.py sync extractor with forced tool_choice and few-shot examples, and ten fixtures in fixtures/ (named invoice_001.txt through invoice_010_conflict.txt — the suffixes name the planted problem). You'll fill in three TODOs: retry_with_feedback in extract.py (step 3), validate_totals_match in validate.py (step 5), and the three functions in batch.py (step 6).
  • Tool name: the scaffold calls its extraction tool record_invoice, not extract_invoice. Cosmetic difference; the recipes use extract_invoice for legibility — pick one and stay consistent. (If you're using the scaffold, leave it as record_invoice; if you change the name, update both INVOICE_TOOL_SCHEMA["name"] and the tool_choice dict in extract.py.)
  • Reference docs: the tool use guide and the Message Batches API reference. Note Batches has no multi-turn tool calling; you can't do validation-retry inside a batch request.
1
Define an extraction tool

JSON schema with required, optional (nullable), and enum fields. One enum should have an "other" value paired with a detail string field. Set tool_choice to force this specific tool.

  1. Open schema.py and define the Pydantic model. Pydantic v2 gives you typed fields, automatic JSON schema generation, and validation in one package — that's why extraction labs use it. The shape decisions you make here drive every later recipe, so be deliberate. The single most important call is which fields are required (no default) vs optional/nullable (Optional[T] = None). Required means the model MUST emit a value, even if it has to invent one. Optional means "absent is fine."
    from pydantic import BaseModel
    from typing import Optional, Literal
    from datetime import date
    
    class LineItem(BaseModel):
        description: str
        quantity: int
        unit_price: float
    
    class Invoice(BaseModel):
        vendor_name: str                       # required
        invoice_date: Optional[date] = None    # optional / nullable
        line_items: list[LineItem]             # required, at least []
        total: float                           # required
        payment_terms: Literal["net30", "net60", "due_on_receipt", "other"]
        payment_terms_detail: Optional[str] = None   # paired with "other"
    Pay attention to payment_terms + payment_terms_detail. The enum has an "other" escape hatch; the detail field captures whatever the actual phrase was. This pairing is what saves you when an invoice says "due on completion" — that's not in your enum, but the model can mark it as "other" with payment_terms_detail: "due on completion" instead of misclassifying or fabricating.
  2. Generate the JSON schema from the Pydantic model. Pydantic v2's .model_json_schema() emits a clean JSON Schema dict that you can hand directly to the Anthropic API. Don't write the schema by hand — keeping the Pydantic model as the single source of truth means there's nothing to drift.
    schema_dict = Invoice.model_json_schema()
    # print(schema_dict) once to inspect; it should include
    # required, properties, $defs (for LineItem), enum on payment_terms.
  3. Build the tool definition in extract.py. The Anthropic API expects a list of tools where each has name, description, and input_schema. Plug the Pydantic-generated schema directly into input_schema. The description is what tells the model when/how to use the tool — keep it specific.
    from anthropic import Anthropic
    from schema import Invoice
    
    client = Anthropic()
    
    tools = [{
        "name": "extract_invoice",
        "description": (
            "Extract fields from an invoice document. "
            "Use the provided schema exactly. Return null for absent fields; "
            "do not invent values. Use payment_terms='other' with a "
            "payment_terms_detail string for terms not in the enum."
        ),
        "input_schema": Invoice.model_json_schema(),
    }]
    The two clauses about "return null for absent" and "use 'other' for off-enum" are doing the work that recipes 2 and 5 verify. Without them in the description, the model's default behavior is to fabricate.
  4. Force the tool with tool_choice. When extraction is the entire purpose of the call, you don't want the model to optionally use the tool — you want it to use this exact tool, always. tool_choice handles that.
    def extract(document_text: str) -> Invoice:
        resp = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            tools=tools,
            tool_choice={"type": "tool", "name": "extract_invoice"},
            messages=[{
                "role": "user",
                "content": f"Extract this invoice:\n\n{document_text}",
            }],
        )
        # The response will contain exactly one tool_use block.
        tool_use = next(b for b in resp.content if b.type == "tool_use")
        return Invoice(**tool_use.input)
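    If you omit tool_choice, the model sometimes responds with prose ("Here's what I extracted: ...") and you have to parse text out of natural language. Forcing the tool guarantees the response contains a tool_use block every time; whether that block's input actually satisfies your schema is what the Pydantic parse in the next step checks.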
  5. Parse the response into a Pydantic model. The Invoice(**tool_use.input) call validates the model's output against your schema. If anything is off (wrong type, missing required field, enum violation), Pydantic raises ValidationError. You'll use that exception in recipe 3 to drive the retry loop. For this recipe, let it propagate so you can see when it fires.
  6. Smoke-test on one clean fixture. The scaffold ships clean invoices as invoice_001.txt through invoice_005.txt; the rest have planted problems. Run the extractor from the lab's root directory:
    python -m src.extract fixtures/invoice_001.txt
    Expected output:
    --- extracted ---
    {
      "vendor_name": "ACME LIGHTING CO.",
      "vendor_address": "123 Industrial Way, Portland, OR 97201",
      "invoice_number": "INV-2026-0142",
      "issue_date": "2026-04-15",
      ...
      "stated_total": 1787.60,
      "calculated_total": 1787.60,
      "expense_category": "other",
      "expense_category_detail": "lighting fixtures",
      "conflict_detected": false,
      "conflict_details": null,
      "notes": null
    }
    
    --- semantic validation ---
    OK
    Confirm: every meaningful field is populated (no null for vendor / dates / line items / totals), and stated_total equals calculated_total. The validator prints OK only because validate_totals_match is a TODO stub returning None; until recipe 5 is done, every fixture prints OK regardless of its math.
  7. Verify the "other" branch specifically. Plant an invoice with payment terms that don't match the enum (e.g., "net 45 days" or "due upon completion"). Run extraction. Expected: payment_terms == "other" AND payment_terms_detail contains the actual phrase. If "other" is selected with a null detail, the schema isn't expressing the pairing tightly enough — try adding a Pydantic @model_validator that enforces the constraint. If the model never picks "other" and instead misclassifies as the closest enum value, the tool description doesn't make the escape hatch explicit enough.
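The @model_validator suggested in step 7 might look like the following minimal sketch. It assumes Pydantic v2 and uses a trimmed, hypothetical PaymentTerms model (only the two paired fields) rather than the scaffold's full Invoice; the check helper is likewise illustrative.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError, model_validator


class PaymentTerms(BaseModel):
    # Hypothetical trimmed model: just the enum and its paired detail field.
    payment_terms: Literal["net30", "net60", "due_on_receipt", "other"]
    payment_terms_detail: Optional[str] = None

    @model_validator(mode="after")
    def other_requires_detail(self) -> "PaymentTerms":
        # "other" is only useful if the original phrase is preserved.
        detail = (self.payment_terms_detail or "").strip()
        if self.payment_terms == "other" and not detail:
            raise ValueError("payment_terms='other' requires payment_terms_detail")
        return self


def check(payload: dict) -> str:
    """Return 'ok', or the first validation message for the payload."""
    try:
        PaymentTerms(**payload)
        return "ok"
    except ValidationError as e:
        return e.errors()[0]["msg"]


print(check({"payment_terms": "other"}))
print(check({"payment_terms": "other", "payment_terms_detail": "due on completion"}))
```

Because the rule lives in the model, an "other" with a null detail now surfaces as a ValidationError, which is the same exception the retry loop in recipe 3 is driven by.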
2
Process documents with absent fields

Extract from documents where some fields don't appear. Verify the model returns null rather than fabricating values. Then mark those fields required and watch hallucination kick in. Revert. The side-by-side is the lesson.

  1. Use the planted missing-fields fixture. The scaffold ships fixtures/invoice_006_missing.txt which deliberately lacks several required-looking fields. (If you're building from scratch, edit a clean invoice and delete the date line.) The point is that no human reading the document could fill in the missing fields; the information is not there.
  2. Run extraction with the scaffold's defaults (everything Optional). The scaffold's Invoice model has every field as Optional:
    python -m src.extract fixtures/invoice_006_missing.txt
    Expected output: the missing fields come back as null. The model correctly admits they're absent. The tool description's "do not fabricate values" clause carries this; without it the model would often invent even with an optional schema.
  3. Now make a field required and re-run. Edit src/schema.py and flip one of the missing fields — e.g., vendor_name — to required:
    class Invoice(BaseModel):
        vendor_name: str                       # was Optional[str] = None — now required
        ...
    Re-run the same fixture. Don't change anything else — same document, same prompt.
  4. Inspect the fabricated value. Expected: the model invents a plausible-looking value for whichever field you made required. If you flipped a date field, common picks are today's date, the file's apparent created-date based on a header in the doc, or a date inferred from other dates mentioned (line item dates, payment terms phrasing). The fabrication is confident and never flagged as a guess. To see how confident it is, add "explain why" to the user message and watch the model justify a value it made up.
  5. Revert the schema. Put Optional[str] = None back. This isn't just cleanup — it's the lesson. Required fields should be exceptions, not defaults; "required" in JSON schema is a contract the model satisfies by any means, including invention.
  6. Document the side-by-side. Write down (or save to a file): document name, schema flavor (optional vs required), extracted value. Two lines for the same document, two different behaviors, no code change other than one Optional. This is the exam's core fabrication-under-required-fields demonstration.
  7. Verify the rule. You should be able to articulate, in one sentence: "Required fields force the model to invent when the source is silent; default to Optional/nullable and only mark a field required when you're certain it'll be in every document AND you'd rather see an error than a null." If you can quote that and point to your side-by-side outputs from step 6 as evidence, you've internalized this Domain 4 lesson. The exam's Pattern A questions ("the extraction is fabricating values; what change would most effectively address this?") map directly to making the offending field optional.
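One way to see why the one-line flip in step 3 changes the model's behavior is to look at what it does to the generated JSON schema. The sketch below assumes Pydantic v2 and uses two hypothetical two-field stand-ins for the scaffold's Invoice; only the default on vendor_name differs between them.

```python
from typing import Optional

from pydantic import BaseModel


class InvoiceOptional(BaseModel):
    # Hypothetical stand-in: every field nullable, as in the scaffold.
    vendor_name: Optional[str] = None
    invoice_number: Optional[str] = None


class InvoiceRequired(BaseModel):
    vendor_name: str                      # no default, so the schema requires it
    invoice_number: Optional[str] = None


def required_fields(model: type[BaseModel]) -> list[str]:
    # The "required" key may be absent when no field is required.
    return model.model_json_schema().get("required", [])


print(required_fields(InvoiceOptional))
print(required_fields(InvoiceRequired))
```

The model only ever sees the schema, so the "required" list is the contract it tries to satisfy, whether or not the document contains the value.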
3
Validation-retry loop

On schema-validation failure, send a follow-up with the document, the failed extraction, and the specific validation error. Track which errors resolve via retry (format mismatches) and which don't (information genuinely absent).

  1. Why retry the model on validation errors. Two failure modes look identical from outside the model: (a) the model returned a string where a date was expected — a format mismatch the model can fix if you tell it the right format; and (b) the model returned None for a required field because the document genuinely lacks it — no amount of retrying will conjure the missing information. The retry loop lets you distinguish them empirically by seeing which errors resolve on a second pass.
  2. Implement retry_with_feedback in src/extract.py. The scaffold leaves this as a NotImplementedError TODO. Bound the budget tightly — 2 attempts is plenty; the model rarely succeeds on attempt 3 when it failed on 2. Catch ValidationError, build a feedback message, and re-send with the previous attempt and the specific error. (The scaffold uses record_invoice as the tool name; recipes call it extract_invoice — match whichever name your INVOICE_TOOL_SCHEMA uses.)
    from pydantic import ValidationError
    
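    def retry_with_feedback(document_text: str, failed: Invoice, error_message: str) -> Invoice:
        # client, MODEL, SYSTEM, and INVOICE_TOOL_SCHEMA are assumed to be defined
        # or imported at module level in extract.py.
        messages = [
            {"role": "user", "content": document_text},
            # Replay the failed attempt as the assistant's previous turn.
            {"role": "assistant", "content": (
                f"Previous extraction attempt:\n{failed.model_dump_json(indent=2)}"
            )},
            # Deliver the validation error as user feedback, then ask for a redo.
            {"role": "user", "content": (
                f"That extraction failed validation: {error_message}\n\n"
                "Re-extract using the schema correctly. "
                "Use null for genuinely-absent fields; do not fabricate values."
            )},
        ]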
        resp = client.messages.create(
            model=MODEL, max_tokens=2048, system=SYSTEM,
            tools=[INVOICE_TOOL_SCHEMA],
            tool_choice={"type": "tool", "name": "record_invoice"},
            messages=messages,
        )
        for block in resp.content:
            if block.type == "tool_use" and block.name == "record_invoice":
                return Invoice.model_validate(block.input)
        raise RuntimeError("retry did not produce a tool call")
    Then wrap your top-level extraction in a loop that catches ValidationError from extract() and calls retry_with_feedback. Limit to 2 attempts total.
  3. Write format_validation_error for the model. Pydantic's e.errors() returns a list of dicts. Turn each into one plain sentence the model can act on. Naming the field path and what was expected is enough.
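    def format_validation_error(e: ValidationError, attempted: dict) -> str:
        lines = []
        for err in e.errors():
            # Model-level validators report an empty loc, so guard both uses of it.
            loc = ".".join(str(x) for x in err["loc"]) or "(whole record)"
            returned = attempted.get(err["loc"][0]) if err["loc"] else attempted
            lines.append(
                f"- field '{loc}': {err['msg']} "
                f"(you returned: {returned!r})"
            )
        return "\n".join(lines)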
  4. Plant a format-mismatch fixture. Take a clean invoice and change the date format to something the model often guesses wrong: "03/14/26" (a two-digit year that could mean 2026 or 1926, in a month/day order that a day-first parser rejects because there is no month 14). Save as fixtures/bad_date_format.txt. Run extract_with_retry (the retry wrapper from step 2). Expected behavior: attempt 1 fails with a Pydantic date-parsing ValidationError; the feedback to the model names field 'invoice_date' and what was expected; attempt 2 returns a value that parses into a date object. The model knew the date; it just emitted it in the wrong shape.
  5. Plant an absent-information fixture. Take a clean invoice and delete the date line entirely. Mark the schema field required (revert recipe 2's optional). Save as fixtures/no_date_required.txt. Run extract_with_retry. Expected: attempt 1 fails because the model returns None (or a fabrication that fails a stricter check). Attempt 2 produces the same kind of failure — the information isn't in the document. Restore the Optional after this test.
  6. Track outcomes in a small results table. Run both fixtures (and 8-10 more — mix clean, format-mismatch, and absent-info cases). Capture (filename, attempt_1_pass, attempt_2_pass, error_kind). Save as a CSV. The pattern that emerges:
    filename                  | a1 | a2 | error_kind
    clean_01.txt              | ✓  | -  | none
    bad_date_format.txt       | ✗  | ✓  | format_mismatch
    no_date_required.txt      | ✗  | ✗  | information_absent
    clean_02.txt              | ✓  | -  | none
    mixed_format.txt          | ✗  | ✓  | format_mismatch
  7. Verify the two recovery rates. Compute the percentage of format-mismatch errors that resolve on attempt 2 (typically 75-90%) and the percentage of information-absent errors that resolve (typically 0-15%, and the "successes" are usually fabrications, not real recoveries). The clear gap between those numbers is the lesson — retry isn't magic; it works on specific error classes. Quote both numbers. Domain 4 Pattern A ("validation keeps failing") almost always rewards "expand the schema (make the field optional) for genuinely-absent cases; tighten the format hint for format-mismatch cases" as the answer.
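The wrapper loop from step 2 and the recovery-rate arithmetic from steps 6 and 7 can be sketched together as below. This is a minimal illustration, not the scaffold's code: extract_with_retry takes the extractor and retry functions as parameters, the two-argument retry signature is a simplification of retry_with_feedback, and fake_extract and fake_retry are stand-ins so the example runs without an API call. It catches ValueError, which Pydantic v2's ValidationError subclasses.

```python
from collections import defaultdict
from typing import Callable, Optional


def extract_with_retry(
    document_text: str,
    extract_fn: Callable[[str], dict],
    retry_fn: Callable[[str, str], dict],
    max_attempts: int = 2,
) -> tuple[Optional[dict], list[bool]]:
    """Try extract_fn once, then retry_fn with the error text, up to max_attempts.

    Returns (result or None, one pass/fail flag per attempt made).
    """
    passed: list[bool] = []
    error_text = ""
    for attempt in range(max_attempts):
        try:
            if attempt == 0:
                result = extract_fn(document_text)
            else:
                result = retry_fn(document_text, error_text)
        except ValueError as e:  # pydantic.ValidationError is a ValueError
            passed.append(False)
            error_text = str(e)
            continue
        passed.append(True)
        return result, passed
    return None, passed


def recovery_rates(rows: list[dict]) -> dict[str, float]:
    """Share of attempt-1 failures that pass on attempt 2, per error_kind."""
    failed: dict[str, int] = defaultdict(int)
    recovered: dict[str, int] = defaultdict(int)
    for row in rows:
        if row["a1"]:
            continue  # passed first time, so it never entered the retry path
        failed[row["error_kind"]] += 1
        recovered[row["error_kind"]] += int(row["a2"])
    return {kind: recovered[kind] / failed[kind] for kind in failed}


# Stand-in extractors (hypothetical) so the sketch runs offline.
def fake_extract(text: str) -> dict:
    raise ValueError("invoice_date: invalid date format")


def fake_retry(text: str, error_text: str) -> dict:
    return {"invoice_date": "2026-03-14"}


result, passed = extract_with_retry("invoice text", fake_extract, fake_retry)

# The five rows from the results table in step 6.
rows = [
    {"filename": "clean_01.txt", "a1": True, "a2": False, "error_kind": "none"},
    {"filename": "bad_date_format.txt", "a1": False, "a2": True, "error_kind": "format_mismatch"},
    {"filename": "no_date_required.txt", "a1": False, "a2": False, "error_kind": "information_absent"},
    {"filename": "clean_02.txt", "a1": True, "a2": False, "error_kind": "none"},
    {"filename": "mixed_format.txt", "a1": False, "a2": True, "error_kind": "format_mismatch"},
]
rates = recovery_rates(rows)
print(passed, rates)
```

The error_kind label comes from you, since you planted the fixtures; the code only counts which labeled failures resolved on the second pass.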
4
Add few-shot examples for varied formats

Show extraction from inline-citation papers, papers with bibliographies, papers with embedded methodology, papers with explicit methodology sections. Two to four targeted examples. Measure the empty-field rate before and after.

  1. Establish a baseline first. Without numbers to compare to, you can't tell whether few-shot examples helped or just felt better. Run your current extractor against 10 documents that span the formats you care about — make sure you've included at least one tabular invoice, one prose-format invoice, and one with a footer breakdown. Save the results.
    from glob import glob

    total_fields = len(Invoice.model_fields)   # top-level fields per document
    results = []
    for path in sorted(glob("fixtures/*.txt")):
        with open(path) as f:
            inv = extract(f.read())
        # A field counts as empty if it came back null or as an empty list.
        empties = [k for k, v in inv.model_dump().items() if v is None or v == []]
        results.append({"file": path, "empties": empties})
    empty_rate = sum(len(r["empties"]) for r in results) / (len(results) * total_fields)
    print(f"baseline empty-field rate: {empty_rate:.1%}")
    A baseline of 8-15% empties is typical for unaided extraction on mixed formats. If your rate is 0%, your fixtures are too clean to learn from — plant edge cases.
  2. Pick 2-4 examples that cover the formats most likely to confuse the model. The point of few-shot isn't to show the model "good" extractions in general — it's to demonstrate the formats where it fails. Look at which fixtures had the most empties in your baseline; pick one from each cluster. For invoices, the typical clusters are: tabular layouts (header row + line items), prose totals ("the total amount due is..."), and split breakdowns (line items at the top, totals in a footer section).
  3. Hand-write the correct extraction for each example. For each chosen fixture, manually produce the Invoice JSON you'd want the model to return. Sanity-check by feeding it through your Pydantic model:
    example_1_output = {
        "vendor_name": "ABC Office Supplies",
        "invoice_date": "2026-04-12",
        "line_items": [
            {"description": "Stapler", "quantity": 2, "unit_price": 12.99},
            {"description": "Paper, 500 sheets", "quantity": 5, "unit_price": 8.50},
        ],
        "total": 68.48,
        "payment_terms": "net30",
    }
    # Sanity-check it parses cleanly:
    Invoice(**example_1_output)
    If Pydantic rejects your example, your hand-written JSON has the same kind of error you'd catch on the model. Fix it before showing it to the model as a "correct" answer.
  4. Add the examples to the prompt as labeled INPUT/OUTPUT pairs. The format matters — labeled pairs train the model's pattern matching far better than prose explanations. Add them to the system prompt or the user message, before the actual document:
    import json

    # example_N_input is the raw fixture text; example_N_output is the
    # hand-written dict from step 3. The f prefix is what interpolates them.
    FEWSHOT_PROMPT = f"""
    Here are example invoices and their correct extractions.
    
    Example 1 (tabular layout):
    INPUT:
    {example_1_input}
    OUTPUT:
    {json.dumps(example_1_output)}
    
    Example 2 (prose-format totals):
    INPUT:
    {example_2_input}
    OUTPUT:
    {json.dumps(example_2_output)}
    
    Now extract this invoice using the same approach:
    """
    Two well-chosen examples typically beat six average ones.
  5. Re-run on the same 10 fixtures. Don't change anything else — same model, same schema, same documents. Just the system prompt is different.
  6. Compute the new empty-field rate. Compare to baseline. Expected: a meaningful drop — typically 30-60% reduction on the hard formats specifically. If the headline rate is roughly unchanged, the gains are probably concentrated in the formats you exemplified, with no effect on the others; that's still a win, just narrower than you might have hoped. If improvement is <5pp, your examples don't cover the actual failure modes — read which fields are still empty in the new run and pick more targeted examples.
  7. Verify the diminishing-returns curve. If you have time, add two more examples (now four total) and re-run. Then six. Plot empty-rate as you add examples. Expected shape: steep drop from 0→2 examples, modest from 2→4, near-flat 4→6+. That curve is the lesson — few-shot examples are dramatically cost-effective at 2-4, marginal beyond. Domain 4's "the extractor doesn't handle <format>" questions almost always reward "add 2-4 few-shot examples targeting that format" over more aggressive fine-tuning answers.
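Steps 1 and 6 mix two units, a relative reduction ("30-60%") and a percentage-point threshold ("<5pp"), so it helps to compute both from the same runs. The sketch below is illustrative only: the function names are hypothetical and the before/after extractions are made-up four-field records, not scaffold output.

```python
def empty_field_rate(extractions: list[dict]) -> float:
    """Fraction of all (document, field) slots that came back None or []."""
    slots = sum(len(doc) for doc in extractions)
    empties = sum(
        1 for doc in extractions for value in doc.values()
        if value is None or value == []
    )
    return empties / slots if slots else 0.0


def compare(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point drop, relative reduction) between two rates."""
    pp_drop = (before - after) * 100
    relative = (before - after) / before if before else 0.0
    return pp_drop, relative


# Made-up runs over the same two documents, before and after adding examples.
baseline_run = [
    {"vendor_name": "A", "invoice_date": None, "total": None, "payment_terms": "net30"},
    {"vendor_name": "B", "invoice_date": "2026-04-12", "total": 10.0, "payment_terms": None},
]
fewshot_run = [
    {"vendor_name": "A", "invoice_date": "2026-04-01", "total": None, "payment_terms": "net30"},
    {"vendor_name": "B", "invoice_date": "2026-04-12", "total": 10.0, "payment_terms": "net30"},
]

before = empty_field_rate(baseline_run)   # 3 of 8 slots empty
after = empty_field_rate(fewshot_run)     # 1 of 8 slots empty
pp_drop, relative = compare(before, after)
print(f"{before:.1%} -> {after:.1%}: {pp_drop:.1f}pp drop, {relative:.0%} relative reduction")
```

Running the same two functions per format cluster, rather than over the whole fixture set, is what reveals whether the gains are concentrated in the formats you exemplified.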
5
Self-correction validators

Have the extractor return both calculated_total and stated_total; flag discrepancies. Add a conflict_detected boolean for inconsistent source data.

  1. Why dual-totals. A single "total" field hides whether the model summed the line items or just copied the printed number. By asking for both — the stated total and an independently calculated one — you give your code a deterministic conflict-detection signal. The model can hallucinate one of these; it has a hard time hallucinating both consistently.
  2. Extend the schema with the dual-totals fields.
    class Invoice(BaseModel):
        ...
        stated_total: float                       # what the document claims
        calculated_total: float                   # sum of line_items computed by the model
        conflict_detected: bool = False
        conflict_details: Optional[str] = None
    Regenerate the JSON schema after this change. The model now knows it must produce two totals.
  3. Update the tool description to clarify what each field means. Without this, the model will often just compute one number and write it into both fields. Be explicit:
    "description": (
        "Extract fields from an invoice. "
        "stated_total = the total printed in the document (verbatim). "
        "calculated_total = the sum of line_items[].quantity * unit_price, "
        "computed by you INDEPENDENTLY of the stated total. "
        "These MUST be computed separately even if they happen to match. "
        "Do NOT copy stated_total into calculated_total."
    )
    The "INDEPENDENTLY" clause is the load-bearing word. Without it the model will shortcut.
  4. Implement the semantic validator in src/validate.py. The scaffold's Invoice schema exposes both stated_total (verbatim from the document) and calculated_total (model sums line items independently). Your validator just compares them — the dual-totals pattern means the model has to hallucinate both consistently to fool the check, which it rarely does. Tolerance of $0.05 accounts for floating-point and rounding noise; anything bigger is a real conflict.
    # src/validate.py
    EPSILON = 0.05
    
    def validate_totals_match(inv: Invoice) -> str | None:
        if inv.stated_total is None or inv.calculated_total is None:
            return None   # not enough data to check; not our job to flag
        diff = inv.stated_total - inv.calculated_total
        if abs(diff) > EPSILON:
            return (f"stated_total={inv.stated_total:.2f} differs from "
                    f"calculated_total={inv.calculated_total:.2f} (diff={diff:+.2f})")
        return None
    Notice the validator runs in Python, not in the model's head. That's the whole point — deterministic checks live in code. The scaffold's extract.py already calls validate_totals_match after extraction and prints FLAG: ... or OK; you just need to fill in the body. Re-run fixtures/invoice_007_bad_total.txt after implementing — you should see FLAG: stated_total=185.50 differs from calculated_total=175.50 (diff=+10.00) (exact numbers depend on the planted fixture).
  5. Use the planted bad-total fixture. fixtures/invoice_007_bad_total.txt deliberately has line items that don't sum to the printed total. (If you're building from scratch, take a clean invoice and tweak one number to introduce a $5-ish discrepancy.)
  6. Run extraction and inspect. python -m src.extract fixtures/invoice_007_bad_total.txt. Expected: the extraction succeeds (Pydantic validation passes — the schema doesn't catch math errors); then validate_totals_match prints the FLAG line comparing stated_total vs calculated_total. In your pipeline, this row routes to a "needs human review" queue, not to the happy-path output bucket. The scaffold also sets conflict_detected=True (with a populated conflict_details) when the model itself notices the contradiction during extraction — that's a second signal you can OR with the validator's flag.
  7. Verify on a clean fixture for the false-positive check. Run a known-good invoice. Expected: conflict_detected == False, and the two totals agree to within the validator's $0.05 tolerance (on a clean invoice they are usually identical). If you get false positives (clean invoices flagged as conflicts), either your model is doing math wrong or the tolerance is too tight for the rounding in your documents. Loosening the tolerance is a stopgap at best; a model that can't add four line items reliably is a different problem. Semantic validation runs in code, not in the model's head; that's the entire reason it works. Domain 4's Pattern A ("totals are sometimes wrong") and Pattern B ("output is wrong") both reward this exact pattern: dual signals + deterministic comparison + a conflict flag for downstream routing.
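The routing described in step 6 (OR the validator's flag with the model's own conflict_detected) can be sketched as below. Everything here is a hedged illustration on plain dicts rather than the scaffold's Invoice: route and line_item_sum are hypothetical names, all three totals inputs are assumed present, and recomputing the line-item sum in code is an optional third signal that the scaffold does not include.

```python
from decimal import Decimal

EPSILON = Decimal("0.05")  # same tolerance as the validator in step 4


def line_item_sum(line_items: list[dict]) -> Decimal:
    # Decimal(str(x)) sidesteps binary floating-point drift on currency values.
    return sum(
        (Decimal(str(li["quantity"])) * Decimal(str(li["unit_price"])) for li in line_items),
        Decimal("0"),
    )


def route(inv: dict) -> str:
    """Return 'review' if any conflict signal fires, else 'ok'."""
    stated = Decimal(str(inv["stated_total"]))
    model_calc = Decimal(str(inv["calculated_total"]))
    code_calc = line_item_sum(inv["line_items"])
    signals = [
        abs(stated - model_calc) > EPSILON,     # the dual-totals check
        abs(model_calc - code_calc) > EPSILON,  # did the model add correctly?
        bool(inv.get("conflict_detected")),     # the model's own flag
    ]
    return "review" if any(signals) else "ok"


clean = {
    "stated_total": 68.48,
    "calculated_total": 68.48,
    "line_items": [
        {"quantity": 2, "unit_price": 12.99},
        {"quantity": 5, "unit_price": 8.50},
    ],
    "conflict_detected": False,
}
bad_total = {**clean, "stated_total": 78.48}  # printed total is off by 10.00
print(route(clean), route(bad_total))
```

Keeping the signals in a list makes it easy to log which one fired, which is useful when the review queue needs a reason alongside the flag.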
6
Batch processing

Submit 100 documents via the Message Batches API with a custom_id per request. Calculate the submission cadence to guarantee a 30-hour SLA given the 24-hour processing window. Resubmit any failures by custom_id, chunking ones that exceeded context limits.

  1. Don't open batch.py until the sync extractor is reliable. Most batch failures come from debugging two layers at once — schema bugs and batch bugs look identical from a CSV. Confirm extract.py handles 5/5 documents cleanly in the synchronous path first. The Message Batches API has no multi-turn tool calling and no retry loop, so all the work from recipes 1-5 must be solid before you batch anything.
  2. Build the batch request list. Each entry has a custom_id (your handle for resubmission) and params matching the shape of messages.create. Make custom_id deterministic from the document — typically a hash of the file path or the document ID — so resubmission can target specific failures.
    import hashlib
    from glob import glob
    def doc_id(path: str) -> str:
        return hashlib.sha256(path.encode()).hexdigest()[:12]
    
    requests = []
    for path in glob("fixtures/*.txt"):
        with open(path) as f:
            text = f.read()
        requests.append({
            "custom_id": doc_id(path),
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 2048,
                "tools": tools,
                "tool_choice": {"type": "tool", "name": "extract_invoice"},
                "messages": [{"role": "user", "content": f"Extract: {text}"}],
            },
        })
    If custom_id is non-deterministic (e.g., a UUID generated at submission time), you can't resubmit a specific failure later — you'd have to re-run everything.
  3. Submit the batch. The API returns a batch ID immediately; processing happens asynchronously and can take up to 24 hours.
    batch = client.messages.batches.create(requests=requests)
    print(f"submitted batch {batch.id} with {len(requests)} requests")
    Save the batch ID to disk — you'll need it to poll.
  4. Poll until completion. Production systems use webhooks; the lab can poll. Don't poll faster than every 30-60 seconds; the API will rate-limit aggressive pollers.
    import time
    while True:
        b = client.messages.batches.retrieve(batch.id)
        print(f"{b.processing_status}: {b.request_counts}")
        if b.processing_status == "ended":
            break
        time.sleep(60)
    Track request_counts.processing / succeeded / errored — they're your live progress bar.
  5. Stream results and validate each one. The results() generator yields one entry per custom_id. Map back to your documents and run each output through Pydantic.
    id_to_path = {doc_id(p): p for p in glob("fixtures/*.txt")}
    output = {}
    failures = []
    for result in client.messages.batches.results(batch.id):
        cid = result.custom_id
        path = id_to_path[cid]
        if result.result.type == "succeeded":
            msg = result.result.message
            tu = next(b for b in msg.content if b.type == "tool_use")
            try:
                output[path] = Invoice(**tu.input)
            except ValidationError as e:
                failures.append((path, cid, "validation", str(e)))
        else:
            failures.append((path, cid, result.result.type, "see batch"))
  6. Compute the SLA math explicitly. The batch API processing window is up to 24 hours. If your customer SLA is 30 hours end-to-end, you have a 6-hour buffer on top of the 24-hour ceiling, and that buffer must cover: the time a document waits before its batch is submitted + failure analysis + chunk-and-retry on any context-length failures + final validation. Write this down; it's a Pattern G question on the exam, and the answer is usually "submit at least 24 hours before the SLA deadline; reserve the rest for failure handling," not "submit and pray."
  7. Handle the three failure classes by custom_id. Walk your failures list and triage:
    • Context-length exceeded (long documents): chunk into 4-6 sections, give each its own custom_id like {orig_id}-chunk-{n}, submit a new batch with just the chunks, merge outputs in your code.
    • Transient API errors: resubmit unchanged with the same custom_id in a follow-up batch. Same input, different luck.
    • Validation errors: these usually mean a model output that didn't match the schema; you can't fix them in batch (no retry loop available), so flag for the sync retry path from recipe 3.
  8. Verify end-to-end. You should be able to: (a) submit 100 documents in one batch; (b) identify failures by custom_id from the results stream; (c) resubmit ONLY the failures in a follow-up batch, not the whole set; (d) produce a final dataset with 100/100 coverage. Quote the wall-clock time of the run: that's the number your SLA math is built on. Domain 4 Pattern G ("use Message Batches for both workflows?") rewards knowing that batches are right for high-volume, latency-tolerant extraction, and wrong for anything that needs an immediate in-request retry loop or multi-turn tool calls.
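The SLA arithmetic from step 6 and the triage from step 7 can be sketched in a few lines of Python. This is a minimal sketch, not part of the lab scaffold: latest_submit_time, triage, and chunk_ids are hypothetical helper names, and the kind labels "context_length", "transient", and "validation" are an assumption about how you classify each failure after inspecting its error.

```python
from datetime import datetime, timedelta

# Upper bound on batch processing time, as stated in step 6.
BATCH_CEILING = timedelta(hours=24)


def latest_submit_time(deadline: datetime, failure_buffer: timedelta) -> datetime:
    """Latest submission time that still leaves the full ceiling plus a failure buffer."""
    return deadline - BATCH_CEILING - failure_buffer


def triage(failures):
    """Route (path, custom_id, kind, detail) tuples to the three handling paths.

    Assumption: `kind` was already normalised by your own error inspection to
    "context_length", "transient", or "validation". Any other value raises KeyError.
    """
    route = {
        "context_length": "chunk",      # split the document and submit the chunks
        "transient": "resubmit",        # same input, same custom_id, follow-up batch
        "validation": "sync_retry",     # hand off to the synchronous retry path
    }
    plan = {"chunk": [], "resubmit": [], "sync_retry": []}
    for _path, custom_id, kind, _detail in failures:
        plan[route[kind]].append(custom_id)
    return plan


def chunk_ids(custom_id: str, n_chunks: int):
    """Build per-chunk ids in the {orig_id}-chunk-{n} shape used in step 7."""
    return [f"{custom_id}-chunk-{n}" for n in range(1, n_chunks + 1)]
```

With a 30-hour SLA, passing timedelta(hours=6) as the buffer reproduces the numbers in step 6; shrinking the buffer moves the latest submission time later, which is the trade-off worth writing down.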
7
Confidence calibration & stratified sampling

Output field-level confidence. Calibrate thresholds against a small labeled validation set. Stratify accuracy measurement by document type and field. Verify your "97% overall" isn't masking a 60% on one segment.

  1. Why model-emitted confidence beats nothing — but only barely. A confidence number from the model is not calibrated by default: a 0.9 from the model doesn't mean 90% accuracy in practice. You have to calibrate against ground truth. That's what makes this recipe long: most of it is labeling, not coding. Domain 5's most common exam trap is treating an uncalibrated confidence as if it were a probability — don't.
  2. Extend the schema with paired confidence fields. Add a {field}_confidence for each meaningful field. Don't add it to derived fields like calculated_total — confidence there is meaningless.
    class Invoice(BaseModel):
        ...
        vendor_name: str
        vendor_name_confidence: float        # 0.0 to 1.0
        invoice_date: Optional[date] = None
        invoice_date_confidence: float
        payment_terms: Literal[...]
        payment_terms_confidence: float
        ...
  3. Tell the model what the confidence number means. Without an anchored definition the model emits noise. Give it three concrete grounding points:
    "description": (
        "... existing description ... "
        "For each field, include {field}_confidence on a 0.0-1.0 scale: "
        "  1.0 = field is stated verbatim in the document. "
        "  0.5 = field is inferred from context but not explicitly stated. "
        "  0.2 = field is a guess based on conventions, not the source. "
        "  0.0 = field is fabricated. Prefer null + 0.0 over an invented value."
    )
    The "prefer null + 0.0" clause is the relief valve — without it the model will sometimes produce a fabricated value with high confidence rather than admit it doesn't know.
  4. Hand-label 20-30 documents as your validation set. This is the lab's slow part. Take a sample of fixtures spanning your formats. For each (field, document) pair, mark correct, incorrect, or unknown (you can't verify either way).
    # ground_truth.csv
    file,field,truth,is_correct
    fixtures/clean_01.txt,vendor_name,ABC Office,
    fixtures/clean_01.txt,invoice_date,2026-04-12,
    fixtures/clean_01.txt,total,68.48,
    ... 50-100 rows ...
    Do this once, carefully. Bad labels poison everything below. If a document is genuinely ambiguous, mark the field unknown and exclude it from accuracy metrics.
  5. Calibrate the threshold against ground truth. Walk confidence cutoffs from 0.1 to 0.9 in steps. At each cutoff, compute: (a) precision — among predictions with confidence ≥ cutoff, what fraction is correct? (b) recall — what fraction of all correct predictions do you keep? Plot the trade-off. Pick the threshold where precision crosses your tolerance (often 0.95). Below it: route to human review.
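    # predictions: one row per (document, field), unknown rows already excluded
    total_correct = predictions.is_correct.sum()
    for cutoff in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
        kept = predictions[predictions.confidence >= cutoff]
        precision = kept.is_correct.mean()                 # correct among kept
        recall = kept.is_correct.sum() / total_correct     # correct kept / all correct
        print(f"cutoff={cutoff} precision={precision:.2f} recall={recall:.2f}")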
  6. Stratify accuracy by document type AND field. A single "overall accuracy" number lies — it hides the worst segments. Build a small table cross-tabbing format × field. Common shape:
                    | vendor_name | invoice_date | total | payment_terms
    tabular         |  98%        | 92%          | 99%   | 84%
    prose-format    |  91%        | 76%          | 88%   | 51%
    footer-summary  |  96%        | 88%          | 94%   | 73%
  7. Find the worst stratum and act on it. Look at the table. The worst cell is your real problem: in the example above, payment_terms on prose-format invoices at 51% is the whole story. Don't celebrate a 97% headline; the headline mostly reflects easy formats. Domain 5's calibration questions reward "stratified analysis caught X bad segment that the aggregate hid" answers. The fix is targeted (better few-shot examples for prose-format payment terms, or routing prose-format invoices through a separate prompt), not retraining everything.
  8. Verify the muscle memory. You should be able to quote: (a) the calibrated confidence threshold, (b) the headline accuracy, and (c) at least one stratum that performs measurably worse. If everything is uniformly excellent, your validation set lacks variety — your real production data is messier than your fixtures. Plant more edge cases (worn prose layouts, multi-currency invoices, foreign-language vendors) and re-label. Stratification is only useful when the strata can fail differently.
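The threshold sweep from step 5 and the cross-tab from step 6 can also be sketched without pandas. This is a minimal sketch under stated assumptions: ROWS is made-up toy data, not lab output, and sweep, stratified_accuracy, and worst_cell are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical labelled rows: (format, field, confidence, is_correct).
# Rows labelled "unknown" are assumed to have been dropped already.
ROWS = [
    ("tabular", "vendor_name", 0.95, True),
    ("tabular", "payment_terms", 0.90, True),
    ("prose-format", "vendor_name", 0.80, True),
    ("prose-format", "payment_terms", 0.85, False),
    ("prose-format", "payment_terms", 0.40, False),
    ("footer-summary", "payment_terms", 0.60, True),
]


def sweep(rows, cutoffs):
    """Return (cutoff, precision, recall) per cutoff; None where undefined."""
    total_correct = sum(1 for r in rows if r[3])
    results = []
    for cutoff in cutoffs:
        kept = [r for r in rows if r[2] >= cutoff]
        kept_correct = sum(1 for r in kept if r[3])
        precision = kept_correct / len(kept) if kept else None
        recall = kept_correct / total_correct if total_correct else None
        results.append((cutoff, precision, recall))
    return results


def stratified_accuracy(rows):
    """Accuracy per (format, field) cell."""
    tally = defaultdict(lambda: [0, 0])  # cell -> [correct, total]
    for fmt, field, _confidence, ok in rows:
        tally[(fmt, field)][0] += int(ok)
        tally[(fmt, field)][1] += 1
    return {cell: correct / total for cell, (correct, total) in tally.items()}


def worst_cell(rows):
    """The (format, field) cell with the lowest accuracy."""
    accuracy = stratified_accuracy(rows)
    return min(accuracy, key=accuracy.get)
```

On the toy rows the aggregate accuracy is 4 of 6, while worst_cell points at prose-format payment_terms, which is the pattern step 7 asks you to look for.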
What success looks like
  • python -m src.extract fixtures/invoice_00N.txt runs cleanly on at least 8 of the 10 fixtures that ship in the scaffold and returns a valid Invoice Pydantic model.
  • invoice_006_missing.txt: optional fields return null; flip vendor_name to required and re-run, watch the model fabricate a plausible value to satisfy the schema. That side-by-side is the lesson.
  • The validation-retry loop visibly resolves a format error on invoice_008_informal_date.txt (the model emits "03/14/26" on first try, ISO on retry). The same loop visibly fails to resolve when the source document genuinely lacks a required field, proving retry isn't magic.
  • invoice_007_bad_total.txt gets caught by your validate_totals_match and printed as FLAG: ... rather than passing through silently. invoice_010_conflict.txt additionally trips the model's own conflict_detected flag during extraction.
  • Your batch run on 100 documents (recipe 6) handles failures by custom_id: failed documents are identifiable, resubmittable, and you can show the SLA math (24-hour processing window vs your target).
  • Stratified accuracy measurement reveals at least one segment that performs measurably worse than the aggregate. If everything is uniformly excellent, you didn't plant enough variety in the corpus.

Optional Lab 5: A CI/CD review pipeline

≈ 1 hour · domains 3, 4

Goal: Wire Claude into a GitHub Actions workflow that reviews pull requests, posts structured findings as inline comments, and dedupes on follow-up commits. Optional — skip if you're tight on time — but the muscle memory makes the Scenario 5 exam questions feel familiar instead of theoretical.

Before you start
  • Finish Lab 4 first (or at least be comfortable with structured JSON output via --json-schema). The CI flow is mostly orchestration around the same primitive.
  • A GitHub repo you own. Don't experiment on a real team's CI. A throwaway repo with two or three source files and one open PR is plenty.
  • Enable Actions on the repo (Settings → Actions → General → Allow all actions).
  • Add your API key as a repo secret. Settings → Secrets and variables → Actions → New repository secret. Name it ANTHROPIC_API_KEY. Do not commit it to a file.
  • Local gh CLI authed (gh auth status green) — you'll use it to open the test PR and watch comments arrive without leaving the terminal.
  • Reference docs: the Claude Code guide sections on headless mode (-p) and --output-format json, plus GitHub's pulls/comments REST reference.
1
Define the findings schema

Keep it tight — five or six fields max. The schema is what makes Claude's review output addressable instead of a wall of prose.

  1. Why the schema comes first. Without a schema, a reviewer's output is paragraphs of prose that your code can't act on. With one, every finding is an object you can route, dedupe, count, gate on. Every subsequent recipe assumes this schema exists, so getting it right now saves time later. Keep the field list tight — five or six fields max. More fields = more model variance = more brittle pipeline.
  2. Create .claude/review-schema.json. The schema lives in the repo so the workflow can find it. Six fields cover essentially every code-review use case: file, line, severity, category, message, plus an optional suggestion for proposed fixes. Constrain severity with an enum — that's what makes the gating step (recipe 5) possible.
    mkdir -p .claude
    cat > .claude/review-schema.json <<'EOF'
    {
      "type": "object",
      "properties": {
        "findings": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "file":       { "type": "string" },
              "line":       { "type": "integer", "minimum": 1 },
              "severity":   {
                "type": "string",
                "enum": ["blocker", "warning", "nit"]
              },
              "category":   { "type": "string" },
              "message":    { "type": "string" },
              "suggestion": { "type": ["string", "null"] }
            },
            "required": ["file", "line", "severity", "category", "message"]
          }
        }
      },
      "required": ["findings"]
    }
    EOF
    Notice findings is always an array (potentially empty). A clean PR returns {"findings": []}, not null. That distinction matters in recipe 3 — an empty array is "I looked and found nothing"; null would be "I didn't look."
  3. Commit the schema. The workflow file you'll write next references it by repo path, so this needs to be in the repo.
    git add .claude/review-schema.json
    git commit -m "add review schema"
    Don't push yet — the workflow file goes in the next recipe, and you want both in place before the first PR triggers anything.
  4. Sanity-check that the schema parses. Even small syntax errors here cause silent downstream failures. Run:
    jq . .claude/review-schema.json > /dev/null && echo "OK" || echo "broken"
    You want "OK". If you get "broken", jq will print the line and column of the parse error.
  5. Verify with a real JSON-schema validator. Parsing is necessary but not sufficient — you also want to know the schema itself is valid JSON Schema. Use one of:
    • npx --yes ajv-cli compile -s .claude/review-schema.json — should print "valid".
    • Any editor with JSON Schema linting (VS Code with the JSON extension picks it up automatically if you add "$schema" to the top).
    If the schema is silently broken, every downstream step appears to work for a while and then produces nonsense findings. Catch it here.
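Once the schema itself is valid, it helps to check sample payloads against it before any workflow depends on them. The sketch below is a hand-written stand-in that mirrors the required fields and the severity enum from the schema above using only the standard library; it is not a JSON Schema validator, and check_findings is a hypothetical name.

```python
SEVERITIES = ("blocker", "warning", "nit")
REQUIRED = ("file", "line", "severity", "category", "message")


def check_findings(payload):
    """Return a list of problems; an empty list means the payload fits the shape."""
    if not isinstance(payload, dict) or not isinstance(payload.get("findings"), list):
        return ["top level must be an object with a 'findings' array"]
    problems = []
    for i, finding in enumerate(payload["findings"]):
        where = f"findings[{i}]"
        if not isinstance(finding, dict):
            problems.append(f"{where}: not an object")
            continue
        for key in REQUIRED:
            if key not in finding:
                problems.append(f"{where}: missing {key}")
        for key in ("file", "category", "message"):
            if key in finding and not isinstance(finding[key], str):
                problems.append(f"{where}: {key} must be a string")
        line = finding.get("line")
        # bool is a subclass of int in Python, so reject it explicitly
        if "line" in finding and (
            isinstance(line, bool) or not isinstance(line, int) or line < 1
        ):
            problems.append(f"{where}: line must be an integer >= 1")
        if "severity" in finding and finding["severity"] not in SEVERITIES:
            problems.append(f"{where}: severity must be one of {SEVERITIES}")
        suggestion = finding.get("suggestion")
        if suggestion is not None and not isinstance(suggestion, str):
            problems.append(f"{where}: suggestion must be a string or null")
    return problems
```

A clean review, {"findings": []}, returns no problems, while {"findings": None} is rejected at the top level, which is the "looked and found nothing" versus "didn't look" distinction from step 2.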
2
Write the workflow file

Triggers on pull_request, computes the diff, asks Claude for structured findings.

  1. Map the workflow shape. The workflow is a single job with four stages in sequence: check out the PR head with full history, compute the diff against the base branch, run Claude in headless mode with your schema, then post or gate based on output. This recipe covers the first three; recipes 3-5 add posting, dedup, and gating on top.
  2. Create .github/workflows/claude-review.yml. Triggers fire on PR open and on every subsequent push (synchronize). The permissions block is small — write to PR comments, read repo contents, nothing else.
    mkdir -p .github/workflows
    cat > .github/workflows/claude-review.yml <<'EOF'
    name: Claude Review
    on:
      pull_request:
        types: [opened, synchronize]
    
    permissions:
      pull-requests: write
      contents: read
    
    jobs:
      review:
        runs-on: ubuntu-latest
        steps:
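          - uses: actions/checkout@v4
            with:
              # on pull_request the default checkout is a merge commit; pin to the PR head
              ref: ${{ github.event.pull_request.head.sha }}
              fetch-depth: 0   # full history, needed for the diff below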
    
          - name: Compute diff against base
            run: |
              git diff origin/${{ github.base_ref }}...HEAD > diff.txt
              wc -l diff.txt
    
          - name: Install Claude Code
            run: npm i -g @anthropic-ai/claude-code
    
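          - name: Run Claude in headless mode
            env:
              ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
            run: |
              # Assumption to verify against your CLI version: --json-schema takes the
              # schema as an inline JSON string, and --output-format json wraps the
              # result in metadata with the payload under .structured_output.
              claude -p --output-format json \
                --json-schema "$(cat .claude/review-schema.json)" \
                < diff.txt > raw.json
              jq '.structured_output' raw.json > findings.json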
    
          - name: Show findings (debug)
            run: cat findings.json
    EOF
  3. Why fetch-depth: 0 is non-negotiable. By default actions/checkout@v4 fetches only a single commit. Without full history you can't run git diff origin/main...HEAD because origin/main isn't in the local clone. Setting fetch-depth: 0 pulls the full history. Add it to checkout and don't think about it again.
  4. Why git diff base...HEAD uses three dots. For git diff, three dots compares HEAD against the merge base of base and HEAD, i.e., only what the PR itself changed. Two dots (base..HEAD) compares the two branch tips directly, so anything that landed on base after the PR branched shows up in the diff as if the PR had reverted it. For code review, three-dot is what you want.
  5. The cat findings.json step is a deliberate diagnostic. Leave it in for the first few runs so you can read the structured output in the Actions log. Remove it once the pipeline is stable; otherwise you're logging review findings publicly in the workflow log.
  6. Commit and push. Don't open a PR yet. You want the workflow on main first so every PR that targets main picks it up. For pull_request events GitHub runs the workflow file as it exists in the PR's merge ref, so a PR whose branch edits or deletes the workflow runs that edited version, not the one on main; landing the workflow on main before you open test PRs keeps the version under test unambiguous.
    git add .github/workflows/claude-review.yml
    git commit -m "add Claude review workflow"
    git push origin main
  7. Verify the workflow is installed. Open the repo's Actions tab on GitHub. Expected: "Claude Review" appears in the left sidebar with "This workflow has no runs yet." If you see a YAML parse error banner, fix it immediately: GitHub doesn't run broken workflows but also doesn't loudly tell you they're broken once they're in main. If the secret isn't set, you won't know until the first PR fires and the Claude step fails with an authentication error.
3
Post findings as inline review comments

Inline comments anchored to lines are dramatically more useful than a single bulk "review" comment; the exam's Scenario 5 hinges on you knowing why.

  1. Why inline beats bulk. A single bulk comment containing "12 findings, see list" forces every reviewer to read the list, find each location, scroll to it, mentally cross-reference back. Inline comments live where the code lives — readers see them in the diff, can react in context, and the conversation thread per finding stays separate. The exam's Pattern A questions about "how to make AI review more actionable" reliably reward inline comments + structured findings as the answer.
  2. Append a posting step to the workflow. The step reads findings.json, iterates, and posts each finding via the GitHub REST API. gh api wraps the REST calls; jq walks the JSON.
          - name: Post findings as inline comments
            env:
              GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
            run: |
              COMMIT_SHA=$(git rev-parse HEAD)
              jq -c '.findings[]' findings.json | while read -r finding; do
                FILE=$(echo "$finding" | jq -r .file)
                LINE=$(echo "$finding" | jq -r .line)
                SEV=$(echo "$finding" | jq -r .severity)
                CAT=$(echo "$finding" | jq -r .category)
                MSG=$(echo "$finding" | jq -r .message)
                SUG=$(echo "$finding" | jq -r '.suggestion // ""')
    
                BODY="**$SEV** ($CAT): $MSG"
                if [ -n "$SUG" ]; then
                  BODY="$BODY"$'\n\n'"_Suggestion:_ $SUG"
                fi
    
                gh api \
                  repos/${{ github.repository }}/pulls/${{ github.event.number }}/comments \
                  --method POST \
                  -f commit_id="$COMMIT_SHA" \
                  -f path="$FILE" \
                  -F line="$LINE" \
                  -f side=RIGHT \
                  -f body="$BODY"
              done
    Notable choices: side: RIGHT means "anchor on the new code side of the diff" (where the finding actually lives); commit_id must be the PR HEAD SHA, not main; jq -c outputs one JSON object per line for the shell loop.
  3. Commit and push the workflow update. Push to main as before so that PRs targeting main pick up the new step.
    git add .github/workflows/claude-review.yml
    git commit -m "post findings inline"
    git push origin main
  4. Create a test PR to fire the workflow. Make a tiny branch, plant 2-3 issues you'd expect Claude to flag (a logged secret, an unused import, a misnamed variable), open the PR with gh:
    git checkout -b lab5-test-pr
    mkdir -p demo
    echo 'console.log("DEBUG_TOKEN=" + process.env.DEBUG_TOKEN);' > demo/leaky.js
    git add . && git commit -m "test: planted blocker"
    git push -u origin lab5-test-pr
    gh pr create --fill
    Watch the Actions tab; the run should start within a few seconds.
  5. Inspect the PR's Files Changed tab. Expected: each finding appears as a separate inline comment on its line, with the severity tag in bold and the category in parentheses. If the suggestion field was present, you see a "_Suggestion:_ ..." line underneath. The PR conversation looks like a senior engineer left targeted notes, not a wall of text.
  6. Most-common failure modes — and how to spot them.
    • Comments don't appear at all → check the workflow log. The most-likely message is HTTP 422: pull_request_review_thread.line must be part of the diff. Cause: Claude returned a line number that's not in the actual diff. Either the diff format the model received was wrong, or your tool description didn't constrain output to lines present in the diff. Fix the prompt.
    • Comments appear at the wrong lines → your commit_id is wrong. It MUST be the PR's HEAD SHA (computed inside the action with git rev-parse HEAD), not main.
    • Some comments fail, others succeed → file path doesn't match what GitHub stores. Paths should be repo-relative, no leading slash, and case-sensitive.
  7. Verify by comparing to a no-inline baseline. Temporarily change the posting loop to post a single bulk comment via gh pr comment with all findings concatenated. Re-run on the same PR. Read both versions: the inline version shows up where the eye is already looking; the bulk version is a wall of "see line 42 in src/x" cross-references. The difference is the entire reason structured findings + inline anchoring is the production-grade pattern. Restore inline before continuing.
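The HTTP 422 failure mode in step 6 can be caught before posting by checking each finding's line against the diff itself. The sketch below parses a unified diff and keeps only findings whose (file, line) sits on the new side of some hunk. It is a minimal sketch that assumes plain git diff output with a/ and b/ path prefixes; commentable_lines and split_findings are hypothetical helper names, not part of the lab scaffold.

```python
import re

# Hunk header, e.g. "@@ -10,3 +10,4 @@"; a missing count means 1.
HUNK = re.compile(r"^@@ -\d+(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def commentable_lines(diff_text):
    """Map each file path to the new-side line numbers that appear in the diff."""
    lines_by_path = {}
    path = None
    new_line = 0
    old_left = new_left = 0  # lines still expected in the current hunk
    for raw in diff_text.split("\n"):
        if old_left <= 0 and new_left <= 0:
            # Between hunks: look for a file header or the next hunk header.
            if raw.startswith("+++ "):
                target = raw[4:].strip()
                path = None if target == "/dev/null" else target.removeprefix("b/")
                continue
            match = HUNK.match(raw)
            if match:
                old_left = int(match.group(1)) if match.group(1) is not None else 1
                new_line = int(match.group(2))
                new_left = int(match.group(3)) if match.group(3) is not None else 1
            continue
        if raw.startswith("\\"):  # "\ No newline at end of file"
            continue
        if raw.startswith("-"):   # removed line: exists only on the old side
            old_left -= 1
            continue
        if not raw.startswith("+"):
            old_left -= 1         # context line: counts on both sides
        if path is not None:
            lines_by_path.setdefault(path, set()).add(new_line)
        new_line += 1
        new_left -= 1
    return lines_by_path


def split_findings(findings, diff_text):
    """Separate findings that can be anchored inline from those that cannot."""
    valid = commentable_lines(diff_text)
    postable, rejected = [], []
    for finding in findings:
        ok = finding["line"] in valid.get(finding["file"], set())
        (postable if ok else rejected).append(finding)
    return postable, rejected
```

Rejected findings are worth logging rather than dropping silently: a steady stream of them usually means the prompt is not constraining the model to lines present in the diff.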
4
Dedupe on follow-up commits

Naïvely, every workflow run re-posts every finding. Fix that: before posting, look up what's already commented and only post new findings.

  1. Why naïve posting is a disaster. Every push to a PR triggers synchronize, which re-runs the workflow, which re-posts every finding, including ones already on the PR. After three runs, a PR with 5 findings has 15 comments, most duplicates. Reviewers stop reading. The fix is dedup keyed on (file, line, finding fingerprint).
  2. Fetch existing comments before the posting loop. Insert this step before the post loop in the workflow:
          - name: Fetch existing review comments
            env:
              GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
            run: |
              # --jq runs once per page, so emit objects and let jq -s merge them into one array
              gh api \
                repos/${{ github.repository }}/pulls/${{ github.event.number }}/comments \
                --paginate \
                --jq '.[] | {path: .path, line: .line, body: .body}' \
                | jq -s '.' > existing.json
              echo "Existing comments: $(jq length existing.json)"
    The --paginate flag matters — on busy PRs the comment list spans multiple pages. Without it you only see the first 30.
  3. Pick a dedup key. The simplest robust key combines (path, line, first-80-chars-of-body). Two findings at the same line are rare in practice; if your domain has them, include the category to disambiguate. Because the body starts with the severity and category, the 80-char prefix ignores changes late in the body (a reworded suggestion, say) but not rewording near the start: "Variable is unused" and "Variable foo is unused" produce different keys and will post twice.
  4. Add the dedup check to the posting loop. Modify the loop body so the post call is skipped when the key matches an existing comment.
    # inside the posting loop, before the gh api POST:
    # compare 80-character prefixes inside jq so both sides are cut the same way
    DUP=$(jq --arg p "$FILE" --arg l "$LINE" --arg b "$BODY" \
      '[.[] | select(.path == $p and (.line|tostring) == $l and (.body|.[0:80]) == ($b|.[0:80]))] | length' \
      existing.json)
    if [ "$DUP" -gt 0 ]; then
      echo "skip duplicate: $FILE:$LINE"
      continue
    fi
  5. Test on the existing PR. The PR from recipe 3 already has comments. Run the workflow again on it (push an empty commit: git commit --allow-empty -m "retrigger" && git push). Expected: the new run posts zero comments because every finding's key already exists. Check the Actions log — you should see "skip duplicate" lines for each finding.
  6. Now test the regression case. Push a follow-up commit that fixes ONE planted issue (e.g., remove the leaky console.log) and adds ONE new issue (e.g., an unused variable). Expected: only the new finding posts as a comment; the surviving findings stay quiet; the fixed finding doesn't reappear. The PR's comment count grew by exactly 1, not by 4.
  7. Verify with the comment-count math. If you started with 4 findings → 4 inline comments. After push 1 (no changes), still 4. After push 2 (fix one, add one), 5. After push 3 (fix one more), 5 still (the fixed comment stays in the conversation history; no new post needed). If your count grows faster than that, the dedup key is wrong — either too strict (e.g., including a timestamp), too loose (matching unrelated comments), or you're not fetching existing comments before the loop. The dedup pattern is what makes "AI in the review loop" survive longer than one push. Without it, the noise drowns the signal by the third commit.
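The dedup rule and the comment-count math above can be checked offline before touching the workflow. This is a minimal sketch of the (path, line, 80-character body prefix) key from step 3; dedup_key and new_findings are hypothetical names, and existing entries without a line number are simply ignored, which is an assumption about how you want to treat comments that no longer anchor to a line.

```python
def dedup_key(path, line, body, prefix_len=80):
    """Key used to decide whether a finding has already been posted."""
    return (path, int(line), body[:prefix_len])


def new_findings(candidates, existing_comments):
    """Return the candidates whose key is not already present on the PR.

    Both arguments are lists of dicts with "path", "line", and "body".
    """
    seen = {
        dedup_key(c["path"], c["line"], c["body"])
        for c in existing_comments
        if c.get("line") is not None
    }
    fresh = []
    for candidate in candidates:
        key = dedup_key(candidate["path"], candidate["line"], candidate["body"])
        if key in seen:
            continue
        seen.add(key)  # also collapses duplicates within a single run
        fresh.append(candidate)
    return fresh
```

Replaying the sequence from step 7 (four findings, an unchanged push, then a push that fixes one and adds one) should yield 4, 0, and 1 new comments in turn.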
5
Gate the merge on severity

The combination of structured output + threshold-based gating is the production-ready shape of "AI in the review loop." This is the part the exam is really testing.

  1. Why a hard gate matters. Inline comments are advisory by default — reviewers can ignore them. Merging is reversible but expensive. A hard gate on severity == "blocker" is the production move: the workflow exits non-zero, GitHub marks the check failed, branch protection refuses to enable the Merge button. The model can't override the gate; only a human can downgrade the severity (and that decision lives in the PR conversation, audit-trail-friendly).
  2. Add the gating step at the end of the workflow. Count blockers. If >0, fail the run.
          - name: Fail on blockers
            run: |
              BLOCKER_COUNT=$(jq '[.findings[] | select(.severity == "blocker")] | length' findings.json)
              WARN_COUNT=$(jq '[.findings[] | select(.severity == "warning")] | length' findings.json)
              NIT_COUNT=$(jq '[.findings[] | select(.severity == "nit")] | length' findings.json)
              echo "blockers: $BLOCKER_COUNT  warnings: $WARN_COUNT  nits: $NIT_COUNT"
    
              if [ "$BLOCKER_COUNT" -gt 0 ]; then
                echo "::error::$BLOCKER_COUNT blocker(s) found — merge gated"
                exit 1
              fi
    The ::error:: prefix is a GitHub Actions workflow command: it surfaces the message as an error annotation on the run, not just as a log line.
  3. Mark the workflow as a required status check. In repo Settings → Branches → Branch protection rules, edit (or create) the rule for main. Under "Require status checks to pass before merging," search for the check and add it; it is listed by job name (review, shown on PRs as Claude Review / review) and only becomes searchable after it has run at least once. Save. Now any PR with a failing check has the Merge button disabled until the check turns green.
  4. Plant a blocker on a test PR. A logged secret is a clean choice — your prompt should tell the model that any code logging a secret-shaped string is a blocker. Push:
    git checkout -b lab5-blocker-test
    mkdir -p demo
    echo 'console.log("API_KEY=" + process.env.API_KEY);' > demo/leak.js
    git add . && git commit -m "test: blocker"
    git push -u origin lab5-blocker-test
    gh pr create --fill
    Within a minute the workflow should fire. Expected: the run fails, the check shows red, the PR page shows "Required status check is failing" with the Merge button disabled.
  5. Push the fix and watch the check recover. Edit demo/leak.js to remove the leak. Commit, push. The workflow re-runs; this time BLOCKER_COUNT is 0; the run exits 0; the check turns green; the Merge button enables. This is the production-grade shape: the gate is an automatable threshold, not a human gut call, and the recovery is visible in real time.
  6. Test all four states explicitly. The full matrix:
    • Clean PR → green check, no comments posted, mergeable.
    • Nit-only PR → green check, inline nits posted, mergeable.
    • Warning-only PR → green check, warnings posted, mergeable (warnings are advisory).
    • Blocker PR → red check, blockers posted, merge gated.
    Run one PR through each state. Take screenshots if you want a portfolio piece.
  7. Verify by quoting the exam-relevant pattern. You should be able to summarize, in two sentences: "Structured findings + an enum severity + a workflow that exits non-zero on the blocker tier + branch protection on that workflow is what makes AI review safe in production. Every other shape — bulk comments, free-form prose, advisory-only — fails open." The Scenario 5 exam questions about CI/CD review pipelines almost always boil down to this. If you can sketch the four-state matrix from memory, you've got it.
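The gate logic and the four-state matrix from step 6 are small enough to express as a function you can test without a PR. This is a minimal sketch mirroring the jq counts in the workflow step; gate is a hypothetical name, and only the blocker tier affects the exit code, as in the recipe.

```python
from collections import Counter


def gate(findings):
    """Return (exit_code, counts): non-zero only when a blocker is present."""
    counts = Counter(finding["severity"] for finding in findings)
    exit_code = 1 if counts["blocker"] > 0 else 0
    return exit_code, counts


# The four states from the matrix above, as inputs to the gate.
MATRIX = {
    "clean": [],
    "nit-only": [{"severity": "nit"}],
    "warning-only": [{"severity": "warning"}],
    "blocker": [{"severity": "blocker"}, {"severity": "nit"}],
}

if __name__ == "__main__":
    for name, findings in MATRIX.items():
        code, counts = gate(findings)
        print(name, "exit", code, dict(counts))
```

Keeping the decision in one pure function makes the point of the recipe concrete: the model supplies severities, but the pass or fail outcome is computed deterministically.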
What success looks like
  • Opening a PR triggers the workflow and inline comments appear on specific lines within a minute or two — not as a single bulk comment at the top.
  • Pushing a follow-up commit that fixes one finding does not re-post the other findings. The PR conversation stays readable.
  • A PR with a planted blocker-severity finding shows a failed check, and the merge button is disabled until the blocker is fixed. Fix and re-push; check goes green.
  • Running the workflow on a PR with no real issues produces a findings.json with an empty findings array and posts nothing. False-positive rate is something you can see, not just hope for.
  • You can describe, in one paragraph, why structured JSON + a schema beats free-form review prose for this use case. (Hint: every Scenario 5 question is some flavor of "the AI is producing output that's hard to act on — what would most effectively address this?")
Part 10

What is NOT on the exam

The exam guide is explicit about what's out of scope. Don't waste prep time on these:

  • Fine-tuning, training, or model weights
  • Authentication, billing, account management, OAuth flows, API key rotation
  • Deploying or hosting MCP servers (infrastructure, networking, container orchestration)
  • Constitutional AI, RLHF, safety-training methodology
  • Embedding models or vector database internals
  • Computer use (browser automation, desktop interaction)
  • Vision and image analysis
  • Streaming API implementation, server-sent events
  • Rate limiting, quotas, pricing math
  • Specific cloud provider configs (AWS, GCP, Azure)
  • Performance benchmarking, model comparison metrics
  • Prompt caching internals (beyond knowing it exists)
  • Token counting algorithms or tokenization specifics

This list is generous. If a prep resource is heavy on any of the above, it isn't aligned with this exam.

Part 11

Day-of strategy

Reading the question

  • The scenario is one of six. Recognize it in the first sentence. Then go to the question.
  • Identify which domain is being tested. Each scenario maps to a primary set of domains; the question will be in one of them.
  • Look for the "production data shows…" or "logs show…" framing. The setup usually contains the root cause if you read carefully. The right answer addresses what the logs say happened, not what you'd guess.
  • Apply the five judgments from the opening section: deterministic-when-required, model-driven-decisions, proportionate-first-response, isolated-subagent-context, structure-beats-summary.

Reading the answers

  • Eliminate the always-wrong distractors first. Sentiment routing, self-reported confidence as a routing signal, "use a bigger model," "suppress the error and return success," "terminate the workflow." If any of those appear, scratch them.
  • Eliminate over-engineered answers. Separate classifier model, ML pipeline, custom routing layer: these are usually wrong when a smaller mechanism would do. Unless the question explicitly says "we've tried the simple thing and it didn't work."
  • Eliminate "addresses the wrong layer" answers. If the problem is tool ordering, an answer about tool availability is wrong. If the problem is decomposition, an answer about the synthesis agent is wrong.
  • Of the remaining one or two, pick the smallest mechanism that fixes the root cause.

Time and pacing

  • There's no penalty for guessing. Answer everything. An unanswered question is scored as incorrect.
  • If a question is taking more than two minutes, mark and move on. Come back with fresh eyes.
  • Trust your first read on the pattern questions (Patterns A–G). Second-guessing usually moves you toward over-engineered distractors.

The week before

  • Re-read the six scenarios until you can recall each one cold.
  • Re-read the sample questions and explanations. The explanations teach the test's grammar.
  • Run through your Lab 1 and Lab 2 code one more time. The pattern questions become trivial when the code is fresh.
  • Take the practice exam if you have access. It's the highest-leverage hour you can spend.
Part 12

Closing

This exam tests one thing more than any other: do you have the disposition of someone who has shipped agentic systems to production? The candidate who has lived through a misrouted refund, a coordinator that decomposed too narrowly, or an extraction schema that hallucinated to satisfy a required field: that candidate already knows most of the answers.

The exam is generous to the practitioner and unforgiving to the memorizer. Build the labs. The questions will look like things you've already debugged. — TWD

If you internalize the five judgments, recognize the seven question patterns, and complete the four labs, the rest is reading the question carefully and matching the answer to the smallest mechanism that fixes the root cause. That's the test. Good luck.

Companion reading on the same site: Best Practices covers the model-agnostic disposition this exam tests. Claude Code covers the tool you'll use in Domain 3.