№ CertLearn With Darin · Study companion

Claude Certified Architect Foundations.

A study companion for Anthropic's first solution-architect exam. The concepts you have to truly understand, the patterns the questions reward, and a hands-on lab plan so you sit the exam with the muscle memory of someone who has actually built the thing.

Exam version 0.1 · Feb 2025 ~30 min read Plus a 10-hour lab plan

Format: Multiple choice, 4 options
Structure: 4 of 6 scenarios per form
Scoring: Scaled 100–1,000
Pass: 720
Penalty for guessing: None; answer everything
Target candidate: 6+ months hands-on

D1 · Agentic Architecture & Orchestration27%
D3 · Claude Code Configuration & Workflows20%
D4 · Prompt Engineering & Structured Output20%
D2 · Tool Design & MCP Integration18%
D5 · Context Management & Reliability15%

Part 01

How this exam thinks

Before any domain content, internalize the disposition the exam is testing. Almost every "best answer" question rewards the same handful of judgments. If you can recognize them, you can answer questions on topics you only half-remember.

The five judgments the exam keeps testing

Deterministic over probabilistic when business logic depends on it. If a rule must hold (verify identity before refund, block refunds over $500, force a tool to run first), the right answer is a programmatic mechanism: a hook, a prerequisite gate, a forced tool_choice. The wrong answers are "improve the system prompt" and "add few-shot examples." Prompt-based compliance is probabilistic; financial and identity workflows can't tolerate that.
Model-driven decisions over hand-rolled control flow. Loops should branch on stop_reason. Coordinators should decide which subagents to invoke. Don't parse natural-language signals to decide "is the agent done." Don't pre-route requests with a regex classifier when a tool description fix gets you 80% of the way there.
Match the proportionate first response. When the question describes a problem and offers four fixes, the right answer is usually the smallest mechanism that addresses the root cause. Tool descriptions are minimal? Fix the descriptions. Escalation is wrong? Add explicit criteria with few-shot examples. Don't reach for a separate classifier model, ML pipeline, or higher-tier model when the cheap fix hasn't been tried.
Subagents have isolated context. Always. Subagents do not inherit the coordinator's history. The coordinator must pass complete findings, structured metadata, and quality criteria explicitly in the subagent's prompt. Whenever an answer hinges on "the subagent will know X," verify X was passed in.
Structure beats summary. When in doubt, the right answer is "preserve structured data": claim-source mappings, error categories with retryable flags, dates and provenance, conflicting values both annotated. The wrong answer is "summarize and pick one" or "return generic status."

Read every question through these five lenses before considering content knowledge. You will recognize the right answer faster than you can reason it from first principles.

The anti-patterns that are always wrong

The exam is generous: a handful of distractors are designed to be obviously wrong once you know the pattern. Burn these into memory:

Self-reported model confidence as a routing signal. The model is already overconfident on the cases it's getting wrong. Confidence scores from the model itself are not calibrated. (You can use field-level confidence calibrated against a labeled validation set; that's different.)
Sentiment analysis as a complexity signal. A frustrated customer with a simple problem and a calm customer with a complex one both exist. Sentiment doesn't predict case complexity.
"A bigger model / larger context window will fix it." Almost always wrong. Attention dilution, decomposition errors, and unclear criteria don't yield to more tokens.
"Suppress the error and return success." Catching a timeout and returning empty results as a successful query is always wrong. The coordinator loses the context it needs to recover.
"Terminate the entire workflow on any failure." Equally wrong from the other direction. Local recovery in subagents, structured propagation to the coordinator, partial results with coverage annotations.
Iteration caps and natural-language loop termination. An agentic loop terminates on stop_reason == "end_turn". Not when the assistant says "Done!" Not after N iterations as the primary stopping mechanism.

Part 02

The six scenarios

Every form draws four scenarios at random from the same six. Knowing them in advance is a real edge: when a question lands, you already have the system in your head and can focus on the specific judgment being tested.

Memorize these six setups

Customer support resolution agent. Agent SDK. MCP tools: get_customer, lookup_order, process_refund, escalate_to_human. Target: 80%+ first-contact resolution. Domains: D1, D2, D5.
Code generation with Claude Code. Custom slash commands, CLAUDE.md, plan mode vs direct execution. Domains: D3, D5.
Multi-agent research system. Coordinator delegates to web search, document analysis, synthesis, and report generator subagents. Domains: D1, D2, D5.
Developer productivity tools. Built-in Read/Write/Bash/Grep/Glob plus MCP. Domains: D2, D3, D1.
Claude Code in CI/CD. Automated reviews, test generation, PR feedback. Non-interactive flag, JSON output. Domains: D3, D4.
Structured data extraction. JSON schemas, validation, edge cases. Domains: D4, D5.

When a scenario opens, read the setup once and then go straight to the question. The setup is rarely a trick; the question is where the points live.

Part 03

Domain 1 · Agentic architecture & orchestration 27%

The biggest single block of the exam. Seven task statements, all about how an agent loop runs and how multi-agent systems coordinate. If you only have time to deeply prepare for one domain, prepare for this one.

The agentic loop, in one paragraph

You send a request to Claude. The response has a stop_reason. If it's "tool_use", you execute the requested tool, append the result to the conversation history, and send the next request. If it's "end_turn", you stop. That's the whole loop. The model decides when it's done, not your code.

Warn Iteration caps are a safety net, not a primary termination mechanism. Parsing assistant text for "Done!" is an anti-pattern. Tool results live in the conversation history so the model can reason about them on the next iteration.

Coordinator–subagent: the rules you must know

Hub-and-spoke architecture. All inter-subagent communication routes through the coordinator. This isn't aesthetic; it's where observability and consistent error handling live.
Subagents have isolated context. They do not inherit the coordinator's conversation history. The coordinator must pass complete findings (including structured metadata like source URLs, document names, and dates) directly in the subagent's prompt.
Spawn parallel subagents in a single response. The coordinator emits multiple Task tool calls in one turn, not across separate turns. This is how you get the latency win.
allowedTools must include "Task" for a coordinator to spawn subagents. If the question describes "the coordinator can't delegate," check this first.
Coordinator decomposition is the most common failure mode. If subagents complete successfully but the final report misses topic areas, the coordinator's task decomposition was too narrow. Don't blame downstream agents.
Coordinator prompts specify goals and quality criteria, not procedural steps. "Find comprehensive coverage of X" beats "First do A, then do B, then do C." Subagents need room to adapt.

Hooks vs prompts: the deterministic line

This is the single most-tested distinction in Domain 1. Memorize it cold:

Hooks Programmatic

For deterministic guarantees

Identity verification before financial action
Blocking refunds above a policy threshold
Forcing extract_metadata to run before enrichment
Normalizing heterogeneous data formats via PostToolUse
Anything where a non-zero failure rate is unacceptable

Prompts Probabilistic

For guidance and tone

Style, voice, helpfulness
Decomposition strategy
Soft preferences ("prefer concise responses")
Escalation criteria with few-shot examples
Anything where best-effort compliance is enough

Session management: resume vs fork vs fresh start

--resume <session-name> picks up a named session. Use when prior context is mostly still valid and you're continuing the same line of work.
fork_session creates an independent branch from a shared baseline. Use to explore divergent approaches (testing strategy A vs B, refactor approach 1 vs 2) from the same analysis.
Fresh start with an injected summary beats resuming when prior tool results are stale. After significant code modifications, telling a resumed session "by the way, file X changed in these specific ways" is better than expecting it to figure that out.

Decomposition strategy

Prompt chaining (fixed sequential pipeline) for predictable multi-aspect work. Example: per-file local review pass plus a cross-file integration pass.
Dynamic adaptive decomposition for open-ended investigation. Example: "add comprehensive tests to a legacy codebase" should map structure first, then prioritize, then expand subtasks as dependencies surface.
Multi-concern messages get decomposed in parallel. A customer with three issues becomes three concurrent investigations sharing context, not three sequential conversations.

Part 04

Domain 2 · Tool design & MCP integration 18%

Five task statements. The headline lesson: tool descriptions are the primary mechanism the model uses for tool selection. When two tools have minimal, similar descriptions, the model chooses badly. Half of Domain 2's right answers begin with "improve the descriptions."

What a good tool description contains

Purpose: what the tool does, in one sentence that distinguishes it from neighbors
Input format, including examples of valid inputs
Example queries that should call this tool
Edge cases and boundaries: when to use this tool vs a similar alternative
What the output looks like, so the model knows what to do with the result

If the question describes "agent picks the wrong tool" and the answer choices include "expand descriptions to include input formats, example queries, edge cases, and boundaries," that's the answer. Few-shot examples and routing classifiers are over-engineered for this problem.

Splitting vs consolidating tools

A generic analyze_document with overloaded behavior is a smell. Split it into extract_data_points, summarize_content, and verify_claim_against_source. Each gets a defined input/output contract. The exam consistently rewards splitting.

The opposite move (consolidating two tools into one polymorphic lookup_entity) is sometimes architecturally valid but is rarely the correct "first step" answer. The proportionate-response rule applies.

Structured error responses

Every MCP tool failure should return:

isError: true (the MCP flag)
errorCategory: transient / validation / business / permission
isRetryable: boolean
Human-readable description for the agent (and end-user via the agent)

Generic "Operation failed" responses are always wrong. They prevent the agent from making appropriate recovery decisions. A business-rule violation (refund exceeds policy) should be marked non-retryable with a customer-friendly explanation; a transient timeout should be marked retryable.

Common trap: the difference between an access failure (timeout, needs a retry decision) and a valid empty result (the query succeeded; there were just no matches). These need different return shapes. Conflating them is an exam-favorite distractor.

Tool distribution across agents

Tool count matters. Giving an agent 18 tools when 4–5 would do degrades selection reliability. Scope each agent's toolset to its role.
Cross-role tools are okay for high-frequency needs. A synthesis agent that constantly needs simple fact-checks gets a scoped verify_fact tool while complex verifications still route through the coordinator.
Replace generic with constrained. fetch_url becomes load_document with URL validation; the constrained version prevents misuse.

`tool_choice`: three modes, one common mistake

"auto": model chooses whether to call a tool or return text. The default.
"any": model must call some tool but picks which. Use to guarantee structured output when multiple extraction schemas exist.
{"type": "tool", "name": "..."}: forced selection of a specific tool. Use to ensure a particular extraction (like extract_metadata) runs before downstream enrichment.

The trap: people reach for forced selection when "any" would do. Forced selection is for ordering; "any" is for guaranteeing a structured response.

MCP server scoping

Project-scoped .mcp.json: shared team tooling, version-controlled, with ${ENV_VAR} expansion for credentials.
User-scoped ~/.claude.json: personal experiments that don't belong in the team config.
Both load simultaneously; their tools are all available to the agent at once.
MCP resources (not tools) are for content catalogs (issue summaries, schemas, doc hierarchies) that reduce exploratory tool calls.

Built-in Claude Code tools

Know what each one is for:

Grep: content search. Finding callers, error strings, imports.
Glob: path pattern matching. **/*.test.tsx, file-by-extension.
Read: load full file contents.
Write: full file write. The fallback when Edit can't find unique anchor text.
Edit: targeted modification by unique text match. Fast but fails on ambiguous anchors.
Bash: shell execution.

The exam-relevant move: build understanding incrementally with Grep first, then follow imports with Read. Don't read every file upfront.

Part 05

Domain 3 · Claude Code configuration & workflows 20%

Six task statements. The headline lesson: configuration hierarchy and scoping decisions are the difference between "works on my machine" and "works for the team."

The configuration hierarchy

Location	Scope	Shared via git?	Use for
`~/.claude/CLAUDE.md`	User	No	Personal preferences only
`.claude/CLAUDE.md` or root `CLAUDE.md`	Project	Yes	Team coding standards
Subdirectory `CLAUDE.md`	Directory	Yes	Area-specific conventions
`.claude/rules/*.md` with `paths:` frontmatter	Glob-pattern	Yes	Conventions that span directories (e.g. all `*/.test.*`)

The exam pattern: a question describes a "team member doesn't have the rules" symptom. The answer is "they're in user-scope; move to project-scope." Or: "test files are spread throughout the codebase and need consistent conventions." The answer is path-scoped rules with glob patterns, not a CLAUDE.md per directory.

Slash commands vs skills

Slash commands in .claude/commands/ (project) or ~/.claude/commands/ (user). For team-wide, version-controlled commands, project scope.
Skills in .claude/skills//SKILL.md with frontmatter:
- context: fork: runs in an isolated sub-agent context, output doesn't pollute main conversation. Use for verbose discovery or brainstorming.
- allowed-tools: restrict tool access during the skill (e.g., disallow destructive Bash).
- argument-hint: prompt the developer for required parameters when the skill is invoked without args.
Skills vs CLAUDE.md. Skills are on-demand, task-specific. CLAUDE.md is always-loaded universal standards. Don't put skills' contents in CLAUDE.md, and don't put always-on rules in a skill.

Plan mode vs direct execution

Plan mode

Architectural decisions
Multi-file refactors (45+ files)
Library migrations
Multiple valid approaches
Open-ended investigation

Direct execution

Single-file bug fix with a clear stack trace
Adding one validation conditional
Well-understood scope
One reasonable implementation

The Explore subagent isolates verbose discovery and returns summaries to preserve main context. Use it when an exploration phase will dump pages of file contents the main agent doesn't need to keep around.

Iterative refinement: when to do what

Concrete input/output examples. The most effective fix when prose descriptions are interpreted inconsistently. Two or three example pairs.
Test-driven iteration. Write the test suite first, then iterate by sharing test failures.
Interview pattern. Have Claude ask clarifying questions before implementing in unfamiliar domains. Surfaces considerations you'd miss.
Single message vs sequential. When fixes interact (e.g., changing one schema affects three call-sites), put all issues in one message. When they're independent, sequential is fine.

Claude Code in CI/CD

-p (or --print): non-interactive mode. The fix when the CI job hangs waiting for input. Don't reach for CLAUDE_HEADLESS=true or --batch; those are not real flags.
--output-format json --json-schema: structured findings parseable for automated PR comments.
Independent review instances beat self-review. The same session that generated code is biased toward defending its decisions. A second instance without the generator's reasoning context catches more.
Per-file passes plus a cross-file integration pass. This is the fix for "single-pass review of 14 files gives shallow, inconsistent feedback." Attention dilution is the root cause; bigger context windows don't help.
Including prior review findings when re-running on new commits prevents duplicate comments. Instruct Claude to report only new or still-unaddressed issues.
CLAUDE.md is how CI gets project context: testing standards, fixture conventions, valuable-test criteria. Without it, generated tests are shallow.

Part 06

Domain 4 · Prompt engineering & structured output 20%

Six task statements. The headline lesson: specific criteria and few-shot examples beat vague guidance every time, and tool_use with JSON schemas is the only reliable path to structured output.

Explicit criteria over vague instructions

"Be conservative" and "only report high-confidence findings" do nothing for precision. The right shape is explicit categorical criteria with concrete code examples for each severity level. "Flag a comment only when the claimed behavior contradicts the actual code behavior" is the form the exam rewards.

Corollary: when false-positive rates are high in one category, you can temporarily disable that category to restore developer trust while you improve the prompt for it. Don't let one noisy category undermine confidence in the accurate ones.

Few-shot examples: when and how many

2–4 targeted examples for ambiguous scenarios (not 10+).
Show the reasoning for why one action was chosen over plausible alternatives. Don't just show input → output; show the judgment.
Include examples that distinguish acceptable patterns from genuine issues. This is how you reduce false positives while keeping generalization.
Examples are most effective for: ambiguous tool selection, format consistency, varied document structures (inline citations vs bibliographies), and edge cases like null fields.

`tool_use` with JSON schemas: the only reliable path

Schema-compliant structured output requires tool_use. Free-form "respond in JSON format" prompts produce syntax errors. tool_use eliminates them.

Strict schemas eliminate syntax errors but not semantic errors (line items that don't sum to the total, values placed in the wrong field). Build separate validation for the semantic layer.
Required vs optional (nullable) fields matter. If a source document may not contain a piece of information, mark the field nullable. A required field tells the model to fabricate a value to satisfy the schema.
Enum + "other" + detail string is the right pattern for extensible categorization. Adding "unclear" as an enum value is the right pattern for ambiguous cases.
Format normalization rules live in the prompt alongside the strict schema. The schema doesn't normalize; the prompt does.

Validation, retry, feedback loops

Retry with the specific validation error appended to the prompt. The model corrects format/structural errors well when told what failed.
Retries don't help when the information is absent from the source. If a required field can't be filled because the source doc didn't include it, no amount of retrying creates the data. Mark the field nullable instead.
detected_pattern field. A structured-output trick. Tag each finding with the code construct that triggered it, so when developers dismiss findings you can analyze the dismissal patterns systematically.
Self-correction validators. Extract calculated_total alongside stated_total to flag discrepancies. Add conflict_detected booleans to mark inconsistent source data.

Batch processing: when it's right and when it's wrong

Workflow	API choice	Why
Blocking pre-merge check	Synchronous	Developer is waiting; up to 24h is unacceptable
Overnight technical debt report	Message Batches	50% cost saving; latency tolerant
Weekly audit	Message Batches	Same
Real-time customer support agent	Synchronous	Multi-turn tool calling not supported in batch

The Message Batches API gives 50% cost savings and a processing window up to 24 hours with no latency SLA. It does not support multi-turn tool calling. Use custom_id to correlate request/response pairs and to identify failed documents for resubmission.

Multi-instance and multi-pass review

Self-review is biased. A model that just generated code retains reasoning context that makes it less likely to question its own decisions. Use a second instance.
Per-file passes + cross-file integration pass beats a single pass over many files.
Verification passes with self-reported confidence per finding enable calibrated review routing. The calibration belongs in the routing logic, not in trusting the score.

Part 07

Domain 5 · Context management & reliability 15%

Six task statements. The smallest domain by weight, but the one with the highest density of "anti-patterns that are always wrong" gotchas.

Context management for long interactions

Progressive summarization loses numbers. Amounts, dates, percentages, customer-stated expectations get vague-ified. Extract them into a persistent "case facts" block included in every prompt outside the summarized history.
Lost-in-the-middle. Models reliably process input at the start and end. Put key findings summaries at the top of aggregated inputs; use explicit section headers for detailed results.
Trim verbose tool outputs. An order lookup with 40+ fields, of which only 5 are relevant to the current task, should be trimmed before accumulating in context.
Pass complete history. Subsequent API requests need the full conversation to maintain coherence. There's no server-side session.
Restructure upstream agents to return structured data (key facts, citations, relevance scores) instead of verbose content when downstream agents have limited context budgets.

Escalation: the rules

Honor explicit customer requests for a human immediately. Don't first try to investigate. Don't "see if I can help with that." Just escalate.
Acknowledge frustration, then offer resolution if the issue is in scope. Escalate only if the customer reiterates the preference. (This is the nuanced case: explicit demand → escalate; implicit frustration with an in-scope issue → offer to help, then escalate if pushed.)
Escalate when policy is silent on the customer's specific situation (competitor price-matching when policy only addresses own-site adjustments).
Multiple matches → ask for more identifiers. Don't pick by heuristic.
Escalation criteria belong in the system prompt with few-shot examples. Not in a separate classifier model. Not in self-reported confidence. Not in sentiment analysis.

Always wrong: sentiment-based escalation routing, model self-reported confidence as the trigger, and separate ML classifiers when prompt optimization hasn't been tried.

Error propagation in multi-agent systems

Structured error context (failure type, attempted query, partial results, alternative approaches) enables intelligent coordinator recovery.
Distinguish access failures from valid empty results.
Local recovery in subagents for transient failures. Only propagate what can't be resolved locally, with what was attempted and any partial results.
Coverage-annotated synthesis output. Mark which findings are well-supported and which topic areas have gaps due to unavailable sources. Don't paper over gaps.

Large-codebase exploration

Context degrades in long sessions. The model starts referencing "typical patterns" instead of the specific classes it found earlier. Counter with scratchpad files.
Subagent delegation isolates verbose discovery from the main coordination context.
Structured state exports (manifests) for crash recovery. Each agent dumps state to a known location; the coordinator loads the manifest on resume.
/compact reduces context usage when extended exploration has filled it.

Human review and confidence calibration

Aggregate accuracy hides per-segment failure. 97% overall can mask 60% on one document type. Stratify by document type and field before automating.
Stratified random sampling of high-confidence extractions catches novel error patterns over time.
Field-level confidence calibrated against a labeled validation set is the right routing signal. Raw model self-reports are not.
Route low-confidence and ambiguous-source extractions to human review. Reviewer capacity is finite; spend it on uncertainty.

Multi-source synthesis: provenance and conflicts

Claim-source mappings travel with findings through every stage of synthesis. Source URLs, document names, excerpts, dates.
Conflicting statistics: annotate with attribution. Don't pick one. Distinguish well-established findings from contested ones in the report.
Publication and collection dates are mandatory in structured outputs. Without them, temporal differences look like contradictions.
Render content types appropriately. Financial data as tables, news as prose, technical findings as structured lists. Don't force everything into one shape.

Part 08

Question patterns the exam rewards

After working through the sample questions, a few grammars repeat. If you can recognize them, you can answer in 30 seconds instead of two minutes.

Pattern A · "Production data shows X is happening N% of the time. What change would most effectively address this?"

The right answer is the smallest mechanism that addresses the root cause. Read the cause carefully. Is it a missing prerequisite (programmatic hook), a tool-selection ambiguity (better descriptions), or an unclear decision boundary (explicit criteria with few-shot)? The wrong answers will be (a) heavier infrastructure than needed, (b) prompt-only fixes for problems that need determinism, or (c) addressing the wrong layer (tool availability when the problem is tool ordering).

Pattern B · "All subagents complete successfully but the output is wrong. Root cause?"

Almost always the coordinator's task decomposition was too narrow. Don't blame downstream agents that worked correctly within their assigned scope. Read the coordinator's logs in the question; they reveal the answer directly.

Pattern C · "Latency is too high; how do we reduce overhead?"

Look for the principle of least privilege. Give the agent that needs frequent simple lookups a scoped tool for the common case; keep complex cases routing through the coordinator. Don't over-provision (give it all the web search tools), don't speculatively cache, don't batch in a way that creates blocking dependencies.

Pattern D · "How should this failure flow back?"

Structured error context with failure type, attempted query, partial results, and alternatives. Generic statuses, suppressed errors marked as success, and workflow-terminating exceptions are all anti-patterns.

Pattern E · "Where should this configuration live?"

If it should reach every developer via git: project-scope (.claude/...). If it's personal: user-scope (~/.claude/...). If it applies to files spread across directories: path-scoped rules with glob patterns. If it should run on demand and not pollute context: a skill with context: fork.

Pattern F · "Single-pass review of N files is inconsistent."

Per-file passes for local issues + a separate cross-file integration pass. Not a bigger model, not consensus voting, not pushing the burden to developers.

Pattern G · "Use Message Batches API for both workflows?"

Match each workflow to the API. Batch for non-blocking, latency-tolerant. Synchronous for blocking. Never both with a "fallback if too slow." That's added complexity for no gain.

Part 09

The 10-hour lab plan

Reading the exam guide gets you maybe 60% of the way. The other 40% is the muscle memory that comes from actually building the things. This lab plan is structured to give you concrete reps on every domain in roughly ten hours of build time. It's organized as four labs that compose into one realistic system.

Don't just read these. Build them. The questions become almost trivial once you've debugged the failure modes yourself.

Fast pass for all three labs: the working scaffolds live at github.com/darindeters/claude-architect-labs. Clone the repo, pick a lab, run the boot command. Each lab boots in 30 seconds and the TODO blocks map exactly to the numbered steps below.

Lab 1: A customer support agent with a hook-enforced prerequisite

≈ 3 hours · domains 1, 2, 5

Goal: Build the customer support agent from Scenario 1. By the end, you will have written code that demonstrates every Domain 1 distinction the exam tests.

⚡

Fast pass

Skip the plumbing, keep the lessons

Working agent loop, four MCP-style tools with deliberately weak descriptions, hook placeholders, fixtures with ten test customers, and a runnable python -m src.agent entry point. The TODOs in tools.py, hooks.py, and case_facts.py are exactly the steps below. Boots in 30 seconds.

★ Open Lab 1 on GitHub git clone https://github.com/darindeters/claude-architect-labs && cd claude-architect-labs/lab1-cs-agent

Before you start

Account & budget. An Anthropic Console account with API access. Budget about $5; you'll burn most of it during the misroute-then-fix exercise in step 2.
Pick one language. Python (recommended; the Agent SDK examples are densest here) or TypeScript. Don't flip mid-lab.

Install:

# Python (recommended)
python -m venv .venv && source .venv/bin/activate
pip install anthropic claude-agent-sdk

# or TypeScript
npm init -y && npm i @anthropic-ai/claude-agent-sdk @anthropic-ai/sdk

Auth: export ANTHROPIC_API_KEY=sk-ant-... in the same shell. Don't put it in your code.
Scaffold three files before writing any agent logic:
- tools.py: fake get_customer, lookup_order, process_refund, escalate_to_human that return hard-coded data for ten test customers. Don't bother with a real DB; a Python dict is enough.
- hooks.py: empty for now; the prerequisite and threshold hooks land here in steps 3 and 4.
- agent.py: the loop and the main() entry point.
Test fixtures: ten fake customer records in JSON with at least one customer who has multiple orders, one with no orders, and one whose default refund would exceed $500. You need these edge cases to verify the hooks in step 4.
Reference docs: the Agent SDK overview and the MCP spec. Skim, don't read end-to-end.

Scaffold an Agent SDK loop

Build a loop that sends a request, inspects stop_reason, executes any requested tools, appends results to history, and continues until stop_reason == "end_turn". Resist the urge to add an iteration cap as the primary stop condition. Add one as a safety net only.

Implement four MCP tools

get_customer, lookup_order, process_refund, escalate_to_human. Write deliberately vague descriptions on the first pass and observe the agent misroute. Then expand each description with input formats, example queries, and explicit "use this when…" / "don't use this when…" boundaries. The improvement is the lesson.

Add a programmatic prerequisite

Block process_refund until get_customer has returned a verified customer ID. Test that the agent can't bypass it even with a system prompt that says "skip verification." This is the deterministic-vs-probabilistic distinction.

Add a refund-threshold hook

Intercept outgoing process_refund calls and block any over $500, redirecting to escalate_to_human with a structured handoff (customer ID, root cause, refund amount, recommendation).

Add structured errors

Each tool returns isError, errorCategory, isRetryable, and a customer-friendly message. Force a transient failure (random 30% timeout) and observe the agent retry. Force a business-rule failure and observe it explain rather than retry.

Add a "case facts" persistent block

Extract amounts, dates, order numbers, customer-stated expectations into a structured block included in every prompt. Run a 20-turn conversation and verify nothing important degrades into vague summary.

What success looks like

Running python agent.py "I need a refund on order #1003" walks through get_customer → lookup_order → process_refund end-to-end without errors.
Asking for a refund of $750 triggers escalate_to_human instead of process_refund, even if you change the system prompt to insist on auto-approving.
Commenting out the prerequisite hook reveals the agent occasionally skipping get_customer; restoring it makes the violation impossible. That comparison is the lesson.
A 30%-rate transient failure (you can simulate by raising an exception sometimes from lookup_order) results in retry behavior; a business-rule failure (e.g. "refund window expired") results in the agent explaining to the customer rather than retrying.

Lab 2: Multi-agent research with parallel subagents and provenance

≈ 3 hours · domains 1, 2, 5

Goal: Build the research system from Scenario 3. By the end, you will have lived through the coordinator-decomposition failure mode, the parallel-spawn latency win, and the provenance-loss anti-pattern.

⚡

Fast pass

Skip the plumbing, keep the lessons

Coordinator + four AgentDefinitions wired up. Includes both procedural and goals-and-criteria coordinator prompts so you can run the failure case directly. Five fake search results, four fixture documents (two with a planted contradiction on a 2024 vs 2025 statistic, which is a temporal difference rather than real disagreement), and timeable parallel/sequential spawning toggle.

★ Open Lab 2 on GitHub git clone https://github.com/darindeters/claude-architect-labs && cd claude-architect-labs/lab2-research

Before you start

Reuse Lab 1's environment. Same .venv, same ANTHROPIC_API_KEY, same SDK install. Just create a new directory lab2-research/ alongside it.
Budget ~$10. Multi-agent runs make more API calls than Lab 1; the parallel-spawn comparison alone runs the system twice.
Don't try to integrate real web search yet. Fake the web_search tool with hard-coded JSON results for one or two seed queries. The exam isn't about search APIs; it's about coordinator decisions and information flow.
Scaffold five files before adding logic:
- fixtures/search_results.json: five fake search hits per query, each with { url, title, snippet, publication_date }.
- fixtures/documents/: three to five plain-text documents (paste from Wikipedia or a public arXiv abstract). Two should contradict each other on a specific statistic; you'll need that for step 5.
- subagents.py: four AgentDefinitions: web_search, doc_analysis, synthesis, report_gen. Each gets a tight allowedTools list scoped to its role.
- coordinator.py: the orchestrator. Its allowedTools must include "Task" or it can't spawn anything.
- main.py: the CLI entry: python main.py "impact of AI on creative industries".
Pick a research topic that's broader than it looks. "Impact of AI on creative industries" is a great test case because lazy decomposition will miss music, writing, and film. That's exactly the failure mode step 3 forces you to confront.
Reference docs: the Agent SDK section on subagents and the Task tool. Note that subagents do not inherit conversation history; you must pass everything in the prompt.

Coordinator + four subagents

Web search, document analysis, synthesis, report generator. Each AgentDefinition with a tight allowedTools set. The coordinator's allowedTools must include "Task".

Parallel spawning

Have the coordinator emit multiple Task calls in a single response. Time it. Then refactor to sequential and time again. The latency delta is the lesson.

Force the decomposition failure

Run the system on "impact of AI on creative industries" with a coordinator prompt that's overly procedural. Watch it produce a narrow report. Switch the prompt to one that specifies goals and quality criteria. Re-run.

Structured findings with claim-source mappings

Each finding is { claim, evidence_excerpt, source_url, source_name, publication_date }. Synthesis preserves the mapping through to the final report.

Inject conflicting sources

Plant two sources with contradicting statistics. Verify the report annotates both with attribution rather than picking one. Verify dates prevent temporal differences from being labelled contradictions.

Add a scoped verify_fact tool to synthesis

For the simple-verification common case. Complex verifications still route through the coordinator. Measure the round-trip reduction.

Simulate a web-search timeout

Verify the subagent returns structured error context (failure type, attempted query, partial results) rather than a generic "search unavailable," and that the coordinator can proceed with partial results plus a coverage-gap annotation.

What success looks like

One coordinator request produces a final report citing at least three distinct sources, each tagged with URL and publication date.
Timing the parallel-spawn version vs sequential: the parallel version should be at least 2× faster on a four-subagent fan-out. (If it isn't, you're spawning across turns instead of in one response.)
The decomposition failure case: with a procedural coordinator prompt, the report misses obvious topic areas. With a goals-and-criteria prompt, it covers them. Both runs are the lesson.
A planted contradiction in two fixture documents shows up in the final report as two annotated values with attribution, not one value silently chosen.
Killing the web_search subagent mid-run produces a final report that explicitly notes coverage gaps, rather than terminating the workflow or silently producing a partial answer marked as success.

Lab 3: Claude Code for a real team workflow

≈ 2 hours · domains 2, 3

Goal: Configure a project the way Domain 3 questions assume it's configured. Pick any repo you have write access to.

Hierarchy

CLAUDE.md at three levels

Root CLAUDE.md with universal team standards
One subdirectory CLAUDE.md for area-specific conventions
Test that running Claude in different directories pulls the right merge of rules via /memory

Path-scoped

.claude/rules/ with globs

One rule file with paths: ["**/*.test.*"] for test conventions
One with paths: ["src/api/**/*"] for API handler conventions
Edit a test file outside the API directory; verify only the test rule loads

Slash commands

Project-scoped `/review`

Create .claude/commands/review.md
Commit it; have a teammate clone and verify it's available
Compare to a personal command in ~/.claude/commands/

Skills

An isolated skill

SKILL.md with context: fork and a restrictive allowed-tools
Add argument-hint for required parameters
Verify its output doesn't pollute the main conversation

MCP

Project + user MCP servers

One MCP server in .mcp.json with ${API_TOKEN} expansion
One personal experimental server in ~/.claude.json
Verify both load simultaneously and tools from both are available

Plan vs direct

Three tasks of varying scope

A single-file bug fix with a clear stack trace → direct execution
A library migration touching dozens of files → plan mode
An open-ended new feature → plan mode (then direct execute the plan)

Lab 4: Structured extraction with validation and batch

≈ 2 hours · domains 4, 5

Goal: Build the extraction system from Scenario 6. By the end, you will have practiced every Domain 4 schema-design decision the exam tests.

⚡

Fast pass

Skip the plumbing, keep the lessons

Pydantic schema with optional/nullable, enum + "other" + detail, and an "unclear" enum value. Tool_use call with forced tool_choice. Ten invoice fixtures including five clean ones and five planted edge cases (missing fields, bad totals, informal dates, off-enum category, internal contradiction). Validators and batch are stubbed for steps 5 and 6.

★ Open Lab 4 on GitHub git clone https://github.com/darindeters/claude-architect-labs && cd claude-architect-labs/lab4-extract

Before you start

Reuse the environment from Lab 1. New directory lab4-extract/; same .venv, same API key. Add Pydantic:
```
pip install anthropic pydantic
```
Pick one document type and stick with it. The exam questions assume you know your corpus; switching types mid-lab burns time. Three good choices, in order of difficulty:
- Invoices (easiest): clear required fields (vendor, line items, totals, dates). Find sample PDFs on the IRS site or invoice-template repos.
- Scientific abstracts: varied structure (inline citations vs bibliographies) is great practice for step 3's few-shot examples.
- SEC 10-K excerpts (hardest): long, heterogeneous, lots of nullable fields. Skip on first pass unless you're comfortable.
Don't start with the batch API. Get the synchronous flow working on five documents first; step 6 introduces batching only after the schema is reliable. Most lab failures come from people skipping ahead to batching and debugging two layers at once.
Scaffold four files before extracting anything:
- schema.py: Pydantic models. Required fields are exceptions; default to optional/nullable so you can demonstrate the "fabrication when required" failure mode.
- extract.py: single-document messages.create call with tool_choice forcing your extraction tool. Returns a Pydantic model.
- validate.py: semantic validators (e.g., line_items.sum() == stated_total; conflict detection between fields).
- batch.py: last. Message Batches client. Don't open this file until extract.py is reliable on at least 5/5 documents.
Sample corpus: ten to twenty source documents in a fixtures/ directory. Plant deliberate problems: one with absent required info (to test step 2's nullable behavior), one with totals that don't add up (step 5), one with informal date formatting (step 3's normalization).
Reference docs: the tool use guide and the Message Batches API reference. Note Batches has no multi-turn tool calling; you can't do validation-retry inside a batch request.

Define an extraction tool

JSON schema with required, optional (nullable), and enum fields. One enum should have an "other" value paired with a detail string field. Use tool_choice set to force this specific tool first.

Process documents with absent fields

Extract from documents where some fields don't appear. Verify the model returns null rather than fabricating values. Then mark those fields required and watch hallucination kick in. Revert.

Validation-retry loop

On schema-validation failure, send a follow-up with the document, the failed extraction, and the specific validation error. Track which errors resolve via retry (format mismatches) and which don't (information genuinely absent).

Add few-shot examples for varied formats

Show extraction from inline-citation papers, papers with bibliographies, papers with embedded methodology, papers with explicit methodology sections. Two to four targeted examples. Measure the empty-field rate before and after.

Self-correction validators

Have the extractor return both calculated_total and stated_total; flag discrepancies. Add a conflict_detected boolean for inconsistent source data.

Batch processing

Submit 100 documents via the Message Batches API with a custom_id per request. Calculate the submission cadence to guarantee a 30-hour SLA given the 24-hour processing window. Resubmit any failures by custom_id, chunking ones that exceeded context limits.

Confidence calibration & stratified sampling

Output field-level confidence. Calibrate thresholds against a small labeled validation set. Stratify accuracy measurement by document type and field. Verify your "97% overall" isn't masking a 60% on one segment.

What success looks like

extract.py runs cleanly on at least 8 of your 10 source documents and returns a valid Pydantic model.
The document with deliberately-absent required info: optional fields return null; flip them to required and re-run, watch the model fabricate plausible values to satisfy the schema. That side-by-side is the lesson.
The validation-retry loop visibly resolves a format error (e.g., date in "03/14/26" format vs ISO) on the second try. The same loop visibly fails to resolve when the source document genuinely lacks a required field, proving retry isn't magic.
The planted "totals don't add up" document gets caught by your semantic validator and routed to the conflict_detected path rather than passing through silently.
Your batch run on 100 documents (step 6) handles failures by custom_id: failed documents are identifiable, resubmittable, and you can show the SLA math (24-hour processing window vs your target).
Stratified accuracy measurement reveals at least one segment that performs measurably worse than the aggregate. If everything is uniformly excellent, you didn't plant enough variety in the corpus.

Optional Lab 5: A CI/CD review pipeline

≈ 1 hour · domains 3, 4

If you have the appetite: wire Claude Code into a GitHub Actions workflow with claude -p --output-format json --json-schema, post structured findings as inline PR comments, and re-run on subsequent commits while passing prior findings to avoid duplicates. This makes the Scenario 5 questions feel familiar instead of theoretical.

Part 10

What is NOT on the exam

The exam guide is explicit about what's out of scope. Don't waste prep time on these:

Fine-tuning, training, or model weights
Authentication, billing, account management, OAuth flows, API key rotation
Deploying or hosting MCP servers (infrastructure, networking, container orchestration)
Constitutional AI, RLHF, safety-training methodology
Embedding models or vector database internals
Computer use (browser automation, desktop interaction)
Vision and image analysis
Streaming API implementation, server-sent events
Rate limiting, quotas, pricing math
Specific cloud provider configs (AWS, GCP, Azure)
Performance benchmarking, model comparison metrics
Prompt caching internals (beyond knowing it exists)
Token counting algorithms or tokenization specifics

This list is generous. If a prep resource is heavy on any of the above, it isn't aligned with this exam.

Part 11

Day-of strategy

Reading the question

The scenario is one of six. Recognize it in the first sentence. Then go to the question.
Identify which domain is being tested. Each scenario maps to a primary set of domains; the question will be in one of them.
Look for the "production data shows…" or "logs show…" framing. The setup usually contains the root cause if you read carefully. The right answer addresses what the logs say happened, not what you'd guess.
Ask the five judgments from the opening section: deterministic-when-required, model-driven-decisions, proportionate-first-response, isolated-subagent-context, structure-beats-summary.

Reading the answers

Eliminate the always-wrong distractors first. Sentiment routing, self-reported confidence as a routing signal, "use a bigger model," "suppress the error and return success," "terminate the workflow." If any of those appear, scratch them.
Eliminate over-engineered answers. Separate classifier model, ML pipeline, custom routing layer: these are usually wrong when a smaller mechanism would do. Unless the question explicitly says "we've tried the simple thing and it didn't work."
Eliminate "addresses the wrong layer" answers. If the problem is tool ordering, an answer about tool availability is wrong. If the problem is decomposition, an answer about the synthesis agent is wrong.
Of the remaining one or two, pick the smallest mechanism that fixes the root cause.

Time and pacing

There's no penalty for guessing. Answer everything. An unanswered question is scored as incorrect.
If a question is taking more than two minutes, mark and move on. Come back with fresh eyes.
Trust your first read on the pattern questions (Patterns A–G). Second-guessing usually moves you toward over-engineered distractors.

The week before

Re-read the six scenarios until you can recall each one cold.
Re-read the sample questions and explanations. The explanations teach the test's grammar.
Run through your Lab 1 and Lab 2 code one more time. The pattern questions become trivial when the code is fresh.
Take the practice exam if you have access. It's the highest-leverage hour you can spend.

Part 12

Closing

This exam tests one thing more than any other: do you have the disposition of someone who has shipped agentic systems to production? The candidate who has lived through a misrouted refund, a coordinator that decomposed too narrowly, or an extraction schema that hallucinated to satisfy a required field: that candidate already knows most of the answers.

The exam is generous to the practitioner and unforgiving to the memorizer. Build the labs. The questions will look like things you've already debugged. — TWD

If you internalize the five judgments, recognize the seven question patterns, and complete the four labs, the rest is reading the question carefully and matching the answer to the smallest mechanism that fixes the root cause. That's the test. Good luck.

Companion reading on the same site: Best Practices covers the model-agnostic disposition this exam tests. Claude Code covers the tool you'll use in Domain 3.

How this exam thinks

The five judgments the exam keeps testing

The anti-patterns that are always wrong

The six scenarios

Memorize these six setups

Domain 1 · Agentic architecture & orchestration 27%

The agentic loop, in one paragraph

Coordinator–subagent: the rules you must know

Hooks vs prompts: the deterministic line

Hooks Programmatic

Prompts Probabilistic

Session management: resume vs fork vs fresh start

Decomposition strategy

Domain 2 · Tool design & MCP integration 18%

What a good tool description contains

Splitting vs consolidating tools

Structured error responses

Tool distribution across agents

tool_choice: three modes, one common mistake

MCP server scoping

Built-in Claude Code tools

Domain 3 · Claude Code configuration & workflows 20%

The configuration hierarchy

Slash commands vs skills

Plan mode vs direct execution

Plan mode

Direct execution

Iterative refinement: when to do what

Claude Code in CI/CD

Domain 4 · Prompt engineering & structured output 20%

Explicit criteria over vague instructions

Few-shot examples: when and how many

tool_use with JSON schemas: the only reliable path

Validation, retry, feedback loops

Batch processing: when it's right and when it's wrong

Multi-instance and multi-pass review

Domain 5 · Context management & reliability 15%

Context management for long interactions

Escalation: the rules

Error propagation in multi-agent systems

Large-codebase exploration

Human review and confidence calibration

Multi-source synthesis: provenance and conflicts

Question patterns the exam rewards

Pattern A · "Production data shows X is happening N% of the time. What change would most effectively address this?"

Pattern B · "All subagents complete successfully but the output is wrong. Root cause?"

Pattern C · "Latency is too high; how do we reduce overhead?"

Pattern D · "How should this failure flow back?"

Pattern E · "Where should this configuration live?"

Pattern F · "Single-pass review of N files is inconsistent."

Pattern G · "Use Message Batches API for both workflows?"

The 10-hour lab plan

Lab 1: A customer support agent with a hook-enforced prerequisite

Skip the plumbing, keep the lessons

Lab 2: Multi-agent research with parallel subagents and provenance

Skip the plumbing, keep the lessons

Lab 3: Claude Code for a real team workflow

CLAUDE.md at three levels

.claude/rules/ with globs

Project-scoped /review

An isolated skill

Project + user MCP servers

Three tasks of varying scope

Lab 4: Structured extraction with validation and batch

Skip the plumbing, keep the lessons

Optional Lab 5: A CI/CD review pipeline

What is NOT on the exam

Day-of strategy

Reading the question

Reading the answers

Time and pacing

The week before

Closing

`tool_choice`: three modes, one common mistake

`tool_use` with JSON schemas: the only reliable path

Project-scoped `/review`