Claude Certified Architect Foundations
A study companion for Anthropic's first solution-architect exam. The concepts you have to truly understand, the patterns the questions reward, and a hands-on lab plan so you sit the exam with the muscle memory of someone who has actually built the thing.
- Format: Multiple choice, 4 options
- Structure: 4 of 6 scenarios per form
- Scoring: Scaled 100–1,000
- Pass: 720
- Penalty for guessing: None; answer everything
- Target candidate: 6+ months hands-on
- D1 · Agentic Architecture & Orchestration · 27%
- D3 · Claude Code Configuration & Workflows · 20%
- D4 · Prompt Engineering & Structured Output · 20%
- D2 · Tool Design & MCP Integration · 18%
- D5 · Context Management & Reliability · 15%
How this exam thinks
Before any domain content, internalize the disposition the exam is testing. Almost every "best answer" question rewards the same handful of judgments. If you can recognize them, you can answer questions on topics you only half-remember.
The five judgments the exam keeps testing
- Deterministic over probabilistic when business logic depends on it. If a rule must hold (verify identity before refund, block refunds over $500, force a tool to run first), the right answer is a programmatic mechanism: a hook, a prerequisite gate, a forced `tool_choice`. The wrong answers are "improve the system prompt" and "add few-shot examples." Prompt-based compliance is probabilistic; financial and identity workflows can't tolerate that.
- Model-driven decisions over hand-rolled control flow. Loops should branch on `stop_reason`. Coordinators should decide which subagents to invoke. Don't parse natural-language signals to decide "is the agent done." Don't pre-route requests with a regex classifier when a tool-description fix gets you 80% of the way there.
- Match the proportionate first response. When the question describes a problem and offers four fixes, the right answer is usually the smallest mechanism that addresses the root cause. Tool descriptions are minimal? Fix the descriptions. Escalation is wrong? Add explicit criteria with few-shot examples. Don't reach for a separate classifier model, ML pipeline, or higher-tier model when the cheap fix hasn't been tried.
- Subagents have isolated context. Always. Subagents do not inherit the coordinator's history. The coordinator must pass complete findings, structured metadata, and quality criteria explicitly in the subagent's prompt. Whenever an answer hinges on "the subagent will know X," verify X was passed in.
- Structure beats summary. When in doubt, the right answer is "preserve structured data": claim-source mappings, error categories with retryable flags, dates and provenance, conflicting values both annotated. The wrong answer is "summarize and pick one" or "return generic status."
The anti-patterns that are always wrong
The exam is generous: a handful of distractors are designed to be obviously wrong once you know the pattern. Burn these into memory:
- Self-reported model confidence as a routing signal. The model is already overconfident on the cases it's getting wrong. Confidence scores from the model itself are not calibrated. (You can use field-level confidence calibrated against a labeled validation set; that's different.)
- Sentiment analysis as a complexity signal. A frustrated customer with a simple problem and a calm customer with a complex one both exist. Sentiment doesn't predict case complexity.
- "A bigger model / larger context window will fix it." Almost always wrong. Attention dilution, decomposition errors, and unclear criteria don't yield to more tokens.
- "Suppress the error and return success." Catching a timeout and returning empty results as a successful query is always wrong. The coordinator loses the context it needs to recover.
- "Terminate the entire workflow on any failure." Equally wrong from the other direction. Local recovery in subagents, structured propagation to the coordinator, partial results with coverage annotations.
- Iteration caps and natural-language loop termination. An agentic loop terminates on `stop_reason == "end_turn"`. Not when the assistant says "Done!" Not after N iterations as the primary stopping mechanism.
The six scenarios
Every form draws four scenarios at random from the same six. Knowing them in advance is a real edge: when a question lands, you already have the system in your head and can focus on the specific judgment being tested.
Memorize these six setups
- Customer support resolution agent. Agent SDK. MCP tools: `get_customer`, `lookup_order`, `process_refund`, `escalate_to_human`. Target: 80%+ first-contact resolution. Domains: D1, D2, D5.
- Code generation with Claude Code. Custom slash commands, CLAUDE.md, plan mode vs direct execution. Domains: D3, D5.
- Multi-agent research system. Coordinator delegates to web search, document analysis, synthesis, and report generator subagents. Domains: D1, D2, D5.
- Developer productivity tools. Built-in Read/Write/Bash/Grep/Glob plus MCP. Domains: D2, D3, D1.
- Claude Code in CI/CD. Automated reviews, test generation, PR feedback. Non-interactive flag, JSON output. Domains: D3, D4.
- Structured data extraction. JSON schemas, validation, edge cases. Domains: D4, D5.
When a scenario opens, read the setup once and then go straight to the question. The setup is rarely a trick; the question is where the points live.
Domain 1 · Agentic architecture & orchestration · 27%
The biggest single block of the exam. Seven task statements, all about how an agent loop runs and how multi-agent systems coordinate. If you only have time to deeply prepare for one domain, prepare for this one.
The agentic loop, in one paragraph
You send a request to Claude. The response has a stop_reason. If it's "tool_use", you execute the requested tool, append the result to the conversation history, and send the next request. If it's "end_turn", you stop. That's the whole loop. The model decides when it's done, not your code.
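A minimal sketch of that loop, assuming the Anthropic Python SDK; `run_tool`, the tool list, and the model name are placeholders for your own:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_tool(name: str, args: dict) -> str:
    """Placeholder: execute the named tool and return its result as text."""
    raise NotImplementedError

def agent_loop(messages: list, tools: list) -> str:
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model name
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason == "tool_use":
            # Echo the assistant turn, run each requested tool, and send
            # the results back as tool_result blocks in a user turn.
            messages.append({"role": "assistant", "content": response.content})
            results = [
                {
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": run_tool(block.name, block.input),
                }
                for block in response.content
                if block.type == "tool_use"
            ]
            messages.append({"role": "user", "content": results})
            continue
        # end_turn: the model decided it is done, so the loop stops.
        return "".join(b.text for b in response.content if b.type == "text")
```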
Coordinator–subagent: the rules you must know
- Hub-and-spoke architecture. All inter-subagent communication routes through the coordinator. This isn't aesthetic; it's where observability and consistent error handling live.
- Subagents have isolated context. They do not inherit the coordinator's conversation history. The coordinator must pass complete findings (including structured metadata like source URLs, document names, and dates) directly in the subagent's prompt.
- Spawn parallel subagents in a single response. The coordinator emits multiple `Task` tool calls in one turn, not across separate turns. This is how you get the latency win.
- `allowedTools` must include `"Task"` for a coordinator to spawn subagents. If the question describes "the coordinator can't delegate," check this first.
- Coordinator decomposition is the most common failure mode. If subagents complete successfully but the final report misses topic areas, the coordinator's task decomposition was too narrow. Don't blame downstream agents.
- Coordinator prompts specify goals and quality criteria, not procedural steps. "Find comprehensive coverage of X" beats "First do A, then do B, then do C." Subagents need room to adapt.
Hooks vs prompts: the deterministic line
This is the single most-tested distinction in Domain 1. Memorize it cold:
Hooks (programmatic) · for deterministic guarantees (sketch below)
- Identity verification before financial action
- Blocking refunds above a policy threshold
- Forcing `extract_metadata` to run before enrichment
- Normalizing heterogeneous data formats via `PostToolUse`
- Anything where a non-zero failure rate is unacceptable
Prompts (probabilistic) · for guidance and tone
- Style, voice, helpfulness
- Decomposition strategy
- Soft preferences ("prefer concise responses")
- Escalation criteria with few-shot examples
- Anything where best-effort compliance is enough
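Claude Code and the Agent SDK each have their own hook wiring, so treat this as a shape, not an API: the deterministic check is ordinary code on the tool-execution path. Names and the threshold below mirror the refund scenario and are illustrative:

```python
# Illustrative gate on the tool-execution path; REFUND_LIMIT, the tool
# names, and verified_customers are assumptions, not a specific hooks API.
REFUND_LIMIT = 500.00
verified_customers: set[str] = set()  # filled in when get_customer succeeds

def pre_tool_use_gate(tool_name: str, args: dict) -> None:
    """Runs before every tool call. An exception here is a hard guarantee;
    a system-prompt instruction is only a suggestion the model may ignore."""
    if tool_name == "process_refund":
        if args.get("customer_id") not in verified_customers:
            raise PermissionError("blocked: identity not verified before refund")
        if float(args.get("amount", 0)) > REFUND_LIMIT:
            raise PermissionError("blocked: refund exceeds policy; escalate_to_human")
```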
Session management: resume vs fork vs fresh start
- `--resume <session-name>` picks up a named session. Use when prior context is mostly still valid and you're continuing the same line of work.
- `fork_session` creates an independent branch from a shared baseline. Use to explore divergent approaches (testing strategy A vs B, refactor approach 1 vs 2) from the same analysis.
- Fresh start with an injected summary beats resuming when prior tool results are stale. After significant code modifications, telling a resumed session "by the way, file X changed in these specific ways" is better than expecting it to figure that out.
Decomposition strategy
- Prompt chaining (fixed sequential pipeline) for predictable multi-aspect work. Example: per-file local review pass plus a cross-file integration pass.
- Dynamic adaptive decomposition for open-ended investigation. Example: "add comprehensive tests to a legacy codebase" should map structure first, then prioritize, then expand subtasks as dependencies surface.
- Multi-concern messages get decomposed in parallel. A customer with three issues becomes three concurrent investigations sharing context, not three sequential conversations.
Domain 2 · Tool design & MCP integration · 18%
Five task statements. The headline lesson: tool descriptions are the primary mechanism the model uses for tool selection. When two tools have minimal, similar descriptions, the model chooses badly. Half of Domain 2's right answers begin with "improve the descriptions."
What a good tool description contains
- Purpose: what the tool does, in one sentence that distinguishes it from neighbors
- Input format, including examples of valid inputs
- Example queries that should call this tool
- Edge cases and boundaries: when to use this tool vs a similar alternative
- What the output looks like, so the model knows what to do with the result
If the question describes "agent picks the wrong tool" and the answer choices include "expand descriptions to include input formats, example queries, edge cases, and boundaries," that's the answer. Few-shot examples and routing classifiers are over-engineered for this problem.
Splitting vs consolidating tools
A generic analyze_document with overloaded behavior is a smell. Split it into extract_data_points, summarize_content, and verify_claim_against_source. Each gets a defined input/output contract. The exam consistently rewards splitting.
The opposite move (consolidating two tools into one polymorphic lookup_entity) is sometimes architecturally valid but is rarely the correct "first step" answer. The proportionate-response rule applies.
Structured error responses
Every MCP tool failure should return:
- `isError: true` (the MCP flag)
- `errorCategory`: transient / validation / business / permission
- `isRetryable`: boolean
- A human-readable description for the agent (and the end-user via the agent)
Generic "Operation failed" responses are always wrong. They prevent the agent from making appropriate recovery decisions. A business-rule violation (refund exceeds policy) should be marked non-retryable with a customer-friendly explanation; a transient timeout should be marked retryable.
Tool distribution across agents
- Tool count matters. Giving an agent 18 tools when 4–5 would do degrades selection reliability. Scope each agent's toolset to its role.
- Cross-role tools are okay for high-frequency needs. A synthesis agent that constantly needs simple fact-checks gets a scoped `verify_fact` tool, while complex verifications still route through the coordinator.
- Replace generic with constrained. `fetch_url` becomes `load_document` with URL validation; the constrained version prevents misuse.
tool_choice: three modes, one common mistake
"auto": model chooses whether to call a tool or return text. The default."any": model must call some tool but picks which. Use to guarantee structured output when multiple extraction schemas exist.{"type": "tool", "name": "..."}: forced selection of a specific tool. Use to ensure a particular extraction (likeextract_metadata) runs before downstream enrichment.
The trap: people reach for forced selection when "any" would do. Forced selection is for ordering; "any" is for guaranteeing a structured response.
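The three values as dicts, per the Anthropic Messages API, with `extract_metadata` as the example tool:

```python
# "auto": the model may call a tool or just answer in text (the default).
tool_choice_auto = {"type": "auto"}
# "any": the model must call some tool, but chooses which one.
tool_choice_any = {"type": "any"}
# Forced: this specific tool runs now; use it for ordering guarantees.
tool_choice_forced = {"type": "tool", "name": "extract_metadata"}
```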
MCP server scoping
- Project-scoped `.mcp.json`: shared team tooling, version-controlled, with `${ENV_VAR}` expansion for credentials.
- User-scoped `~/.claude.json`: personal experiments that don't belong in the team config.
- Both load simultaneously; their tools are all available to the agent at once.
- MCP resources (not tools) are for content catalogs (issue summaries, schemas, doc hierarchies) that reduce exploratory tool calls.
Built-in Claude Code tools
Know what each one is for:
- Grep: content search. Finding callers, error strings, imports.
- Glob: path pattern matching. `**/*.test.tsx`, file-by-extension.
- Read: load full file contents.
- Write: full file write. The fallback when Edit can't find unique anchor text.
- Edit: targeted modification by unique text match. Fast but fails on ambiguous anchors.
- Bash: shell execution.
The exam-relevant move: build understanding incrementally with Grep first, then follow imports with Read. Don't read every file upfront.
Domain 3 · Claude Code configuration & workflows · 20%
Six task statements. The headline lesson: configuration hierarchy and scoping decisions are the difference between "works on my machine" and "works for the team."
The configuration hierarchy
| Location | Scope | Shared via git? | Use for |
|---|---|---|---|
| `~/.claude/CLAUDE.md` | User | No | Personal preferences only |
| `.claude/CLAUDE.md` or root `CLAUDE.md` | Project | Yes | Team coding standards |
| Subdirectory `CLAUDE.md` | Directory | Yes | Area-specific conventions |
| `.claude/rules/*.md` with `paths:` frontmatter | Glob-pattern | Yes | Conventions that span directories (e.g. all `**/*.test.*`) |
The exam pattern: a question describes a "team member doesn't have the rules" symptom. The answer is "they're in user-scope; move to project-scope." Or: "test files are spread throughout the codebase and need consistent conventions." The answer is path-scoped rules with glob patterns, not a CLAUDE.md per directory.
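For the path-scoped case, a rule file might look like this (saved as, say, `.claude/rules/testing.md`; the conventions themselves are placeholders):

```markdown
---
paths: ["**/*.test.*"]
---
- Use the shared fixture factories; never inline test data.
- One behavior per test; name the test for the behavior, not the method.
```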
Slash commands vs skills
- Slash commands in `.claude/commands/` (project) or `~/.claude/commands/` (user). For team-wide, version-controlled commands, use project scope.
- Skills in `.claude/skills/`, each defined by a `SKILL.md` with frontmatter:
  - `context: fork`: runs in an isolated sub-agent context; output doesn't pollute the main conversation. Use for verbose discovery or brainstorming.
  - `allowed-tools`: restrict tool access during the skill (e.g., disallow destructive Bash).
  - `argument-hint`: prompt the developer for required parameters when the skill is invoked without args.
- Skills vs CLAUDE.md. Skills are on-demand, task-specific. CLAUDE.md is always-loaded universal standards. Don't put skills' contents in CLAUDE.md, and don't put always-on rules in a skill.
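A sketch of a skill's frontmatter using the three fields above; the skill name, tool list, body, and exact value syntax are assumptions:

```markdown
---
name: explore-deps
context: fork            # isolated sub-agent; output stays out of the main conversation
allowed-tools: Read, Grep, Glob   # no destructive Bash during discovery
argument-hint: "package or module to trace"
---
Trace every usage of the given package across the repo and report a short
summary of call sites. Do not paste full file contents into the report.
```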
Plan mode vs direct execution
Plan mode
- Architectural decisions
- Multi-file refactors (45+ files)
- Library migrations
- Multiple valid approaches
- Open-ended investigation
Direct execution
- Single-file bug fix with a clear stack trace
- Adding one validation conditional
- Well-understood scope
- One reasonable implementation
The Explore subagent isolates verbose discovery and returns summaries to preserve main context. Use it when an exploration phase will dump pages of file contents the main agent doesn't need to keep around.
Iterative refinement: when to do what
- Concrete input/output examples. The most effective fix when prose descriptions are interpreted inconsistently. Two or three example pairs.
- Test-driven iteration. Write the test suite first, then iterate by sharing test failures.
- Interview pattern. Have Claude ask clarifying questions before implementing in unfamiliar domains. Surfaces considerations you'd miss.
- Single message vs sequential. When fixes interact (e.g., changing one schema affects three call-sites), put all issues in one message. When they're independent, sequential is fine.
Claude Code in CI/CD
- `-p` (or `--print`): non-interactive mode. The fix when the CI job hangs waiting for input. Don't reach for `CLAUDE_HEADLESS=true` or `--batch`; those are not real flags.
- `--output-format json --json-schema`: structured findings parseable for automated PR comments.
- Independent review instances beat self-review. The same session that generated code is biased toward defending its decisions. A second instance without the generator's reasoning context catches more.
- Per-file passes plus a cross-file integration pass. This is the fix for "single-pass review of 14 files gives shallow, inconsistent feedback." Attention dilution is the root cause; bigger context windows don't help.
- Including prior review findings when re-running on new commits prevents duplicate comments. Instruct Claude to report only new or still-unaddressed issues.
- CLAUDE.md is how CI gets project context: testing standards, fixture conventions, valuable-test criteria. Without it, generated tests are shallow.
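Put together, a CI step might call Claude Code like this; the flags are the ones above, while the prompt text and schema filename are assumptions:

```bash
# Sketch of a non-interactive review step with structured output.
claude -p "Review this PR's diff; report only new or still-unaddressed issues." \
  --output-format json \
  --json-schema review-findings.schema.json \
  > findings.json   # parsed downstream to post inline PR comments
```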
Domain 4 · Prompt engineering & structured output · 20%
Six task statements. The headline lesson: specific criteria and few-shot examples beat vague guidance every time, and tool_use with JSON schemas is the only reliable path to structured output.
Explicit criteria over vague instructions
"Be conservative" and "only report high-confidence findings" do nothing for precision. The right shape is explicit categorical criteria with concrete code examples for each severity level. "Flag a comment only when the claimed behavior contradicts the actual code behavior" is the form the exam rewards.
Corollary: when false-positive rates are high in one category, you can temporarily disable that category to restore developer trust while you improve the prompt for it. Don't let one noisy category undermine confidence in the accurate ones.
Few-shot examples: when and how many
- 2–4 targeted examples for ambiguous scenarios (not 10+).
- Show the reasoning for why one action was chosen over plausible alternatives. Don't just show input → output; show the judgment.
- Include examples that distinguish acceptable patterns from genuine issues. This is how you reduce false positives while keeping generalization.
- Examples are most effective for: ambiguous tool selection, format consistency, varied document structures (inline citations vs bibliographies), and edge cases like null fields.
tool_use with JSON schemas: the only reliable path
Schema-compliant structured output requires tool_use. Free-form "respond in JSON format" prompts produce syntax errors. tool_use eliminates them.
- Strict schemas eliminate syntax errors but not semantic errors (line items that don't sum to the total, values placed in the wrong field). Build separate validation for the semantic layer.
- Required vs optional (nullable) fields matter. If a source document may not contain a piece of information, mark the field nullable. A required field tells the model to fabricate a value to satisfy the schema.
- Enum + "other" + detail string is the right pattern for extensible categorization. Adding
"unclear"as an enum value is the right pattern for ambiguous cases. - Format normalization rules live in the prompt alongside the strict schema. The schema doesn't normalize; the prompt does.
Validation, retry, feedback loops
- Retry with the specific validation error appended to the prompt. The model corrects format/structural errors well when told what failed.
- Retries don't help when the information is absent from the source. If a required field can't be filled because the source doc didn't include it, no amount of retrying creates the data. Mark the field nullable instead.
- `detected_pattern` field. A structured-output trick: tag each finding with the code construct that triggered it, so when developers dismiss findings you can analyze the dismissal patterns systematically.
- Self-correction validators. Extract `calculated_total` alongside `stated_total` to flag discrepancies. Add `conflict_detected` booleans to mark inconsistent source data.
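The retry loop in sketch form; `extract()` and `validate()` stand in for your tool_use call and semantic validators:

```python
def extract(document: str) -> dict:      # hypothetical: the tool_use call above
    ...

def validate(data: dict) -> list[str]:   # hypothetical: returns [] when clean
    ...

def extract_with_retry(document: str, max_attempts: int = 3) -> dict:
    """Re-prompt with the specific validation failure appended.
    Format errors usually resolve on the next pass; data genuinely
    absent from the source never will (mark those fields nullable)."""
    feedback = ""
    for _ in range(max_attempts):
        data = extract(document + feedback)
        problems = validate(data)
        if not problems:
            return data
        feedback = (
            "\n\nYour previous extraction failed validation:\n- "
            + "\n- ".join(problems)
        )
    raise ValueError(f"unresolved after {max_attempts} attempts: {problems}")
```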
Batch processing: when it's right and when it's wrong
| Workflow | API choice | Why |
|---|---|---|
| Blocking pre-merge check | Synchronous | Developer is waiting; up to 24h is unacceptable |
| Overnight technical debt report | Message Batches | 50% cost saving; latency tolerant |
| Weekly audit | Message Batches | Same |
| Real-time customer support agent | Synchronous | Multi-turn tool calling not supported in batch |
The Message Batches API gives 50% cost savings and a processing window up to 24 hours with no latency SLA. It does not support multi-turn tool calling. Use custom_id to correlate request/response pairs and to identify failed documents for resubmission.
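In sketch form with the Anthropic SDK; the corpus loader, resubmission bookkeeping, and model name are assumptions:

```python
import anthropic

client = anthropic.Anthropic()

docs = {"doc-0001": "...", "doc-0002": "..."}  # hypothetical corpus keyed by custom_id

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": doc_id,  # your correlation key for results and resubmission
            "params": {
                "model": "claude-sonnet-4-5",  # placeholder model name
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": f"Audit this file:\n{text}"}],
            },
        }
        for doc_id, text in docs.items()
    ]
)

# Later, once the batch has ended (anywhere inside the 24h window),
# match results by custom_id; failures become the resubmission list.
failed = [
    r.custom_id
    for r in client.messages.batches.results(batch.id)
    if r.result.type != "succeeded"
]
```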
Multi-instance and multi-pass review
- Self-review is biased. A model that just generated code retains reasoning context that makes it less likely to question its own decisions. Use a second instance.
- Per-file passes + cross-file integration pass beats a single pass over many files.
- Verification passes with self-reported confidence per finding enable calibrated review routing. The calibration belongs in the routing logic, not in trusting the score.
Domain 5 · Context management & reliability · 15%
Six task statements. The smallest domain by weight, but the one with the highest density of "anti-patterns that are always wrong" gotchas.
Context management for long interactions
- Progressive summarization loses numbers. Amounts, dates, percentages, customer-stated expectations get vague-ified. Extract them into a persistent "case facts" block included in every prompt outside the summarized history.
- Lost-in-the-middle. Models reliably process input at the start and end. Put key findings summaries at the top of aggregated inputs; use explicit section headers for detailed results.
- Trim verbose tool outputs. An order lookup with 40+ fields, of which only 5 are relevant to the current task, should be trimmed before accumulating in context.
- Pass complete history. Subsequent API requests need the full conversation to maintain coherence. There's no server-side session.
- Restructure upstream agents to return structured data (key facts, citations, relevance scores) instead of verbose content when downstream agents have limited context budgets.
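One way to implement the persistent case-facts block from the first bullet above; the fields are illustrative:

```python
# Exact figures are extracted once into case_facts and pinned into every
# prompt, so progressive summarization can never vague-ify them.
case_facts = {
    "order_id": "#1003",
    "refund_requested": "$750.00",
    "purchase_date": "2025-11-02",
    "customer_expectation": "full refund within 5 business days",
}

def build_prompt(history_summary: str, new_message: str) -> str:
    facts = "\n".join(f"- {k}: {v}" for k, v in case_facts.items())
    return (
        f"CASE FACTS (authoritative; never restate from memory):\n{facts}\n\n"
        f"CONVERSATION SO FAR (summarized):\n{history_summary}\n\n"
        f"NEW MESSAGE:\n{new_message}"
    )
```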
Escalation: the rules
- Honor explicit customer requests for a human immediately. Don't first try to investigate. Don't "see if I can help with that." Just escalate.
- Acknowledge frustration, then offer resolution if the issue is in scope. Escalate only if the customer reiterates the preference. (This is the nuanced case: explicit demand → escalate; implicit frustration with an in-scope issue → offer to help, then escalate if pushed.)
- Escalate when policy is silent on the customer's specific situation (competitor price-matching when policy only addresses own-site adjustments).
- Multiple matches → ask for more identifiers. Don't pick by heuristic.
- Escalation criteria belong in the system prompt with few-shot examples. Not in a separate classifier model. Not in self-reported confidence. Not in sentiment analysis.
Error propagation in multi-agent systems
- Structured error context (failure type, attempted query, partial results, alternative approaches) enables intelligent coordinator recovery.
- Distinguish access failures from valid empty results.
- Local recovery in subagents for transient failures. Only propagate what can't be resolved locally, with what was attempted and any partial results.
- Coverage-annotated synthesis output. Mark which findings are well-supported and which topic areas have gaps due to unavailable sources. Don't paper over gaps.
Large-codebase exploration
- Context degrades in long sessions. The model starts referencing "typical patterns" instead of the specific classes it found earlier. Counter with scratchpad files.
- Subagent delegation isolates verbose discovery from the main coordination context.
- Structured state exports (manifests) for crash recovery. Each agent dumps state to a known location; the coordinator loads the manifest on resume.
- `/compact` reduces context usage when extended exploration has filled it.
Human review and confidence calibration
- Aggregate accuracy hides per-segment failure. 97% overall can mask 60% on one document type. Stratify by document type and field before automating.
- Stratified random sampling of high-confidence extractions catches novel error patterns over time.
- Field-level confidence calibrated against a labeled validation set is the right routing signal. Raw model self-reports are not.
- Route low-confidence and ambiguous-source extractions to human review. Reviewer capacity is finite; spend it on uncertainty.
Multi-source synthesis: provenance and conflicts
- Claim-source mappings travel with findings through every stage of synthesis. Source URLs, document names, excerpts, dates.
- Conflicting statistics: annotate with attribution. Don't pick one. Distinguish well-established findings from contested ones in the report.
- Publication and collection dates are mandatory in structured outputs. Without them, temporal differences look like contradictions.
- Render content types appropriately. Financial data as tables, news as prose, technical findings as structured lists. Don't force everything into one shape.
Question patterns the exam rewards
After you work through the sample questions, you'll notice a few question grammars repeat. If you can recognize them, you can answer in 30 seconds instead of two minutes.
Pattern A · "Production data shows X is happening N% of the time. What change would most effectively address this?"
The right answer is the smallest mechanism that addresses the root cause. Read the cause carefully. Is it a missing prerequisite (programmatic hook), a tool-selection ambiguity (better descriptions), or an unclear decision boundary (explicit criteria with few-shot)? The wrong answers will be (a) heavier infrastructure than needed, (b) prompt-only fixes for problems that need determinism, or (c) addressing the wrong layer (tool availability when the problem is tool ordering).
Pattern B · "All subagents complete successfully but the output is wrong. Root cause?"
Almost always the coordinator's task decomposition was too narrow. Don't blame downstream agents that worked correctly within their assigned scope. Read the coordinator's logs in the question; they reveal the answer directly.
Pattern C · "Latency is too high; how do we reduce overhead?"
Look for the principle of least privilege. Give the agent that needs frequent simple lookups a scoped tool for the common case; keep complex cases routing through the coordinator. Don't over-provision (give it all the web search tools), don't speculatively cache, don't batch in a way that creates blocking dependencies.
Pattern D · "How should this failure flow back?"
Structured error context with failure type, attempted query, partial results, and alternatives. Generic statuses, suppressed errors marked as success, and workflow-terminating exceptions are all anti-patterns.
Pattern E · "Where should this configuration live?"
If it should reach every developer via git: project-scope (.claude/...). If it's personal: user-scope (~/.claude/...). If it applies to files spread across directories: path-scoped rules with glob patterns. If it should run on demand and not pollute context: a skill with context: fork.
Pattern F · "Single-pass review of N files is inconsistent."
Per-file passes for local issues + a separate cross-file integration pass. Not a bigger model, not consensus voting, not pushing the burden to developers.
Pattern G · "Use Message Batches API for both workflows?"
Match each workflow to the API. Batch for non-blocking, latency-tolerant. Synchronous for blocking. Never both with a "fallback if too slow." That's added complexity for no gain.
The 10-hour lab plan
Reading the exam guide gets you maybe 60% of the way. The other 40% is the muscle memory that comes from actually building the things. This lab plan is structured to give you concrete reps on every domain in roughly ten hours of build time. It's organized as four labs that compose into one realistic system.
Fast pass for the three scaffolded labs (1, 2, and 4): the working scaffolds live at github.com/darindeters/claude-architect-labs. Clone the repo, pick a lab, run the boot command. Each lab boots in 30 seconds and the TODO blocks map exactly to the numbered steps below.
Lab 1: A customer support agent with a hook-enforced prerequisite
≈ 3 hours · domains 1, 2, 5
Goal: Build the customer support agent from Scenario 1. By the end, you will have written code that demonstrates every Domain 1 distinction the exam tests.
Skip the plumbing, keep the lessons
Working agent loop, four MCP-style tools with deliberately weak descriptions, hook placeholders, fixtures with ten test customers, and a runnable `python -m src.agent` entry point. The TODOs in `tools.py`, `hooks.py`, and `case_facts.py` are exactly the steps below. Boots in 30 seconds.
git clone https://github.com/darindeters/claude-architect-labs && cd claude-architect-labs/lab1-cs-agent
- Account & budget. An Anthropic Console account with API access. Budget about $5; you'll burn most of it during the misroute-then-fix exercise in step 2.
- Pick one language. Python (recommended; the Agent SDK examples are densest here) or TypeScript. Don't flip mid-lab.
- Install:

```bash
# Python (recommended)
python -m venv .venv && source .venv/bin/activate
pip install anthropic claude-agent-sdk

# or TypeScript
npm init -y && npm i @anthropic-ai/claude-agent-sdk @anthropic-ai/sdk
```

- Auth: `export ANTHROPIC_API_KEY=sk-ant-...` in the same shell. Don't put it in your code.
- Scaffold three files before writing any agent logic:
  - `tools.py`: fake `get_customer`, `lookup_order`, `process_refund`, `escalate_to_human` that return hard-coded data for ten test customers. Don't bother with a real DB; a Python dict is enough.
  - `hooks.py`: empty for now; the prerequisite and threshold hooks land here in steps 3 and 4.
  - `agent.py`: the loop and the `main()` entry point.
- Test fixtures: ten fake customer records in JSON with at least one customer who has multiple orders, one with no orders, and one whose default refund would exceed $500. You need these edge cases to verify the hooks in step 4.
- Reference docs: the Agent SDK overview and the MCP spec. Skim, don't read end-to-end.
1. The agent loop. Write the loop that sends a request, branches on `stop_reason`, executes any requested tools, appends results to history, and continues until `stop_reason == "end_turn"`. Resist the urge to add an iteration cap as the primary stop condition. Add one as a safety net only.
2. The four tools. Implement `get_customer`, `lookup_order`, `process_refund`, and `escalate_to_human`. Write deliberately vague descriptions on the first pass and observe the agent misroute. Then expand each description with input formats, example queries, and explicit "use this when…" / "don't use this when…" boundaries. The improvement is the lesson.
3. The prerequisite hook. Block `process_refund` until `get_customer` has returned a verified customer ID. Test that the agent can't bypass it even with a system prompt that says "skip verification." This is the deterministic-vs-probabilistic distinction.
4. The threshold hook. Intercept `process_refund` calls and block any over $500, redirecting to `escalate_to_human` with a structured handoff (customer ID, root cause, refund amount, recommendation).
5. Structured errors. Every tool failure returns `isError`, `errorCategory`, `isRetryable`, and a customer-friendly message. Force a transient failure (random 30% timeout) and observe the agent retry. Force a business-rule failure and observe it explain rather than retry.
Done when:
- Running `python agent.py "I need a refund on order #1003"` walks through `get_customer` → `lookup_order` → `process_refund` end-to-end without errors.
- Asking for a refund of $750 triggers `escalate_to_human` instead of `process_refund`, even if you change the system prompt to insist on auto-approving.
- Commenting out the prerequisite hook reveals the agent occasionally skipping `get_customer`; restoring it makes the violation impossible. That comparison is the lesson.
- A 30%-rate transient failure (you can simulate by raising an exception sometimes from `lookup_order`) results in retry behavior; a business-rule failure (e.g. "refund window expired") results in the agent explaining to the customer rather than retrying.
Lab 2: Multi-agent research with parallel subagents and provenance
≈ 3 hours · domains 1, 2, 5
Goal: Build the research system from Scenario 3. By the end, you will have lived through the coordinator-decomposition failure mode, the parallel-spawn latency win, and the provenance-loss anti-pattern.
Skip the plumbing, keep the lessons
Coordinator + four AgentDefinitions wired up. Includes both procedural and goals-and-criteria coordinator prompts so you can run the failure case directly. Five fake search results, four fixture documents (two with a planted contradiction on a 2024 vs 2025 statistic, which is a temporal difference rather than real disagreement), and timeable parallel/sequential spawning toggle.
git clone https://github.com/darindeters/claude-architect-labs && cd claude-architect-labs/lab2-research
- Reuse Lab 1's environment. Same `.venv`, same `ANTHROPIC_API_KEY`, same SDK install. Just create a new directory `lab2-research/` alongside it.
- Budget ~$10. Multi-agent runs make more API calls than Lab 1; the parallel-spawn comparison alone runs the system twice.
- Don't try to integrate real web search yet. Fake the `web_search` tool with hard-coded JSON results for one or two seed queries. The exam isn't about search APIs; it's about coordinator decisions and information flow.
- Scaffold five files before adding logic:
  - `fixtures/search_results.json`: five fake search hits per query, each with `{ url, title, snippet, publication_date }`.
  - `fixtures/documents/`: three to five plain-text documents (paste from Wikipedia or a public arXiv abstract). Two should contradict each other on a specific statistic; you'll need that for step 5.
  - `subagents.py`: four `AgentDefinition`s: `web_search`, `doc_analysis`, `synthesis`, `report_gen`. Each gets a tight `allowedTools` list scoped to its role.
  - `coordinator.py`: the orchestrator. Its `allowedTools` must include `"Task"` or it can't spawn anything.
  - `main.py`: the CLI entry: `python main.py "impact of AI on creative industries"`.
- Pick a research topic that's broader than it looks. "Impact of AI on creative industries" is a great test case because lazy decomposition will miss music, writing, and film. That's exactly the failure mode step 3 forces you to confront.
- Reference docs: the Agent SDK section on subagents and the `Task` tool. Note that subagents do not inherit conversation history; you must pass everything in the prompt.
1. Define the four subagents, each an `AgentDefinition` with a tight `allowedTools` set. The coordinator's `allowedTools` must include `"Task"`.
2. Spawn all four subagents as `Task` calls in a single response. Time it. Then refactor to sequential and time again. The latency delta is the lesson.
3. Run the decomposition failure case: drive the coordinator with the procedural prompt and watch the report miss topic areas, then rerun with the goals-and-criteria prompt.
4. Thread provenance through every stage: each finding travels as `{ claim, evidence_excerpt, source_url, source_name, publication_date }`. Synthesis preserves the mapping through to the final report.
5. Feed in the two fixture documents with the planted contradiction and verify the report carries both values, annotated with attribution.
6. Give a scoped `verify_fact` tool to synthesis for frequent simple checks, keeping complex verifications routed through the coordinator.

Done when:
- One coordinator request produces a final report citing at least three distinct sources, each tagged with URL and publication date.
- Timing the parallel-spawn version vs sequential: the parallel version should be at least 2× faster on a four-subagent fan-out. (If it isn't, you're spawning across turns instead of in one response.)
- The decomposition failure case: with a procedural coordinator prompt, the report misses obvious topic areas. With a goals-and-criteria prompt, it covers them. Both runs are the lesson.
- A planted contradiction in two fixture documents shows up in the final report as two annotated values with attribution, not one value silently chosen.
- Killing the `web_search` subagent mid-run produces a final report that explicitly notes coverage gaps, rather than terminating the workflow or silently producing a partial answer marked as success.
Lab 3: Claude Code for a real team workflow
≈ 2 hours · domains 2, 3
Goal: Configure a project the way Domain 3 questions assume it's configured. Pick any repo you have write access to.
CLAUDE.md at three levels
- Root `CLAUDE.md` with universal team standards
- One subdirectory `CLAUDE.md` for area-specific conventions
- Test that running Claude in different directories pulls the right merge of rules via `/memory`

`.claude/rules/` with globs
- One rule file with `paths: ["**/*.test.*"]` for test conventions
- One with `paths: ["src/api/**/*"]` for API handler conventions
- Edit a test file outside the API directory; verify only the test rule loads

Project-scoped `/review`
- Create `.claude/commands/review.md`
- Commit it; have a teammate clone and verify it's available
- Compare to a personal command in `~/.claude/commands/`

An isolated skill
- `SKILL.md` with `context: fork` and a restrictive `allowed-tools`
- Add `argument-hint` for required parameters
- Verify its output doesn't pollute the main conversation

Project + user MCP servers
- One MCP server in `.mcp.json` with `${API_TOKEN}` expansion
- One personal experimental server in `~/.claude.json`
- Verify both load simultaneously and tools from both are available

Three tasks of varying scope
- A single-file bug fix with a clear stack trace → direct execution
- A library migration touching dozens of files → plan mode
- An open-ended new feature → plan mode (then direct-execute the plan)
Lab 4: Structured extraction with validation and batch
≈ 2 hours · domains 4, 5
Goal: Build the extraction system from Scenario 6. By the end, you will have practiced every Domain 4 schema-design decision the exam tests.
Skip the plumbing, keep the lessons
Pydantic schema with optional/nullable, enum + "other" + detail, and an "unclear" enum value. Tool_use call with forced tool_choice. Ten invoice fixtures including five clean ones and five planted edge cases (missing fields, bad totals, informal dates, off-enum category, internal contradiction). Validators and batch are stubbed for steps 5 and 6.
git clone https://github.com/darindeters/claude-architect-labs && cd claude-architect-labs/lab4-extract
- Reuse the environment from Lab 1. New directory `lab4-extract/`; same `.venv`, same API key. Add Pydantic: `pip install anthropic pydantic`
- Pick one document type and stick with it. The exam questions assume you know your corpus; switching types mid-lab burns time. Three good choices, in order of difficulty:
- Invoices (easiest): clear required fields (vendor, line items, totals, dates). Find sample PDFs on the IRS site or invoice-template repos.
- Scientific abstracts: varied structure (inline citations vs bibliographies) is great practice for step 3's few-shot examples.
- SEC 10-K excerpts (hardest): long, heterogeneous, lots of nullable fields. Skip on first pass unless you're comfortable.
- Don't start with the batch API. Get the synchronous flow working on five documents first; step 6 introduces batching only after the schema is reliable. Most lab failures come from people skipping ahead to batching and debugging two layers at once.
- Scaffold four files before extracting anything:
  - `schema.py`: Pydantic models (see the sketch just below this list). Required fields are exceptions; default to optional/nullable so you can demonstrate the "fabrication when required" failure mode.
  - `extract.py`: single-document `messages.create` call with `tool_choice` forcing your extraction tool. Returns a Pydantic model.
  - `validate.py`: semantic validators (e.g., `line_items.sum() == stated_total`; conflict detection between fields).
  - `batch.py`: last. Message Batches client. Don't open this file until `extract.py` is reliable on at least 5/5 documents.
- Sample corpus: ten to twenty source documents in a `fixtures/` directory. Plant deliberate problems: one with absent required info (to test step 2's nullable behavior), one with totals that don't add up (step 5), one with informal date formatting (step 3's normalization).
- Reference docs: the tool use guide and the Message Batches API reference. Note Batches has no multi-turn tool calling; you can't do validation-retry inside a batch request.
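A sketch of the `schema.py` shape, assuming Pydantic; the field names are illustrative:

```python
from typing import Literal, Optional
from pydantic import BaseModel

class LineItem(BaseModel):
    description: str
    amount: Optional[float] = None           # nullable: the source may omit it

class Invoice(BaseModel):
    vendor: str                              # genuinely required
    stated_total: Optional[float] = None     # nullable beats fabricated
    # Enum + "other" + detail string; "unclear" covers ambiguous cases.
    category: Literal["hardware", "software", "services", "unclear", "other"]
    category_detail: Optional[str] = None    # pairs with category == "other"
    line_items: list[LineItem] = []
```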
"other" value paired with a detail string field. Use tool_choice set to force this specific tool first.
calculated_total and stated_total; flag discrepancies. Add a conflict_detected boolean for inconsistent source data.
custom_id per request. Calculate the submission cadence to guarantee a 30-hour SLA given the 24-hour processing window. Resubmit any failures by custom_id, chunking ones that exceeded context limits.
Done when:
- `extract.py` runs cleanly on at least 8 of your 10 source documents and returns a valid Pydantic model.
- The document with deliberately-absent required info: optional fields return `null`; flip them to required and re-run, watch the model fabricate plausible values to satisfy the schema. That side-by-side is the lesson.
- The validation-retry loop visibly resolves a format error (e.g., a date in `"03/14/26"` format vs ISO) on the second try. The same loop visibly fails to resolve when the source document genuinely lacks a required field, proving retry isn't magic.
- The planted "totals don't add up" document gets caught by your semantic validator and routed to the `conflict_detected` path rather than passing through silently.
- Your batch run on 100 documents (step 6) handles failures by `custom_id`: failed documents are identifiable, resubmittable, and you can show the SLA math (24-hour processing window vs your target).
- Stratified accuracy measurement reveals at least one segment that performs measurably worse than the aggregate. If everything is uniformly excellent, you didn't plant enough variety in the corpus.
Optional Lab 5: A CI/CD review pipeline
≈ 1 hour · domains 3, 4
If you have the appetite: wire Claude Code into a GitHub Actions workflow with claude -p --output-format json --json-schema, post structured findings as inline PR comments, and re-run on subsequent commits while passing prior findings to avoid duplicates. This makes the Scenario 5 questions feel familiar instead of theoretical.
What is NOT on the exam
The exam guide is explicit about what's out of scope. Don't waste prep time on these:
- Fine-tuning, training, or model weights
- Authentication, billing, account management, OAuth flows, API key rotation
- Deploying or hosting MCP servers (infrastructure, networking, container orchestration)
- Constitutional AI, RLHF, safety-training methodology
- Embedding models or vector database internals
- Computer use (browser automation, desktop interaction)
- Vision and image analysis
- Streaming API implementation, server-sent events
- Rate limiting, quotas, pricing math
- Specific cloud provider configs (AWS, GCP, Azure)
- Performance benchmarking, model comparison metrics
- Prompt caching internals (beyond knowing it exists)
- Token counting algorithms or tokenization specifics
This list is generous. If a prep resource is heavy on any of the above, it isn't aligned with this exam.
Day-of strategy
Reading the question
- The scenario is one of six. Recognize it in the first sentence. Then go to the question.
- Identify which domain is being tested. Each scenario maps to a primary set of domains; the question will be in one of them.
- Look for the "production data shows…" or "logs show…" framing. The setup usually contains the root cause if you read carefully. The right answer addresses what the logs say happened, not what you'd guess.
- Ask the five judgments from the opening section: deterministic-when-required, model-driven-decisions, proportionate-first-response, isolated-subagent-context, structure-beats-summary.
Reading the answers
- Eliminate the always-wrong distractors first. Sentiment routing, self-reported confidence as a routing signal, "use a bigger model," "suppress the error and return success," "terminate the workflow." If any of those appear, scratch them.
- Eliminate over-engineered answers. Separate classifier model, ML pipeline, custom routing layer: these are usually wrong when a smaller mechanism would do. Unless the question explicitly says "we've tried the simple thing and it didn't work."
- Eliminate "addresses the wrong layer" answers. If the problem is tool ordering, an answer about tool availability is wrong. If the problem is decomposition, an answer about the synthesis agent is wrong.
- Of the remaining one or two, pick the smallest mechanism that fixes the root cause.
Time and pacing
- There's no penalty for guessing. Answer everything. An unanswered question is scored as incorrect.
- If a question is taking more than two minutes, mark and move on. Come back with fresh eyes.
- Trust your first read on the pattern questions (Patterns A–G). Second-guessing usually moves you toward over-engineered distractors.
The week before
- Re-read the six scenarios until you can recall each one cold.
- Re-read the sample questions and explanations. The explanations teach the test's grammar.
- Run through your Lab 1 and Lab 2 code one more time. The pattern questions become trivial when the code is fresh.
- Take the practice exam if you have access. It's the highest-leverage hour you can spend.
Closing
This exam tests one thing more than any other: do you have the disposition of someone who has shipped agentic systems to production? The candidate who has lived through a misrouted refund, a coordinator that decomposed too narrowly, or an extraction schema that hallucinated to satisfy a required field: that candidate already knows most of the answers.
The exam is generous to the practitioner and unforgiving to the memorizer. Build the labs. The questions will look like things you've already debugged. — TWD
If you internalize the five judgments, recognize the seven question patterns, and complete the four labs, the rest is reading the question carefully and matching the answer to the smallest mechanism that fixes the root cause. That's the test. Good luck.
Companion reading on the same site: Best Practices covers the model-agnostic disposition this exam tests. Claude Code covers the tool you'll use in Domain 3.