When a senior AI engineering interview opens with “design an agentic system that does X,” the interviewer is testing whether you have a structured way to decompose the problem, whether you have shipped one of these before, and whether you can reason about trade-offs without getting dogmatic. Your answer needs a clear seven-part structure, real failure-mode thinking, and the discipline to stop and let the interviewer steer.
This post is the playbook for that round. If you have not read Agentic AI Interview Questions: 30 Real Questions with Production Answers, start there for the full interview surface. This piece zooms in on the single question that decides most senior loops.
Material here mirrors the Interview Bootcamp chapter of Designing Enterprise Agentic AI Systems, which covers each section in depth with worked examples from real engagements.
What the interviewer is actually asking
The prompt is “design an agentic system.” The unspoken version is “prove to me you have done this and would not blow up our infra.”
Senior interviewers in 2026 use this question as a stress test on three orthogonal signals:
- Structured decomposition. Can you carve the system into the right components without missing the boring ones (the tool registry, the memory layer, the eval harness, the observability stack)?
- Production scar tissue. Do you talk about failure modes (loops, tool misuse, prompt injection, cost runaway) the way someone who has been paged at 2 a.m. talks about them, or the way someone who read a blog post talks about them?
- Trade-off reasoning. When the interviewer pushes on ReAct vs Plan-and-Execute, single vs multi-agent, framework vs raw Python, do you give a religious answer or a contextual one?
Almost every follow-up question maps to one of those three. If your answer hits all three on the first pass, the rest of the round becomes a conversation, not a viva.
The seven-part structure
You are going to answer in seven parts. Memorize the headers, not a script. The headers give you something to navigate back to when an interviewer pulls you down a branch and you need to surface again.
1. Restate the problem and the constraints.
2. Define the agent’s job and its tools.
3. Sketch the architecture.
4. Pick the agent pattern.
5. Address reliability, cost, and evaluation.
6. Address safety and bounded autonomy.
7. Name what you would build first.
That ordering matters. Candidates who skip step 1 design for a problem the interviewer did not ask about. Candidates who skip step 7 sound like architects, not operators. Each section below covers what to say, why it matters, and the trapdoors.
1. Restate the problem and the constraints
Take 60 seconds. Repeat the prompt in your own words and surface the constraints the interviewer did not state.
A good restatement names: the user, the goal, the inputs the agent receives, the outputs it must produce, and the operating environment (interactive vs batch, public vs internal, sync vs async). Then you ask two or three clarifying questions about volume, latency, cost ceiling, and risk tolerance.
Example, for the prompt “design an agentic system that helps customer-support engineers resolve tickets”:
“So we have a CSE who picks up an inbound ticket, and the agent should help them resolve it faster, ideally with a draft response or a recommended action. Inputs are the ticket text, the customer’s account context, and prior tickets. Output is either a draft reply for the CSE to review or a recommended next step (escalate, refund, create a bug). Before I go further: are we optimizing for time-to-resolution, deflection rate, or CSE satisfaction? And what is the cost ceiling per ticket, roughly?”
That paragraph does three things at once: it shows you can listen, it surfaces the trade-off you are about to make, and it gives the interviewer a chance to redirect before you spend 20 minutes on the wrong problem.
The trapdoor: do not ask twelve clarifying questions. Two or three, then commit. Interviewers grade decisiveness too.
2. Define the agent’s job and its tools
State the agent’s job in one sentence, then list its tools.
“The agent’s job is to read a ticket, gather context, and either draft a reply or recommend an action. To do that it needs: a ticket search tool, an account-lookup tool, a knowledge-base search tool, a refund-policy lookup, an escalation tool, and a draft-reply tool. Each tool has a typed schema, an auth scope, and a timeout.”
This is the section where most candidates blur “the agent” and “what the agent can do” together. Keep them separate. The agent’s job is a sentence. The tool registry is a list.
Two things to call out explicitly here:
- Tool schemas are typed contracts, not free text. If you describe tools as “the model just calls them,” you signal you have not built one. Real tools have typed parameters, validation, timeouts, retries, and rate limits.
- Tools include reads and writes, and you should treat them differently. A read tool is cheap to retry. A write tool (refund, escalate, send-email) needs guardrails: idempotency, human approval, or both. Naming this distinction unprompted moves you up a band.
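If the interviewer asks what a "typed contract" looks like, a few lines on the whiteboard are enough. The sketch below is one possible shape, not a reference implementation: `ToolSpec`, `ToolRegistry`, the field names, and the two example tools are all hypothetical, and the type check is a deliberately simple `isinstance` test rather than a full schema validator.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    """A typed contract for one tool (all field names are illustrative)."""
    name: str
    handler: Callable[..., Any]
    params: dict[str, type]          # parameter name -> required Python type
    auth_scope: str
    timeout_s: float
    is_write: bool = False           # writes get stricter treatment than reads
    requires_approval: bool = False  # human sign-off before a write executes


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        if spec.name in self._tools:
            raise ValueError(f"duplicate tool: {spec.name}")
        self._tools[spec.name] = spec

    def validate(self, name: str, args: dict[str, Any]) -> ToolSpec:
        """Reject unknown tools, missing or extra parameters, and wrong types."""
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        spec = self._tools[name]
        if set(args) != set(spec.params):
            raise ValueError(f"{name}: expected {sorted(spec.params)}, got {sorted(args)}")
        for key, expected in spec.params.items():
            if not isinstance(args[key], expected):
                raise TypeError(f"{name}.{key}: expected {expected.__name__}")
        return spec


# Hypothetical tools for the support example: one read, one guarded write.
registry = ToolRegistry()
registry.register(ToolSpec(
    name="account_lookup",
    handler=lambda account_id: {"id": account_id, "plan": "pro"},
    params={"account_id": str},
    auth_scope="accounts:read",
    timeout_s=5.0,
))
registry.register(ToolSpec(
    name="issue_refund",
    handler=lambda account_id, amount_cents: {"refunded": amount_cents},
    params={"account_id": str, "amount_cents": int},
    auth_scope="billing:write",
    timeout_s=10.0,
    is_write=True,
    requires_approval=True,
))
```

The point to make out loud is that the model's proposed call goes through `validate` before anything runs, and that the read/write flag lives on the contract rather than in the prompt.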
For the deeper version of how function calling actually works under the hood, see The Agent Loop, Explained.
3. Sketch the architecture
Draw seven boxes. Connect them with arrows. Talk while you draw.
          +---------+        +------------------+
user -->  | planner | <----> |  tool registry   |
          +---------+        +------------------+
               |                      |
               v                      v
          +---------+        +------------------+
          | memory  | <----> |     executor     |
          +---------+        +------------------+
               |                      |
               v                      v
          +------------------+------------------+
          | observability    |      safety      |
          +------------------+------------------+
The seven things you must name:
- User / surface. Where the goal enters the system. CLI, chat UI, Slack bot, API call.
- Planner. The model plus the system prompt. Picks the next action.
- Tool registry. Typed schemas + policies (timeouts, rate limits, auth scopes). The single source of truth for what the agent can do.
- Executor. Boring orchestration code. Looks up the tool, calls it, feeds the result back into context.
- Memory. Short-term (context window, summarized) and long-term (vector store, structured store, key-value store). Always written through explicit tools, never silently.
- Observability. Traces, costs, token usage, tool latency, agent steps per task. Logged per run, queryable per cohort. If you cannot answer “what did the agent do for user 73 yesterday at 3:14 p.m.” in under two minutes, you do not have observability, you have hope.
- Safety / bounded autonomy. Step limits, time limits, cost limits, repeat-detection, human-in-the-loop for write tools above a threshold. The thing that keeps the agent from doing something expensive or stupid.
Many candidates name three or four of these and stop. Naming all seven, with the boring ones (registry, observability, safety) called out as first-class components, is the senior signal.
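To show how boring the executor really is, and where observability hooks in, a sketch like the following is usually enough. It assumes a hypothetical `TOOLS` table and a plain list as the trace sink; a real system would dispatch through the tool registry and ship traces to a proper store.

```python
import time
from typing import Any, Callable

# Hypothetical tool table; a real registry would also carry schemas and policies.
TOOLS: dict[str, Callable[..., Any]] = {
    "kb_search": lambda query: [f"KB article about {query}"],
}


def execute_step(action: dict, context: list[dict], trace: list[dict]) -> None:
    """Run one planner-chosen action and record what happened."""
    tool = TOOLS[action["tool"]]               # look up the tool
    started = time.perf_counter()
    try:
        result, status = tool(**action["args"]), "ok"
    except Exception as exc:                   # surface the failure to the planner
        result, status = f"error: {exc}", "error"
    latency_ms = (time.perf_counter() - started) * 1000
    # Feed the result back into the context the planner will see next.
    context.append({"role": "tool", "name": action["tool"], "content": result})
    # One trace record per step, so the run can be queried later.
    trace.append({"tool": action["tool"], "status": status, "latency_ms": latency_ms})


context: list[dict] = []
trace: list[dict] = []
execute_step({"tool": "kb_search", "args": {"query": "refunds"}}, context, trace)
```

Nothing here involves the model. That separation is what lets you test the executor, the registry, and the tracing without spending a token.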
4. Pick the agent pattern
Now you commit. The interviewer wants to hear you choose, justify, and acknowledge the alternative.
The three patterns to know cold:
- ReAct. Single Think-Act-Observe loop. The model reasons, picks one tool, observes the result, reasons again. Right for short, exploratory tasks where the next step depends on the previous one.
- Plan-and-Execute. A planner emits a multi-step plan up front; an executor runs each step, with optional re-planning on failure. Right for long-running tasks where steps are mostly independent and you want predictable cost or parallelism.
- Planner / tool-registry / executor with memory. The reference architecture above. ReAct or Plan-and-Execute both sit inside this; it is the chassis, they are the gearbox.
For the support-engineer example, the right answer is something like:
“I would start with ReAct. The next tool the agent needs depends heavily on what the previous tool returned: ticket lookup might surface a refund question or an account-locked question, and those paths diverge. I would consider Plan-and-Execute for the batch backfill case (e.g. nightly re-summarization of old tickets), where the steps are predictable and we want to parallelize across thousands. In production I would expect both to coexist.”
That answer is correct, defensible, and shows you can think across task classes. For the deeper version of the pattern trade-off, see ReAct vs Plan-and-Execute.
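The control-flow difference between the two patterns fits in a few lines. In this sketch the planner is a scripted stand-in for a model call, and every tool name is hypothetical; it only illustrates where the decision happens, not how a production loop handles errors or limits.

```python
from typing import Any, Callable, Optional

Tool = Callable[..., Any]
Action = dict  # shape: {"tool": str, "args": dict}


def react(plan_next: Callable[[list], Optional[Action]],
          tools: dict[str, Tool], max_steps: int = 5) -> list:
    """Think-Act-Observe: choose each action after seeing the last observation."""
    observations: list = []
    for _ in range(max_steps):
        action = plan_next(observations)       # "think": may branch on history
        if action is None:                     # planner signals it is done
            break
        observations.append(tools[action["tool"]](**action["args"]))  # act, observe
    return observations


def plan_and_execute(make_plan: Callable[[], list[Action]],
                     tools: dict[str, Tool]) -> list:
    """Emit the whole plan up front, then run the steps without re-deciding."""
    return [tools[a["tool"]](**a["args"]) for a in make_plan()]


tools = {
    "ticket_lookup": lambda ticket_id: {"topic": "refund"},
    "refund_policy": lambda: "refunds allowed within 30 days",
    "account_status": lambda: "locked",
    "summarize": lambda ticket_id: f"summary of {ticket_id}",
}


def scripted_planner(observations: list) -> Optional[Action]:
    # Stand-in for a model call: branch on what the previous tool returned.
    if not observations:
        return {"tool": "ticket_lookup", "args": {"ticket_id": "T-1"}}
    if len(observations) == 1:
        topic = observations[0]["topic"]
        next_tool = "refund_policy" if topic == "refund" else "account_status"
        return {"tool": next_tool, "args": {}}
    return None


interactive = react(scripted_planner, tools)
backfill = plan_and_execute(
    lambda: [{"tool": "summarize", "args": {"ticket_id": t}} for t in ("T-1", "T-2")],
    tools,
)
```

In `react` the second tool call cannot be known until the first returns, which is the support-ticket case. In `plan_and_execute` the steps are fixed before any of them run, which is why the backfill case is cheap to predict and easy to parallelize.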
The trapdoor: do not pick multi-agent here. If the interviewer specifically asks “would you use multi-agent?” then say “only if I can name the specific gain (parallelism or prompt-size reduction) and I am willing to pay the cost and failure-surface premium.” Defaulting to multi-agent on a single-agent problem is the most common reason senior candidates fail this round.
5. Reliability, cost, and evaluation
This section is where the production-experience signal lives. Interviewers know that anyone can name “ReAct.” Fewer candidates can sketch a credible eval loop.
Cover three things:
Reliability. What does the agent do when a tool times out? When the model returns a malformed tool call? When the loop runs more than N steps without converging? Name the strategies: retries with exponential backoff for transient failures, schema validation with a single re-ask for malformed calls, a hard step-and-cost ceiling that surfaces the run to a human if hit.
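Two of those strategies are short enough to sketch. The code below is illustrative only: `TransientToolError`, the delay values, and the prompt strings are assumptions, and the model call is passed in as a plain function so nothing here depends on a particular SDK.

```python
import random
import time
from typing import Any, Callable


class TransientToolError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx, ...)."""


def call_with_backoff(fn: Callable[[], Any], max_attempts: int = 3,
                      base_delay_s: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> Any:
    """Retry transient failures, doubling the delay (plus jitter) each time."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientToolError:
            if attempt == max_attempts - 1:
                raise                          # out of attempts: let the caller escalate
            sleep(base_delay_s * 2 ** attempt + random.uniform(0, 0.1))


def parse_with_one_reask(ask_model: Callable[[str], dict],
                         is_valid: Callable[[dict], bool]) -> dict:
    """Validate a tool call; on failure re-ask exactly once, then give up."""
    call = ask_model("Choose the next tool call.")
    if is_valid(call):
        return call
    call = ask_model("Your last tool call failed schema validation. Try again.")
    if is_valid(call):
        return call
    raise ValueError("malformed tool call after one re-ask")
```

Only wrap read tools in `call_with_backoff` by default. Retrying a write is safe only if the tool is idempotent, which is a separate guarantee.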
Cost. Token cost per run, in expectation and at the tail. Where the cost lives (system prompt, conversation history, tool results echoed back into context). The strategies you would reach for: prompt compression, smaller models for routing or summarization, caching, capping how many tool results stay in context.
Evaluation. This is the one most candidates fudge. Be specific. You need two layers:
- Offline evals. A frozen test set of ~50 to ~200 representative tasks with reference outcomes. Run on every prompt change, every model upgrade, every tool change. Track win-rate, cost per task, average steps. This is your CI.
- Online evals. A small fraction of production traffic graded by an LLM judge (with periodic human spot-checks), plus a feedback loop from the human operator (the CSE in our example). This is your monitoring.
If you can name one specific eval metric that maps to the problem (e.g. “first-draft acceptance rate by CSE,” or “policy-violation rate from the LLM judge”), you are at the top of the band.
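The offline layer can be described as a function small enough to write during the interview. This is a rough sketch under simplifying assumptions: `RunResult` and `run_offline_eval` are hypothetical names, the agent is a stub, and "win" is reduced to an exact match against the reference outcome, where a real harness would likely use a rubric or an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    outcome: str      # what the agent produced or decided
    cost_usd: float   # total spend for the run
    steps: int        # agent steps taken


def run_offline_eval(agent: Callable[[str], RunResult],
                     test_set: list[tuple[str, str]]) -> dict[str, float]:
    """Score an agent on a frozen (task, reference outcome) test set."""
    results = [(agent(task), expected) for task, expected in test_set]
    n = len(results)
    return {
        "win_rate": sum(r.outcome == expected for r, expected in results) / n,
        "cost_per_task": sum(r.cost_usd for r, _ in results) / n,
        "avg_steps": sum(r.steps for r, _ in results) / n,
    }


def stub_agent(task: str) -> RunResult:
    # Stand-in for a real agent run; always costs the same and takes 3 steps.
    outcome = "refund" if "refund" in task else "escalate"
    return RunResult(outcome=outcome, cost_usd=0.02, steps=3)


frozen_test_set = [
    ("customer asks for refund", "refund"),
    ("account locked", "escalate"),
    ("bug in export", "create_bug"),
    ("refund for duplicate charge", "refund"),
]
report = run_offline_eval(stub_agent, frozen_test_set)
print(report["win_rate"])  # 0.75: the stub misses the "create_bug" case
```

Running this on every prompt, model, or tool change and failing the build when `win_rate` drops is what makes the test set behave like CI.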
6. Safety and bounded autonomy
Bounded autonomy is the part that gets you hired in regulated environments. Cover four levers:
- Step / time / cost limits. Hard ceilings per run. Above them, surface to a human.
- Tool-level policies. Per-tool auth scopes, rate limits, idempotency keys for writes, allow-lists for the domains a fetch tool can hit.
- Human-in-the-loop for high-stakes writes. Refunds above $X, account closures, anything irreversible. The agent prepares the action; a human approves it.
- Prompt-injection defenses. Treat tool results as untrusted user input. Strip or escape suspicious instructions. Never let a tool result rewrite the system prompt. If you say nothing else about safety, say this; it is the bar in 2026.
A short, blunt sentence that lands well with interviewers: “The agent should be able to do the right thing 95% of the time. The other 5% is what the bounds are for, and the bounds matter more than the planner.”
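The first and third levers can be sketched as one small budget object plus an approval check. Everything here is an assumption for illustration: the class and function names, the default ceilings, the tool names, and the refund threshold, which stands in for whatever $X your data supports. Repeat detection assumes flat, hashable tool arguments.

```python
import time
from dataclasses import dataclass, field


class BoundExceeded(Exception):
    """Raised when a run should stop and be surfaced to a human."""


@dataclass
class RunBudget:
    max_steps: int = 10
    max_seconds: float = 60.0
    max_cost_usd: float = 0.50
    max_repeats: int = 2                 # identical tool calls tolerated per run
    steps: int = 0
    cost_usd: float = 0.0
    started: float = field(default_factory=time.monotonic)
    seen: dict = field(default_factory=dict)

    def charge(self, tool: str, args: dict, cost_usd: float) -> None:
        """Record one step and raise as soon as any ceiling is crossed."""
        self.steps += 1
        self.cost_usd += cost_usd
        key = (tool, tuple(sorted(args.items())))   # assumes flat, hashable args
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.steps > self.max_steps:
            raise BoundExceeded("step limit")
        if self.cost_usd > self.max_cost_usd:
            raise BoundExceeded("cost limit")
        if time.monotonic() - self.started > self.max_seconds:
            raise BoundExceeded("time limit")
        if self.seen[key] > self.max_repeats:
            raise BoundExceeded(f"repeated call to {tool}")


def needs_human_approval(tool: str, args: dict,
                         refund_threshold_cents: int = 5000) -> bool:
    """High-stakes writes go to a human; the threshold is a placeholder."""
    if tool == "close_account":          # irreversible: always ask
        return True
    return tool == "issue_refund" and args.get("amount_cents", 0) > refund_threshold_cents
```

The executor would call `budget.charge(...)` before every tool call and catch `BoundExceeded` at the top of the run, handing the partial trace to a human instead of retrying.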
7. What you would build first
End with a build order. Two or three milestones, each one a thing you could ship and learn from.
Example for the support-engineer system:
“Week one: a read-only version. The agent has search tools and the draft-reply tool but no write tools. CSE sees the draft and discards or edits it. We measure draft acceptance rate. Week three: add the refund and escalate tools, with human approval. We measure approval rate and override rate. Week six: turn off human approval for refunds under $X, after we have enough data to set the threshold defensibly.”
This section converts a design into a project. Interviewers grade it because it is the one part of the answer that demonstrates you would actually be the person delivering this.
The most common rejection-grade mistakes
Across the loops I have run and watched, four patterns get candidates rejected from this round, in roughly this order of frequency:
Designing multi-agent by default. Already covered. Worth restating: almost every prompt is one agent plus a thoughtful tool registry. Multi-agent is the answer to a specific question (parallelism, prompt-size reduction), not the default architecture.
Skipping evaluation entirely. Candidates who wave at “we would have tests” without naming a test set, a metric, or a frequency get downgraded. The eval loop is not an appendix; it is the part that distinguishes a demo from a system.
Naming components without naming what is inside them. “We have memory” is not an answer. “We have short-term memory in the context window, summarized at 8k tokens, and a long-term store (vector for semantic recall, key-value for user preferences) that is written through explicit tools, never silently” is an answer.
Talking for ten minutes without pausing. The agentic system-design interview is supposed to be a dialogue. If the interviewer has not said anything for five minutes, you are failing, not winning. Pause every two to four minutes. Ask if they want depth on any branch. Let them steer.
The compressed answer (if you only have a minute)
Sometimes the question is a pre-screen and you have a minute, not 40. Memorize this:
“I would design it as a single agent with a typed tool registry, a ReAct loop by default, and a planner-tool-executor-memory chassis, with observability and bounded autonomy as first-class components. I would pick the pattern based on whether steps depend on each other (ReAct) or are independent (Plan-and-Execute). I would not reach for multi-agent unless there is a specific parallelism or prompt-size gain. The eval loop is offline win-rate on a frozen test set, plus online LLM-judge spot-checks. Bounded autonomy is step, time, and cost limits, plus human-in-the-loop for high-stakes writes. The thing I would build first is a read-only version to measure draft-acceptance rate before any write tool ships.”
That paragraph is dense on purpose. Every sentence hits one of the seven sections. If you can deliver it under your breath in 45 seconds, you can deliver any version of this answer.
Practicing the round
Three drills, in order of how much they will move the needle:
- Run the seven-part structure on three different prompts. Take them from the Agentic AI Interview Questions hub (the customer-support agent, the research agent, the code-review agent) and time yourself on each. The first run will be 20 minutes. By the third, you should be at 8 to 12 minutes for the spine, leaving room for the interviewer’s branches.
- Build one read-only agent end to end. Even a small one. Tools, memory, the loop, basic logging. The single fastest way to make the production-experience signal real. Build an AI Agent From Scratch in Python walks through this without a framework.
- Practice the trade-off questions out loud. ReAct vs Plan-and-Execute. Single vs multi-agent. Framework vs raw. The candidates who fluff these are the ones who have only read about them. Saying the words out loud, three or four times, is what makes them sound earned.
If you internalize the seven-part structure and add real build experience underneath it, this round becomes the easiest part of the loop. The candidates who fail it are not the ones who do not know the material. They are the ones who never imposed a structure on their answer.
The interviewer is rooting for you to impose one. Make it easy.
Frequently asked
Quick answers
- What is the interviewer actually testing when they say "design an agentic system"?
- Three things, in order: do you have a structured way to decompose an agent system (not just naming components), have you operated one of these in production (the reliability, cost, and failure-mode questions catch this), and can you reason about trade-offs (ReAct vs Plan-and-Execute, single vs multi-agent, framework vs raw) without sounding religious. The prompt is the door; those three are the room.
- How long should my answer be?
- Plan for 35 to 45 minutes of total airtime, but never monologue. Talk for 2 to 4 minutes, pause, ask if the interviewer wants more depth on any branch, and let them steer. The candidates who fail this round are usually the ones who try to recite a 30-minute lecture without breathing. Interviewers cut you off and downgrade the signal.
- Should I draw a diagram?
- Yes. The seven boxes (user, planner, tool registry, executor, memory, observability, safety) plus arrows are enough. Do not spend time making the boxes pretty; use a whiteboard, Excalidraw, or even just a numbered list shared in chat. The point is a shared visual that you can both point at when you trade off between options. A candidate who designs only in words burns interviewer working memory.
- Do I need to mention specific frameworks (LangGraph, CrewAI, AutoGen)?
- Mention them by name to show you know the landscape, but never lead with them. The right answer is the architecture and the trade-offs. Frameworks are an implementation detail you justify late in the discussion ("for the planner, I would reach for LangGraph because the state-machine model fits this control flow; I would skip it if the agent is single-turn"). Leading with the framework signals shallow experience.
- What is the single biggest mistake candidates make in this round?
- Designing a multi-agent system on instinct. Almost every prompt is solvable with one agent and a thoughtful tool registry. Jumping to "I would have a research agent, a writer agent, and a critic agent" multiplies cost, latency, and failure surface, and signals that you have read about multi-agent demos but not operated one. The senior move is to start with one agent and only add agents when you can name the specific gain (parallelism or prompt-size reduction).