
Stop Building Agent Demos. Start Building Agentic Software.

July 3, 2026 · 10 min read · AI, Agents

Agents should handle intelligence; code should handle reliability. Most production failures happen when we confuse the two.

There is a phase every new technology goes through where the demo is more exciting than the discipline. Agentic AI is in that phase right now. It is very easy to build a workflow that looks magical in a notebook: one prompt searches, another prompt summarizes, a tool call publishes something, and suddenly we have an "agent."

But production does not care about magic. Production cares about repeatability, observability, cost, latency, safety, debugging, and whether the system behaves the same way on Friday evening when a customer is waiting. That is where the real engineering work starts.

The paper I read, "A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows," is valuable because it pushes the conversation away from prompt theatrics and back toward software engineering. The authors use a multimodal news-to-podcast workflow as their case study, but the lessons apply broadly to enterprise automation, RAG systems, voice AI, workflow products, support automation, compliance agents, AI copilots, and multi-agent platforms.

My takeaway is blunt: agentic AI is not a replacement for software architecture. It is a new ingredient inside software architecture. If we put the LLM in charge of everything, the system becomes unpredictable. If we use the LLM only where reasoning is genuinely useful, the system becomes powerful without becoming chaotic.

The core thesis: do not make everything an agent

A lot of agentic systems fail because they overuse agency. They give the model too many tools, too much responsibility, and too many opportunities to improvise. That may look impressive in a prototype, but it becomes a liability once the workflow has side effects: writing to databases, calling paid APIs, sending emails, creating pull requests, changing production configuration, or publishing content.

The better mental model is to treat the LLM as a reasoning component inside a deterministic workflow. The orchestrator owns the process. Normal code owns side effects. Agents own judgment-heavy transformations.

User goal
  -> Workflow Orchestrator
  -> Deterministic preprocessing
  -> Agent: classify / reason / transform
  -> Validator: schema, policy, grounding
  -> Pure function: API / DB / file operation
  -> Observability trace
  -> Final artifact or action

That framing changes how we design the system. We stop asking, "How many agents can we chain together?" and start asking, "Where does reasoning add value, and where does it introduce avoidable risk?"

This is the shift from agent demos to agentic software.

MCP is an interface layer, not your business logic

The paper spends significant attention on MCP, the Model Context Protocol. MCP is important because it standardizes how tools and external systems can be exposed to AI clients. It is useful when you want a client such as a desktop assistant, IDE, internal tool, or agent runtime to discover and invoke capabilities through a common protocol.

But the paper also surfaces a practical problem: if your core workflow depends on an agent reasoning through MCP tool metadata for critical execution steps, you may introduce unnecessary ambiguity. The model has to understand the tool list, select the right tool, infer parameters, and handle protocol-specific behavior. That is a lot of cognitive overhead for something that might be a simple function call.

The paper's GitHub pull request example is a good illustration. Creating a pull request is not a reasoning problem. It is a software operation. The branch, title, body, file set, and destination repo should be explicit. Putting an agent in the middle of that operation increases the surface area for failure.

Less reliable:
Agent -> MCP server -> tool discovery -> GitHub PR

More reliable:
Workflow controller -> createGithubPullRequest(payload)

My interpretation: use MCP to expose capabilities. Do not let MCP become the internal architecture of your product. The workflow backend should own the business process; the MCP server should be a thin adapter.
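
To make the "more reliable" path concrete, here is a minimal sketch in which the workflow controller passes an explicit payload and deterministic code validates it before any network call. The field names (`repo`, `head`, `base`, `title`, `body`, `files`) and the injected `sendRequest` function are assumptions for illustration; this is not the paper's code and it does not mirror any particular GitHub client.

```javascript
// Hypothetical sketch: field names and sendRequest are illustrative assumptions.
const REQUIRED_FIELDS = ["repo", "head", "base", "title", "body", "files"];

function validatePullRequestPayload(payload) {
  const missing = REQUIRED_FIELDS.filter(
    (field) => payload[field] === undefined || payload[field] === ""
  );
  if (missing.length > 0) {
    throw new Error(`Missing pull request fields: ${missing.join(", ")}`);
  }
  if (!Array.isArray(payload.files) || payload.files.length === 0) {
    throw new Error("Pull request must include at least one file");
  }
  return payload;
}

// The controller calls this directly; no model decides whether or how to call it.
function createGithubPullRequest(payload, sendRequest) {
  const safe = validatePullRequestPayload(payload);
  return sendRequest({
    repo: safe.repo,
    head: safe.head,
    base: safe.base,
    title: safe.title,
    body: safe.body,
    files: safe.files,
  });
}
```

Injecting `sendRequest` keeps the function testable without network access, and a malformed payload fails loudly before anything reaches the remote API.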

Use direct function calls when reasoning is not needed

This is the most important engineering principle in the paper. Tool calls are structured, but they are still LLM-mediated. The model still has to decide to call the tool, format the arguments, and interpret the result. That is useful when the operation requires language reasoning. It is wasteful when the operation is deterministic.

Database writes, timestamps, file generation, queue scheduling, API calls, retries, commits, notifications, and storage operations should usually be regular code. They should be typed, tested, logged, retried, and monitored like the rest of your backend.

The litmus test is simple: if a junior engineer could write the exact call, arguments included, from data the workflow already has, the LLM should probably not be deciding how to make it.

Reserve agents for:
- classification
- summarization
- extraction
- planning
- reasoning
- transformation
- judgment

Use code for:
- database writes
- API calls
- retries
- auth
- file storage
- timestamps
- queueing
- publishing
- billing and metering

This distinction is how we keep the workflow intelligent without making it fragile.
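
The split between the two lists can be sketched in a few lines. In this example the agent only returns a label, and ordinary code owns the timestamp, the record shape, and the write. `classifyWithAgent` is a hypothetical stand-in for a real model call (here a regular expression so the example runs offline), and `db` is an in-memory array standing in for a database.

```javascript
// Hypothetical sketch: classifyWithAgent stands in for a real LLM call.
const ALLOWED_LABELS = ["billing", "bug", "other"];

// Judgment-heavy step: in a real system this would call a model.
function classifyWithAgent(ticketText) {
  return /refund|invoice|charge/i.test(ticketText) ? "billing" : "other";
}

// Deterministic step: code owns the timestamp, the record shape, and the write.
function saveTicket(db, ticketText, label, now = new Date()) {
  if (!ALLOWED_LABELS.includes(label)) {
    throw new Error(`Unexpected label from agent: ${label}`);
  }
  const record = { text: ticketText, label, createdAt: now.toISOString() };
  db.push(record);
  return record;
}

const db = [];
const text = "I was charged twice for my invoice";
const record = saveTicket(db, text, classifyWithAgent(text));
console.log(record.label); // prints "billing"
```

The model never sees the database, and the code never trusts a label it does not recognize.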

Do not give one agent too many tools

A multi-tool agent sounds flexible, but flexibility is not always the same as reliability. Every additional tool increases the decision space. The model has to decide which tool to use, when to use it, in what order, with what parameters, and how to recover when one step fails.

In prototypes, this may work. In production, it often creates intermittent failures: the agent skips a tool, calls tools in the wrong order, uses a similar-but-wrong tool, or produces a response that looks successful even though the side effect never happened.

A better default is single-responsibility agents with narrow tool access. One agent should have one clear role and, when possible, one clearly relevant tool.

Bad:
Content Agent
  tools: search, scrape, summarize, publish, validate, save

Better:
Search Agent      -> search
Scrape Function   -> scrape page deterministically
Summary Agent     -> summarize
Validator         -> check output contract
Publish Function  -> publish artifact

This is not dogma. Some agents can safely use two or three tightly related tools. But the burden of proof should be on complexity. Start narrow, then expand only when the evals prove it is safe.
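
One way to enforce narrow tool access is an allowlist that every tool call must pass through. The sketch below is an assumption about how such a gate could look, not something the paper prescribes; the agent names and the stub tools are hypothetical and only echo the diagram above.

```javascript
// Hypothetical sketch: agent names and stub tools echo the diagram above.
const TOOL_ALLOWLIST = new Map([
  ["searchAgent", ["search"]],
  ["summaryAgent", ["summarize"]],
]);

const TOOLS = {
  search: (query) => [`result for ${query}`],
  summarize: (text) => text.slice(0, 40),
  publish: (artifact) => ({ published: artifact }),
};

// Every tool call goes through this gate, so an agent cannot reach a tool
// outside its role even if the model asks for it.
function invokeTool(agentName, toolName, argument) {
  const allowed = TOOL_ALLOWLIST.get(agentName) || [];
  if (!allowed.includes(toolName)) {
    throw new Error(`${agentName} is not allowed to call ${toolName}`);
  }
  return TOOLS[toolName](argument);
}
```

Expanding an agent's access then becomes a reviewable one-line change to the allowlist rather than a prompt edit.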

Single-responsibility agents are just good software design

The paper's Veo example is a useful pattern. One early design asked a single agent to create a video-generation JSON spec and also "generate" the video. That blends planning with execution. The result is predictable: malformed JSON, mixed prose and data, imagined file paths, and fake execution status.

The fix is to split responsibilities. Let an agent produce a structured JSON plan. Then let deterministic code validate that JSON and call the actual video-generation API. The agent designs. The system executes.

VeoJsonBuilderAgent
  input: final script
  output: valid JSON only

validateVeoJson(json)
  output: typed, safe payload

generateVideo(payload)
  output: actual file, actual trace, actual status

This pattern applies everywhere. A compliance agent should not both interpret policy and send the customer notification. A recruiting agent should not both evaluate a candidate and update hiring status. A support agent should not both diagnose an issue and issue a refund without deterministic guardrails. Separate reasoning from irreversible action.
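
As a rough illustration of the `validateVeoJson` step, the sketch below parses the agent output strictly and returns only known fields. The spec fields (`prompt`, `durationSeconds`, `aspectRatio`) and the allowed values are assumptions made for this example; they are not taken from the paper or from the Veo API.

```javascript
// Hypothetical sketch: the spec fields (prompt, durationSeconds, aspectRatio)
// are illustrative and are not taken from the paper or the Veo API.
const ALLOWED_ASPECT_RATIOS = ["16:9", "9:16"];

function validateVeoJson(rawAgentOutput) {
  let spec;
  try {
    // JSON.parse rejects output that mixes prose with data.
    spec = JSON.parse(rawAgentOutput);
  } catch (error) {
    throw new Error("Agent output is not valid JSON");
  }
  if (spec === null || typeof spec !== "object" || Array.isArray(spec)) {
    throw new Error("Spec must be a JSON object");
  }
  if (typeof spec.prompt !== "string" || spec.prompt.trim() === "") {
    throw new Error("Spec needs a non-empty prompt");
  }
  if (!Number.isInteger(spec.durationSeconds) || spec.durationSeconds <= 0) {
    throw new Error("durationSeconds must be a positive integer");
  }
  if (!ALLOWED_ASPECT_RATIOS.includes(spec.aspectRatio)) {
    throw new Error("Unsupported aspectRatio");
  }
  // Return only the known fields so extra keys such as invented file paths
  // or status claims never reach the execution step.
  return {
    prompt: spec.prompt.trim(),
    durationSeconds: spec.durationSeconds,
    aspectRatio: spec.aspectRatio,
  };
}
```

Because the validator rebuilds the payload from known fields, a claim like `"status": "done"` in the agent output is simply dropped; only `generateVideo` can report real status.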

Prompts are production assets

One of the most practical recommendations in the paper is to store prompts externally and load them at runtime. This is more than cleanliness. It is operational maturity.

Once prompts influence production behavior, they deserve the same discipline as configuration and policy. They need versioning, review, rollback, access control, testing, and traceability. A prompt hidden inside a random service file is not maintainable. A prompt registry is.

/prompts
  /agents
    search-agent.md
    filter-agent.md
    reasoning-agent.md
    json-builder-agent.md
  /policies
    grounding.md
    safety.md
    tone.md
    escalation.md
  /evals
    regression-cases.jsonl

External prompts also make collaboration easier. Product, legal, domain experts, safety reviewers, and engineers can work on behavior without coupling every wording change to an application deployment. But external prompts are not enough by themselves. They must be paired with typed outputs, validators, evals, and runtime tracing.
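
A minimal sketch of runtime prompt loading might look like the following. It assumes prompts live as files under a root directory like the tree above, and it derives a short version tag from a SHA-256 hash of the file content so a trace can record exactly which wording was used. The function name `loadPrompt` and the 12-character tag length are arbitrary choices for illustration.

```javascript
const { createHash } = require("node:crypto");
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical sketch: loads a prompt file at runtime and tags it with a
// content hash so traces can record exactly which wording was used.
function loadPrompt(rootDir, relativeName) {
  const filePath = path.join(rootDir, relativeName);
  const text = fs.readFileSync(filePath, "utf8");
  const version = createHash("sha256").update(text).digest("hex").slice(0, 12);
  return { name: relativeName, version, text };
}

// Example call, assuming the directory layout shown above exists on disk:
// const prompt = loadPrompt("./prompts", "agents/search-agent.md");
```

A content hash is only one option; a team that already versions prompts in Git could record the commit SHA instead.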

Multi-model reasoning is a robustness technique, not a silver bullet

The paper recommends a model-consortium pattern: multiple models generate independent drafts, and a reasoning agent consolidates them. I like this pattern when the output is important enough to justify the cost and latency. It can reveal disagreements, reduce dependence on a single model's failure mode, and make the final answer more robust.

But we should be careful with the marketing language. Multi-model reasoning does not automatically make a system responsible, factual, or safe. It is a robustness technique. You still need source grounding, policy checks, factual verification, audit trails, and human review for high-stakes actions.

Model A draft
Model B draft
Model C draft
      -> Reasoning / Auditor Agent
      -> Grounded final output
      -> Validator + trace

Use this pattern selectively: legal summaries, medical-adjacent content, executive reports, financial analysis, code migration planning, compliance workflows, or public publishing. Do not use it blindly for every chat response.
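
The consortium flow can be sketched as two small functions: one gathers drafts in parallel and tolerates individual model failures, and one hands the surviving drafts to an auditor. Everything here is an assumption for illustration: each "model" is any object with a `name` and an async `generate` function, the auditor is injected by the caller, and the minimum of two drafts is an arbitrary threshold.

```javascript
// Hypothetical sketch: a "model" is any object with a name and an async
// generate(prompt) function; the auditor is injected by the caller.
async function gatherDrafts(models, prompt, minDrafts = 2) {
  const settled = await Promise.allSettled(
    models.map(async (model) => model.generate(prompt))
  );
  const drafts = [];
  settled.forEach((result, index) => {
    if (result.status === "fulfilled") {
      drafts.push({ model: models[index].name, text: result.value });
    }
  });
  // With fewer than minDrafts there is nothing meaningful to compare.
  if (drafts.length < minDrafts) {
    throw new Error(`Only ${drafts.length} drafts succeeded; need ${minDrafts}`);
  }
  return drafts;
}

// The auditor can be an LLM call in production and a plain function in tests.
async function runConsortium(models, prompt, auditor) {
  const drafts = await gatherDrafts(models, prompt);
  const final = await auditor(drafts);
  return { final, drafts };
}
```

Returning the drafts alongside the final answer keeps the disagreement visible in the trace, which is where much of the pattern's value comes from.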

Separate the workflow backend from the MCP server

A clean production architecture keeps the workflow backend independent from the interface used to invoke it. Your core workflow should be available through a stable API. MCP, CLI, Slack, web UI, internal admin tools, and scheduled jobs can all call that API.

This gives you one source of truth for business logic and many adapters at the edge.

Claude Desktop / IDE / Internal Client
            -> MCP Server
            -> Workflow API
            -> Orchestrator
            -> Agents + Functions + Validators
            -> Logs, traces, metrics, artifacts

This also makes testing easier. You can run workflow tests against the API without needing MCP. You can version the API. You can secure it. You can scale it. You can observe it. MCP remains useful, but it is not the place where the product logic should live.
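
Here is a small sketch of "one source of truth, many adapters." The core function `runNewsWorkflow` and both adapters are hypothetical. The MCP-style adapter only imitates the general shape of a tool call and a text result; it does not use a real MCP SDK.

```javascript
// Hypothetical sketch: one core function, two thin adapters. The adapter
// shapes are illustrative and do not use a real MCP SDK.
function runNewsWorkflow(request) {
  if (typeof request.topic !== "string" || request.topic.trim() === "") {
    throw new Error("topic is required");
  }
  // Business logic lives here and nowhere else.
  return { topic: request.topic.trim(), status: "queued" };
}

// Adapter for an MCP-style tool call: translate arguments, delegate, wrap.
function handleMcpToolCall(toolCall) {
  const result = runNewsWorkflow({ topic: toolCall.arguments.topic });
  return { content: [{ type: "text", text: JSON.stringify(result) }] };
}

// Adapter for a CLI: translate argv, delegate, format for a terminal.
function handleCli(argv) {
  const result = runNewsWorkflow({ topic: argv.join(" ") });
  return `${result.status}: ${result.topic}`;
}
```

Neither adapter contains a rule about the workflow itself, so tests can target `runNewsWorkflow` directly and the adapters stay a few lines each.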

Deploy agentic systems like real backends

Agentic workflows are still backend systems. They need containers, environment management, secrets, queues, retries, tracing, metrics, logs, rate limits, access control, and cost controls. Whether you use Kubernetes on day one depends on your scale and team maturity, but the operational concerns are unavoidable.

For early products, a managed container platform plus a queue may be enough. For enterprise workflows, long-running jobs, parallel agents, model gateways, custom tool services, and heavy observability, Kubernetes becomes attractive. The important part is not Kubernetes itself. The important part is treating the system as production software.

Recommended production components:
- Workflow API
- Job queue / workflow engine
- Agent workers
- Model gateway
- Prompt registry
- Tool services
- Artifact storage
- Secrets manager
- Observability stack
- Evaluation pipeline
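
Of the operational concerns above, bounded retries are the easiest to sketch. The helper below is an assumption about one reasonable shape, not a prescribed implementation: `withRetry`, its option names, and the default values are all hypothetical.

```javascript
// Hypothetical sketch: bounded retries with exponential backoff for a
// flaky step such as a model or API call.
async function withRetry(step, { maxAttempts = 3, baseDelayMs = 200 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      return await step(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // The delay doubles after each failure: with the defaults,
        // 200 ms before attempt 2 and 400 ms before attempt 3.
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // All attempts failed: surface the last error to the orchestrator.
  throw lastError;
}
```

The hard cap on attempts doubles as a simple cost control. Retrying a step that has side effects is only safe if that step is idempotent, so this wrapper fits reads and model calls more naturally than publishes.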

The best agentic architecture is usually boring

The paper ends with KISS: keep it simple. This matters because agentic AI invites abstraction addiction. It is tempting to build a meta-agent router, dynamic planner, tool resolver, agent factory, memory graph, and orchestration DSL before the product has one reliable customer workflow.

Resist that temptation. A readable function that calls five explicit steps is often better than a generic framework that nobody can debug. Production systems fail in the gaps between components. The more magical the orchestration, the harder it is to reason about those gaps.

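async function runWorkflow(input) {
  // Agents handle the judgment-heavy steps: finding and filtering sources.
  const sources = await searchAgent.run(input.topic);
  const filtered = await filterAgent.run(sources);
  // Scraping is deterministic, so it is a plain function rather than an agent.
  const pages = await scrapePages(filtered);
  const drafts = await generateDrafts(pages);
  const final = await reasoningAgent.run(drafts);

  // Assumes validate() throws on a failed check, so nothing invalid is published.
  await validate(final);
  await publishArtifact(final);

  return final;
}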

That is not simplistic. That is maintainable.

A practical blueprint I would use

If I were building a production agentic workflow today, I would start with this architecture:

Client
  -> API Gateway
  -> Workflow API
  -> Orchestrator
      -> Deterministic preprocessing
      -> Small specialized agents
      -> Schema validators
      -> Pure functions for side effects
      -> Policy checks
      -> Artifact storage
      -> Trace + metrics + logs
  -> Final response / action

The design principles would be:

- Every agent has a written contract: input, output, allowed tools, failure behavior, and evaluation cases.
- Every side effect is executed by deterministic code, not by model improvisation.
- Every prompt is versioned and loaded from a prompt registry or repository.
- Every model output that matters is validated against a schema or policy.
- Every workflow run produces a trace that explains what happened.
- Every expensive or risky path has retries, limits, and human-review options.
- Every abstraction must earn its place by reducing operational complexity.
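
The tracing principle above is cheap to adopt early. The sketch below wraps each workflow step so that every run leaves a list of events with a name, status, and duration. `createTrace`, the event fields, and the in-memory `events` array are assumptions for illustration; a production system would more likely forward these events to its existing observability stack.

```javascript
// Hypothetical sketch: wraps each workflow step so every run leaves a trace.
function createTrace(runId) {
  const events = [];

  async function step(name, fn) {
    const startedAt = Date.now();
    try {
      const output = await fn();
      events.push({ runId, name, status: "ok", durationMs: Date.now() - startedAt });
      return output;
    } catch (error) {
      events.push({
        runId,
        name,
        status: "error",
        message: error.message,
        durationMs: Date.now() - startedAt,
      });
      // Re-throw so the orchestrator still decides how to handle the failure.
      throw error;
    }
  }

  return { events, step };
}
```

A step is then written as `await trace.step("search", () => searchAgent.run(topic))`, and a failed run still explains which step broke and why.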

My opinionated rules for production agentic AI

Decision	Default Choice	Reason
Need language judgment?	Use an agent	Reasoning is where LLMs add value.
Need to write to DB/API/file?	Use code	Side effects must be deterministic and testable.
Need external client interoperability?	Expose MCP adapter	MCP is great at the edge.
Need core workflow logic?	Use backend API/orchestrator	Business logic should be stable and observable.
Need higher confidence output?	Use multi-model + auditor selectively	Improves robustness but costs more.
Need prompt updates?	Use external prompt registry	Versioning and rollback matter.
Need scale and isolation?	Use containers; Kubernetes when justified	Ops maturity should match system maturity.

Closing thought

The next wave of agentic AI products will not be won by the teams that create the longest chains of agents. It will be won by the teams that know where not to use agents.

The mature pattern is not "LLM everywhere." The mature pattern is a disciplined system where agents reason, validators constrain, functions execute, observability explains, and humans can intervene when the risk is high.

That is how we move from impressive demos to dependable agentic software.

References

Bandara, Eranga, et al. A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows. arXiv:2512.08769, December 2025. See especially Section 3 on best practices for production-grade agentic workflows.

