You’ve built the agent. But can you trust it with your inventory?
Lukasz
31 March 2026
Deploying an agentic system is the easy part (well, sort of; more on that another time). Knowing whether it makes the right decisions consistently, at scale, and under pressure is the hard part.
The gap between a top-tier model and a production-grade agent does not show up on an arena leaderboard. It shows up in your monitoring of how well the system follows instructions and generates correct outputs while making hundreds of tool calls without drifting off track. And if your engineering strategy depends on blindly dropping complex behavioral instructions into a large, fragile system prompt, you are creating avoidable operational risk.
What turns a promising demo into a production-grade system comes down to two highly practical, distinctly unexciting components, harnesses and evaluations, plus a robust workflow to run them.
The Expectation: Harness
A raw model is a reasoning brain, and its context window is short-term memory. The harness sets the expectations and provides the guardrails that keep focus sharp. It manages the execution lifecycle, curates context, and coordinates tool use.
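Those three jobs can be sketched in a few lines. This is a minimal, illustrative harness, not a real framework: the message shapes, the `Harness` class, and the stub model are all assumptions made for the sketch. The harness enforces a step budget (execution lifecycle), trims history to a window (context curation), and dispatches only registered tools (tool coordination).

```python
from dataclasses import dataclass, field

@dataclass
class Harness:
    """Minimal harness sketch: step budget, curated context, tool dispatch."""
    max_steps: int = 5        # execution-lifecycle guardrail
    max_context: int = 3      # keep only the N most recent messages
    history: list = field(default_factory=list)

    def run(self, model, tools, task):
        self.history.append({"role": "user", "content": task})
        for _ in range(self.max_steps):
            # Curate context: the model never sees unbounded history.
            action = model(self.history[-self.max_context:])
            if action["type"] == "final":
                return action["content"]
            # Coordinate tool use: dispatch only to registered tools.
            result = tools[action["tool"]](**action["args"])
            self.history.append({"role": "tool", "content": result})
        raise RuntimeError("step budget exhausted")

def stub_model(context):
    """Stand-in for a real model: calls a tool once, then answers."""
    if any(m["role"] == "tool" for m in context):
        return {"type": "final", "content": context[-1]["content"]}
    return {"type": "tool_call", "tool": "count", "args": {"sku": "A1"}}

h = Harness()
answer = h.run(stub_model, {"count": lambda sku: f"{sku}: 12 units"}, "How many A1?")
```

The point of the sketch is that every guardrail lives outside the model: swap the model and the budget, window, and tool registry stay put.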
The Reality Check: Evaluations
How do you know the harness is working? Evals. You cannot ship an agent on a prayer. In fact, your evaluation suite is the most valuable AI asset your organization owns. The clever, hand-coded logic you write today may be wiped out by the next model update, so your architecture must be modular and easy to replace.
The Practice: Robust Workflow
Build robust, atomic tools, let the model plan, and use the harness to enforce the rules. Then run structured test suites that exercise prompts and tools together, and grade the results. Here is how you set up an agent so it survives contact with production:
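"Atomic" means each tool does one validated thing and returns structured output the harness can check, while planning stays with the model. A sketch under assumptions: the inventory store, tool names, and return shapes are all invented for illustration.

```python
INVENTORY = {"A1": 12}  # illustrative in-memory store

def check_stock(sku: str) -> dict:
    """Atomic tool: one narrow job, structured output, explicit errors."""
    if sku not in INVENTORY:
        return {"ok": False, "error": f"unknown sku: {sku}"}
    return {"ok": True, "sku": sku, "quantity": INVENTORY[sku]}

def reserve_stock(sku: str, qty: int) -> dict:
    """Another atomic tool: validates before mutating, never guesses."""
    item = check_stock(sku)
    if not item["ok"] or item["quantity"] < qty:
        return {"ok": False, "error": f"cannot reserve {qty} of {sku}"}
    INVENTORY[sku] -= qty
    return {"ok": True, "sku": sku, "reserved": qty}
```

Errors come back as data rather than exceptions, so the model can plan around a failed reservation instead of the harness crashing mid-run.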
Step 1: Automatic Workflow Tests
You need automated testing for end-to-end workflows. Break the agent's responsibilities into features and write fast, cheap assertions for each. Run these tests asynchronously on every agent run.
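One way to sketch this: each feature becomes a cheap predicate over the run's trace, and the whole battery runs off the response path. The trace shape and check names here are hypothetical, not a standard format.

```python
import asyncio

# Hypothetical trace shape: the tool calls an agent run made, plus its answer.
def assert_no_destructive_calls(trace: dict) -> bool:
    """Feature check: the agent never touched destructive tools."""
    return all(c["tool"] != "delete_item" for c in trace["tool_calls"])

def assert_answer_cites_sku(trace: dict) -> bool:
    """Feature check: the answer references a SKU it actually looked up."""
    return any(c["args"].get("sku", "") in trace["answer"] for c in trace["tool_calls"])

CHECKS = [assert_no_destructive_calls, assert_answer_cites_sku]

async def run_checks(trace: dict) -> dict[str, bool]:
    # Async so checks can be fired on every run without blocking the agent.
    return {check.__name__: check(trace) for check in CHECKS}

trace = {"tool_calls": [{"tool": "check_stock", "args": {"sku": "A1"}}],
         "answer": "A1 has 12 units"}
results = asyncio.run(run_checks(trace))
```

Each check is trivially cheap, so running all of them on every run costs almost nothing compared to the model calls themselves.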
Step 2: Observable Traces
Log the results, feed them back live to the coordinating agent, and review them manually later. After all, if you cannot see what the agent is doing, you cannot fix it. Save traces of user inputs, tool calls, and model responses, then remove all friction from reviewing them.
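A low-friction starting point is append-only JSON lines: one event per line, tagged with a run ID, so traces are trivial to write and trivial to grep or load later. The file path and event fields below are illustrative choices, not a prescribed schema.

```python
import json
import time
import uuid

def log_event(path: str, run_id: str, kind: str, payload: dict) -> None:
    """Append one trace event as a JSON line: cheap to write, easy to review."""
    event = {"run_id": run_id, "ts": time.time(), "kind": kind, **payload}
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

run_id = str(uuid.uuid4())
log_event("traces.jsonl", run_id, "user_input", {"text": "How many A1?"})
log_event("traces.jsonl", run_id, "tool_call",
          {"tool": "check_stock", "args": {"sku": "A1"}})
log_event("traces.jsonl", run_id, "model_response", {"text": "A1 has 12 units"})
```

Because every event carries the run ID, reconstructing a full run for review is a one-line filter over the file.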
Step 3: Strong Feedback Loop
Your tests will fail. That is a feature, not a bug. Use strong models as automated judges to score traces, but align them to a human-labeled baseline. When an agent fails an evaluation, that trace becomes training data, or a hint for your next prompt change or fine-tune.
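Aligning a judge means measuring how often it agrees with human labels before you trust its scores. A minimal sketch: `keyword_judge` is a toy stand-in for a strong-model call, and the baseline pairs and threshold are invented for illustration.

```python
def agreement(judge, baseline: list[tuple[str, int]]) -> float:
    """Fraction of human-labeled traces on which the judge agrees."""
    hits = sum(int(judge(trace) == label) for trace, label in baseline)
    return hits / len(baseline)

def keyword_judge(trace: str) -> int:
    # Toy stand-in for an LLM judge: pass if the trace states a concrete count.
    return int("units" in trace)

# Human baseline: (trace, human pass/fail label).
baseline = [
    ("A1 has 12 units", 1),
    ("I think it's in stock somewhere", 0),
]
score = agreement(keyword_judge, baseline)
trusted = score >= 0.9  # only use the judge at scale once it tracks humans
```

Only once agreement clears your threshold does the judge earn the right to score traces you never look at.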
The Bottom Line: Magic Meets Reality
A frontier model can feel magical. But prompt-heavy systems are inherently unstable, which leads to inconsistent outcomes. The harness, combined with structured testing, observable traces, and clear success criteria, is what makes it safe to put that magic into production. If you do not have those mechanisms in place, you are not running an agent. You are running an alchemy experiment on live inventory.