
AgentOps: Three Layers Your AI Agents Need

March 30, 2026 · 5 min read · 990 words
IBM · AI Agents · AI in Healthcare · Generative AI
Bri Kopecki presenting AgentOps in front of a whiteboard with the three-layer framework.
Image: Screenshot from YouTube.

Key insights

  • AgentOps fills the gap between demo and production. Most agent projects don't fail because the agent doesn't work. They fail because no one built the infrastructure to prove it works.
  • The three layers must be used in order: see what's happening, judge if it's good, then make it better. You can't optimize what you can't evaluate, and you can't evaluate what you can't observe.
  • The healthcare numbers show agents beat humans on quality, not just speed. A 78% first-pass approval rate versus 52% for manual submissions means the agents write better authorization requests, not just faster ones.
Source: YouTube
Published March 30, 2026
IBM Technology
Host: Bri Kopecki

This is an AI-generated summary. The source video may include demos, visuals and additional context.


In Brief

Most teams running agents in production are flying blind, says Bri Kopecki, AI Customer Success Engineer at IBM. In this video, she walks through AgentOps (Agent Operations), the discipline of actually managing AI agents once they're live. The framework has three layers: observability (seeing what happens), evaluation (judging if it's good), and optimization (making it better). Without all three, you're not running agents in production. You're just hoping.

From DevOps to AgentOps

Software teams have been building operational frameworks for decades. DevOps (a set of practices for delivering and maintaining software) gave us the tools to deploy software reliably. Then MLOps gave us tools to manage machine learning models. AgentOps is what comes next, built for a new kind of system: one that doesn't just predict, but acts. An AI agent can open tickets, update records, call external APIs, and make decisions. You need to know exactly what it did, why it did it, and whether it should have done it at all.

You cannot improve what you cannot measure, and you cannot measure what you cannot see. That's why AgentOps has three layers, and why the order matters.

Image: Screenshot from YouTube.

Layer 1: Observability

Observability is your visibility layer. If an agent made a decision, you need to be able to reconstruct exactly how it got there: every tool call, every model invocation, every handoff between agents.

Three metrics matter most here:

  • End-to-end trace duration: how long from the moment a user makes a request to the moment they get an answer. This is your headline number. If it's slow, nothing else matters.
  • Agent-to-agent handoff latency: when one agent passes work to another, how long does that take? In multi-agent systems, these handoffs can pile up and become a hidden bottleneck.
  • Cost per request: how much does each interaction cost in API calls? This is the metric your finance team will ask about. Know it before they do.
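As a rough sketch of how these three numbers fall out of trace data (the `Span` schema here is illustrative, not taken from any specific tracing library), each request's spans can be reduced to the trio of observability metrics:

```python
from dataclasses import dataclass

@dataclass
class Span:
    agent: str       # which agent produced this span
    start_ms: float  # span start, in ms since the request began
    end_ms: float    # span end
    cost_usd: float  # API cost attributed to this span

def observability_metrics(spans: list[Span]) -> dict:
    """Derive the three headline observability metrics for one request."""
    spans = sorted(spans, key=lambda s: s.start_ms)
    trace_duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    # Handoff latency: the gap between one agent's last span ending
    # and the next agent's first span starting.
    handoffs = [
        nxt.start_ms - cur.end_ms
        for cur, nxt in zip(spans, spans[1:])
        if cur.agent != nxt.agent
    ]
    return {
        "trace_duration_ms": trace_duration_ms,
        "avg_handoff_ms": sum(handoffs) / len(handoffs) if handoffs else 0.0,
        "cost_per_request_usd": sum(s.cost_usd for s in spans),
    }
```

With two spans from two agents separated by a 340 ms gap, `avg_handoff_ms` comes out to 340 and the cost is simply the sum across spans.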

Layer 2: Evaluation

Observability tells you what happened. Evaluation tells you if it was good. Seeing the steps isn't enough. You need to know whether the agent actually did its job correctly.

Three metrics matter most here:

  • Task completion rate: out of every hundred requests, how many actually completed without a human stepping in? This is the single most important number.
  • Guardrail violation rate: how often does the agent try to do something it shouldn't, like leak sensitive data or give advice outside its scope? This number should be very small. If it isn't, you have a problem.
  • Factual accuracy rate: when the agent states a fact (a diagnosis code, a drug dosage, a policy number), is it correct? In regulated industries, this is not optional.
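These three evaluation metrics are all simple ratios over logged outcomes. A minimal sketch, assuming a hypothetical per-request record (`completed`, `guardrail_violation`, and counts of checked vs. correct facts; the field names are mine, not from the video):

```python
def evaluation_metrics(outcomes: list[dict]) -> dict:
    """Compute the three evaluation ratios over a batch of logged requests.

    Each outcome is an illustrative record:
      {"completed": bool, "guardrail_violation": bool,
       "facts_checked": int, "facts_correct": int}
    """
    n = len(outcomes)
    total_facts = sum(o["facts_checked"] for o in outcomes)
    return {
        # Share of requests finished with no human stepping in
        "task_completion_rate": sum(o["completed"] for o in outcomes) / n,
        # Share of requests where the agent tried something it shouldn't
        "guardrail_violation_rate": sum(o["guardrail_violation"] for o in outcomes) / n,
        # Of all facts the agent stated, how many were verified correct
        "factual_accuracy_rate": (
            sum(o["facts_correct"] for o in outcomes) / total_facts
            if total_facts else 1.0
        ),
    }
```

In practice the "facts checked" column is the expensive part: it requires ground truth, whether from a reference database or a human review panel like the pharmacists in the example below.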

Layer 3: Optimization

Once you can see what's happening and judge whether it's good, you can make improvements that actually stick. Without the first two layers, optimization is guesswork.

Three metrics drive improvement:

  • Prompt token efficiency: tokens are the word-pieces the AI reads and writes, and you pay for each one. How much output quality are you getting per token? After tuning, you might get the same quality with 40% fewer tokens. That's real money saved on every request.
  • Retrieval precision at K: K is just the number of documents you ask the agent to fetch (say, 5). When the agent pulls those top documents from a knowledge base, are they actually relevant? If you retrieve five and only two are useful, the agent is working with noise that confuses its answers.
  • Handoff success rate: when one agent passes work to another, does it succeed? A 98% success rate sounds great until you realize that 2% represents thousands of failed transactions at scale.
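Retrieval precision at K is the most mechanical of the three to compute. A minimal sketch (document IDs and the notion of a labeled "relevant" set are assumptions; how you obtain relevance labels is the hard part in practice):

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-K retrieved documents that are actually relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

# The article's example: five documents fetched, only two relevant -> 0.4
score = precision_at_k(["d1", "d2", "d3", "d4", "d5"], {"d1", "d4"}, k=5)
```

A precision@5 of 0.4 means most of the agent's context window is noise, which is exactly the failure mode the article warns about.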

What this looks like in the real world

To make these layers concrete, Kopecki walks through a healthcare example: prior authorization. This is the process where an insurance company has to approve a medication before a patient can receive it. Traditionally it takes three to five business days: phone calls, faxes, paperwork, waiting.

Now imagine two AI agents handling this. One pulls clinical records from the hospital system. The other submits the documentation package to the insurance portal and handles the back-and-forth. That three-to-five-day process completes in under four hours, with no human needed 94% of the time.

Sounds impressive. But how do you know it's working correctly?

Image: Screenshot from YouTube.

The AgentOps dashboard makes the answer visible. On observability: the average authorization completes in 2.8 hours, down 85% from the manual process. Agent-to-agent handoffs take 340 milliseconds, within the 500-millisecond target. Cost per authorization: $0.47, compared to $25 for a human processing the same request manually.

On evaluation: the task completion rate without human intervention is 94.2%. Diagnosis code accuracy is 99.4%. Lab value accuracy is 99.8%. The guardrail violation rate is 0.8%, and those cases are automatically held for human review. A panel of pharmacists reviewed 5% of submissions independently and rated 97.3% as clinically appropriate.

And then there's the number that makes the real case. The agents are not just faster; they produce better work. The first-pass approval rate is 78% for the agents versus 52% for manual submissions, meaning most agent requests are approved on the first try, with no back-and-forth over missing information. The agents simply write better authorization requests.

On optimization: by tuning the prompts from 1,800 tokens down to 1,100 tokens, the team cut costs by 39% with no drop in quality.
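The arithmetic checks out under the simple assumption that cost scales linearly with prompt length: (1,800 − 1,100) / 1,800 ≈ 0.39. A one-line sketch of that calculation:

```python
def token_savings(tokens_before: int, tokens_after: int) -> float:
    """Fractional cost reduction from shrinking a prompt,
    assuming per-token pricing (cost scales linearly with length)."""
    return (tokens_before - tokens_after) / tokens_before

# (1800 - 1100) / 1800 ~= 0.389, i.e. roughly 39% lower cost per request
savings = token_savings(1800, 1100)
```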

The window is narrow

In 2024, $5 billion worth of agents shipped. By 2030, the projected figure is $50 billion. A lot of teams will ship agents. Most will struggle to operate them. The teams that invest in AgentOps early are the ones that will still be running those agents a year from now, confidently and at scale.


Glossary

  • AgentOps: The discipline of managing AI agents in production: monitoring what they do, evaluating whether they do it well, and improving them over time.
  • Observability: The ability to see exactly what a system is doing, step by step. Think of it as a dashcam for your AI agent.
  • Prior authorization: The process where an insurance company must approve a medication or procedure before a patient can receive it.
  • Guardrail: A rule that stops an AI agent from doing something it shouldn't, like leaking patient data or giving advice outside its area.
  • Handoff: When one AI agent passes its work to another agent to continue the task.
  • Retrieval precision at K: Of the K documents an agent fetches from a knowledge base, how many are actually relevant to the task at hand.
  • Prompt token efficiency: How much useful output you get per input token sent to the model. Fewer tokens for the same quality means lower cost per request.
  • EHR (Electronic Health Record): The digital system hospitals use to store all patient information, including diagnoses, lab results, and treatment history.
