How to Secure AI Agents: IBM and Anthropic's Guide

Key insights
- AI agents must be secured, governed, and audited. These three pillars form the foundation of IBM and Anthropic's enterprise security framework
- Prompt injection is the number one attack type against large language models, and agents amplify the damage because they operate autonomously at speed
- Agents need nonhuman identities with unique credentials, just-in-time access, and role-based access control, just like human users
This article is a summary of Guide to Architect Secure AI Agents: Best Practices for Safety. Watch the video →
Read this article in norsk
In Brief
IBM and Anthropic have released a joint guide on how to architect secure enterprise AI agents using the Model Context Protocol (MCP, a protocol that lets agents communicate with tools and services). Jeff Crume, IBM Distinguished Engineer and CTO of IBM Security Americas, walks through the seven biggest security threats agents face, the design principles that counter them, and a layered security framework covering identity management, AI firewalls, and continuous monitoring. The core message: agents must be secured, governed, and audited from day one.
What is an AI agent?
An AI agent is a system that can perceive context, reason over goals, and take actions through tools and services. Crume describes them as "models using tools in a loop". What makes agents powerful is their ability to operate autonomously, without human intervention. You tell the agent what you want done, and it figures out the details.
But with that autonomy comes risk. Agents need to operate within explicit boundaries, provide observable traces of their decisions, and remain compliant with organizational policies (0:30).
Explained simply: Think of an AI agent like a very capable intern given the keys to the office. They can open doors, use the copier, send emails, and access files. A good intern stays within their role. A bad setup means the intern has access to everything, including the CEO's email and the bank account. Unlike a real intern, an agent works at machine speed, so mistakes happen much faster and on a much larger scale.
The fundamental shift
Crume highlights that agents represent a fundamental shift in how software works (1:22):
- Deterministic to probabilistic: Traditional software always gives the same output for the same input. Agents make dynamic decisions based on probabilities, so identical inputs can produce different outcomes
- Static to adaptive: Agents learn over time. They evolve their behavior based on interaction and human feedback (2:01)
- Code-first to evaluation-first: The focus shifts from writing implementation code to measuring outcomes and checking whether those outcomes move toward the stated goal (2:20)
Three pillars: secured, governed, audited
Crume argues that every AI agent must meet three requirements (0:43):
- Secured: The agent must not leak data or get hijacked by an attacker
- Governed: The agent must be reliable and operate within the context you expect
- Audited: The agent must comply with organizational policies and regulatory requirements
These three pillars shape every layer of the security framework discussed in the video.
The agent development lifecycle
Before diving into threats and defenses, Crume outlines a structured lifecycle for building and managing agents (2:44). This is not a one-time process. It loops continuously.
Step 1 — Plan
Define what the agent should do, what boundaries it needs, and what risks it introduces.
Step 2 — Code
Build the agent, integrating security from the start (not bolted on after).
Step 3 — Test
Verify that the agent behaves within expected boundaries. This completes the "build" phase.
Step 4 — Debug
Identify and fix issues found during testing.
Step 5 — Deploy
Move the agent to production with proper access controls and monitoring.
Step 6 — Monitor
Watch the agent in production. Detect drift, abnormal behavior, and access pattern changes. Then loop back to planning (3:14).
This follows a DevSecOps approach (Development + Security + Operations), where security is embedded throughout the entire lifecycle, not just at the end (3:27). In traditional DevOps, developers and operations teams collaborate. DevSecOps adds security to every stage.
Seven security threats
Crume walks through seven threat categories that AI agents face (4:06):
1. Expanded attack surface
Every new technology expands the attack surface (all the points where an attacker could try to get in). Agents expand it in two directions: the AI model itself, and the MCP protocol that connects agents to tools and services (4:13).
2. Excessive agency
The agent has more access and control than it actually needs (4:30). If an agent only needs to read a database, it should not have write access.
3. Privilege escalation
The agent takes it upon itself to escalate its own privileges (4:43). This means the agent expands its own access rights without authorization, potentially gaining control over systems it was never meant to touch.
4. Data leaks
The agent exposes sensitive data it has access to, either by sending it to the wrong place or by including it in responses that reach unintended audiences (4:48).
5. Prompt injection
Crume calls this the number one attack type against Large Language Models (LLMs) (4:54). Prompt injection is when someone injects commands into the system to take remote control of it. For example, an attacker could embed instructions in a document the agent reads, causing the agent to follow those instructions instead of yours.
6. Attack amplification
Because agents operate autonomously and at machine speed, a compromised agent amplifies damage far beyond what a compromised human account could do (5:10). It doesn't need breaks, doesn't hesitate, and can execute thousands of harmful actions before anyone notices.
7. Compliance drift
Over time, the agent's behavior may drift out of compliance with organizational policies and regulations (5:27). This can happen gradually as the system evolves or as external regulations change.
Explained simply: Think of these seven threats like the security risks of giving a robot free access to a building. It could wander into rooms it shouldn't enter (excessive agency), pick locks to get to higher floors (privilege escalation), accidentally carry confidential documents outside (data leaks), or follow instructions from a stranger who taped a note to the wall (prompt injection). And because it's a robot, it can do all of this faster and more thoroughly than any person could.
System controls and design principles
System controls
Crume describes three types of controls that agents need (5:39):
- Constrained operation: Keep the agent tightly controlled, operating only within expected boundaries
- Role-Based Access Control (RBAC): Assign roles to agents, just like you would with human users. Crume also interprets RBAC as "risk-based access control," where the amount of access matches the risk level of the task (6:02)
- Sandboxing: Have the agent operate in an isolated environment (a sandbox, a restricted space where the agent cannot affect systems outside it). If something goes wrong inside the sandbox, the damage stays contained (6:18)
Design principles
Nine principles guide how to put these controls into practice (6:30):
- Acceptable agency: Define what the agent is allowed to do and what it is not
- Interoperability: The agent must work with many tools, but you need to understand what those tools do and what downstream risks they create (6:47)
- Secure by design: Security built in from the start, not added later. Crume stresses that bolting security on after the fact does not work well (6:55)
- Business alignment: The agent must meet business objectives and align with organizational goals
- Risk mitigation: Minimize the new risk that the agent introduces (7:15)
- Continuous observation: Monitor the agent's reasoning and actions at all times, because it operates autonomously (7:26)
- Key Performance Indicators (KPIs): Track measurable outcomes the business defines to verify the agent performs as expected (7:42)
- Least privilege: The agent gets access to only what it needs to do its job and nothing more. The instant it no longer needs access, that access is removed (7:47)
- Human in the loop: Maintain human oversight for critical decisions (8:13)
The security framework
Crume presents a layered security framework that puts these principles into action. It covers three areas: identity management, data and model protection, and threat detection.
Layer 1: Identity and access management
Agents need identities, just like people do (8:24). Crume covers four components:
- Nonhuman identities: Agents must have their own unique credentials, separate from human users and from each other. If something goes wrong, you need to trace it back to which agent misbehaved (8:42). Just as users should not share passwords, agents should not share credentials
- Just-in-time access: Give the agent permission to do what it needs right now, then revoke that access when the task is done. This can be time-based: a few minutes, a few hours, or a day (9:06)
- Role-Based Access Control (RBAC): Assign roles to agents the same way you assign roles to employees (9:22)
- Auditing: Record everything the agent does so you can review it later and verify that policies were followed (9:35)
Layer 2: Data and model protection
This layer focuses on what sits between users and the AI system. Crume recommends an AI firewall (sometimes called a proxy or gateway) that inspects all traffic to and from the AI model (10:01).
Step 1 — Intercept incoming requests
Instead of letting users or other systems talk directly to the AI, route everything through an AI firewall. The firewall examines requests for prompt injections and other attacks before they reach the model (10:13).
Step 2 — Inspect MCP calls
When the agent talks to external tools via MCP, those calls also pass through the firewall. This catches data flowing out of the system that should not leave (10:32).
Step 3 — Apply data loss prevention
The firewall monitors outbound data for sensitive information. If an agent attempts to send customer records, API keys, or other confidential data through an MCP call, the firewall blocks it (10:46).
Explained simply: Think of an AI firewall like airport security. Every person (request) and every bag (data) must go through the scanner. The scanner checks for prohibited items (prompt injections, sensitive data leaving the system). Without the scanner, anyone can walk straight to the gate. Unlike a real airport where you pass through once, the AI firewall checks both directions: requests going in and data coming out.
Layer 3: Threat detection and monitoring
Even with strong access controls and a firewall, things can still go wrong. Crume's third layer is about catching problems after they happen (11:03):
- Real-time monitoring: Watch what agents do, what tools they call, and what effects their actions have. Set alarms for abnormal behavior: too much data being downloaded, access to unexpected systems, or configuration changes (11:15)
- Threat hunting: This is the proactive counterpart to monitoring. Instead of waiting for an alarm, you imagine hypothetical attacks and go looking for signs of them (11:40)
- Risk assessment: Evaluate what risks the agent system exposes you to. Understand what the agent can do, where its limitations are, and where it might go beyond those limits (12:05)
Crume also highlights three specific things to monitor over time (12:31):
- Configuration drift: Agents performing operations on their own system may change parameters unexpectedly
- Model drift: The AI model's behavior can shift over time, producing different outputs than it did originally
- Access pattern analysis: Track what the agent is doing and whether its actions match expected patterns
Checklist: Common pitfalls when securing agents
- Are your agents sharing credentials? Each agent needs its own unique identity. Shared credentials make it impossible to trace which agent caused a problem
- Does the agent have more access than it needs? Apply the principle of least privilege. Review what roles and permissions are assigned and remove anything unnecessary
- Is security added after the fact? Crume is clear that security bolted on later does not work well. Build it in from the planning stage
- Are you monitoring MCP calls? The connection between agents and external tools is a major attack vector. Route MCP traffic through an AI firewall
- Is there a human in the loop for critical actions? Full autonomy sounds efficient, but it removes the safety net. Keep human oversight for high-risk decisions
- Are you watching for drift? Both configuration drift and model drift can silently degrade your security posture. Set up automated checks
- Do you have just-in-time access policies? Permanent access is a risk. Give agents time-limited permissions that expire when the task is done
Remember: Perfect security does not exist. The goal is to reduce risk to an acceptable level through layers of defense. Even well-designed systems will need ongoing monitoring and adjustment.
Practical implications
For beginners exploring AI agents
Start with the three pillars: secured, governed, audited. Before deploying any agent, even a simple one, ask yourself these questions: what data can it access? What actions can it take? Who reviews what it does? Even a chatbot with access to a customer database needs access boundaries.
For teams building production agents
Follow the DevSecOps lifecycle. Integrate security into planning, not just deployment. Implement an AI firewall or gateway to inspect both inbound requests and outbound MCP calls. Assign nonhuman identities to each agent with just-in-time access rather than permanent credentials. Crume's framework gives a concrete starting point for security architecture reviews.
For organizations managing compliance
The audit pillar is critical. Regulatory requirements around AI are evolving rapidly. Build logging and tracing into your agent infrastructure from day one so you can show compliance. Monitor for configuration drift and model drift, which can silently push a compliant system out of bounds.
Test yourself
- Transfer: The IBM/Anthropic framework was designed for enterprise environments. How would you adapt these principles for a solo developer building a personal AI agent that manages their email and calendar?
- Trade-off: Crume recommends human-in-the-loop oversight for agents. At what point does human oversight become a bottleneck that removes the benefit of using an agent in the first place? How would you decide which decisions need a human?
- Architecture: Design a layered security setup for an AI agent that processes medical records. Which of Crume's seven threats would be your top three priorities, and why?
- Behavior: If organizations implement strict RBAC and least-privilege policies for agents, how might this change the way developers build and test their agents during development?
- Trade-off: Just-in-time access sounds ideal in theory. What are the practical challenges of implementing it when agents need to respond to requests in milliseconds?
Glossary
| Term | Definition |
|---|---|
| AI agent | A system that perceives context, reasons over goals, and takes actions through tools. Think of it as a smart assistant that can independently decide how to complete a task. |
| AI firewall / gateway | A proxy that inspects all traffic going to and from an AI model, checking for attacks and data leaks. Like airport security screening both arrivals and departures. |
| Attack surface | All the points where an attacker could try to break into a system. Adding new tools and connections makes the attack surface bigger. |
| Compliance drift | When a system gradually moves out of alignment with regulations or policies, often without anyone noticing until an audit. |
| Configuration drift | When system settings change unexpectedly over time, either through agent actions or environmental changes. |
| Data loss prevention (DLP) | Monitoring and blocking sensitive information from leaving a system. The digital equivalent of checking that no confidential documents leave the building. |
| DevSecOps | A development approach where security is integrated throughout the entire lifecycle (development, security, and operations), not bolted on at the end. |
| Just-in-time access | Temporary access given only when needed and revoked immediately after. Like a hotel key card that expires at checkout. |
| LLM (Large Language Model) | An AI model trained on large amounts of text that can understand and generate human language. Examples: GPT, Claude, Llama. |
| MCP (Model Context Protocol) | A protocol that allows AI agents to communicate with external tools and services in a standardized way. |
| Model drift | When an AI model's behavior changes over time, producing different outputs than it did originally. |
| Nonhuman identity | A unique set of credentials assigned to an agent, separate from any human user. Ensures each agent can be individually tracked and audited. |
| Principle of least privilege | A security concept where a system or user gets access only to what they need to do their job and nothing more. |
| Privilege escalation | When a system expands its own access rights without authorization, potentially gaining control over resources it was never meant to access. |
| Prompt injection | An attack where someone embeds hidden instructions in input to take control of an AI model's behavior. The most common attack type against LLMs. |
| RBAC (Role-Based Access Control) | A system where access permissions are assigned to roles rather than individuals. An agent assigned the "reader" role can only read, not write or delete. |
| Sandbox | An isolated environment that limits what an agent can do. If something goes wrong inside the sandbox, the damage cannot spread to other systems. |
Sources and resources
Want to go deeper? Watch the full video on YouTube →