We Analyzed 100 AI Agent Failures

Clyro Content Team

Research status: This article picks out 100 incidents from a dataset of 591 documented AI agent failures that happened between 2023 and 2026. All of the percentages come from the full 591-incident dataset; the 100 curated incidents are examples that tell the story. The full dataset, with source links, failure-mode tags, sector breakdowns, and cost estimates, will be part of the complete report, which we will publish shortly.


Quick Answer

What: An analysis of more than 500 documented AI agent failures from 2023 to 2026 shows that 88% of these failures can be traced back to infrastructure gaps, not model quality. The five failure modes, in order of how common they are: Context Blindness (31.6%), Rogue Actions (30.3%), Silent Degradation (24.9%), Memory Corruption (8.1%), and Runaway Execution (5.1%).

Why it matters: The industry is working on better models, but most failures in AI deployments can be avoided with runtime governance, step limits, permission checks, cost ceilings, session isolation, and often a human-in-the-loop. Most fixes ship in under a week.

Key finding: The infrastructure-gap pattern described above held consistently across sectors, incident types, and time periods in our dataset.

We looked at more than 500 AI agent failures and picked the 100 that best tell the story. The fixes aren't very exciting, and most teams could make them in a week.


The Crisis That No One Can Measure

There is probably an agent in your stack right now doing something you wouldn't like. It could be quoting a return policy that you changed last month. It could be stuck in a retry loop that has been going on since Tuesday. You won't know until a customer complains or the API bill comes.

We kept track of more than 500 events like this. The numbers aren't good. S&P Global reports AI project failure rates doubled from 17% in 2024 to 42% in 2025 (projects dropped by businesses surveyed). Gartner expects ~40% of agentic AI projects specifically to be abandoned by 2027 (a forward-looking estimate driven by rising costs and unclear ROI). RAND estimates 80% of AI projects overall fail to reach deployment (across all AI, not just agents), and MIT's NANDA lab puts it at 95% for GenAI pilots that show no material P&L impact (a narrower bar: not failure to launch, but failure to move the bottom line). The numbers measure different things, but the direction is consistent. Most of these projects die without making a sound.

Those numbers are always in pitch decks, but they almost never say why agents fail. The main topic of the conversation is model quality, as if the only thing that matters is whether an LLM like GPT-5 will be smart enough. We thought the same thing until we looked at the numbers.

Over several months we put together more than 500 documented failures spanning 2023 to 2026, drawn from court filings, published incident reports, GitHub issues, LinkedIn, and developer forums. Each incident had costs that could be verified. We sorted them all by root cause and then picked 100 based on their source quality, impact, narrative clarity, and uniqueness. The model itself usually worked well. The failure started with the infrastructure around it.


The Surprising Distribution

The Breakdown

Two-level infographic: the top bar shows that 88% of agent failures are due to infrastructure gaps and about 10% are due to model quality. The bottom section breaks down all 591 incidents into 5 failure modes ranked by how common they are, adding up to 100%.
| Failure mode | Incidents in 591-incident dataset (2023–2026) | Share | Root cause type |
|---|---|---|---|
| Context Blindness | 187 | 31.6% | Mixed (model + infrastructure) |
| Rogue Actions | 179 | 30.3% | Infrastructure |
| Silent Degradation | 147 | 24.9% | Infrastructure |
| Memory Corruption | 48 | 8.1% | Infrastructure |
| Runaway Execution | 30 | 5.1% | Infrastructure |

We thought hallucination would win. The data said something different. Rogue Actions at 30.3% often involved models that correctly understood instructions and did what they were told without checking for permission first. Silent Degradation at 24.9% is the most worrying because the outputs look right while the accuracy goes down. Runaway Execution at 5.1% is rare, but it costs the most per incident.

No cost ceilings. No permission boundaries. No freshness checks. No session isolation. The model worked well enough, but the overall agent behavior was flawed. There was nothing around it.


The Three Most Common Types of Failure

Context Blindness: The Most Common Cause at 31.6%

A customer who had just lost a loved one was told by Air Canada's chatbot that he could apply for bereavement fares up to 90 days after the death. The airline's real policy was the opposite — no changes after the fact, period. He bought tickets at full price based on what the chatbot said, but he didn't get the discount. He took the case to a civil resolution tribunal and won.

Air Canada said that the chatbot was a "separate legal entity" that was responsible for what it said. The tribunal flatly turned that down. Companies are responsible for what their agents say. The direct cost in money was low. The legal precedent was huge.

The pattern: without hesitation, the agent acts on context that is incomplete, out of date, or made up. In seven other incidents, agents quoted discontinued prices, referenced retired product SKUs, or applied expired promotional terms, each time presenting the outdated information as current fact. Three additional incidents involved session bleed: one user's order history, payment details, or conversation context surfacing in another user's session. In every case, the model did what it was supposed to do with the context it was given. The problems were in the pipeline, not the model.

What stood out to us was how sure these agents were when they acted on wrong information. No hedging, no signal of uncertainty. The agent treats old data the same way it treats new data. You only find out about the problem when a customer, a regulator, or a court tells you about it.

The fix: validate context freshness before the agent acts on it. Scope data access to the correct user session.
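As an illustration of that fix, here is a minimal Python sketch of a freshness and session-scope check that runs before any retrieved context reaches the agent. The names `ContextRecord`, `ContextRejected`, and `validate_context`, and the idea of a per-record `max_age`, are assumptions made for this example; they do not come from any incident report or from a specific library.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class ContextRecord:
    """One piece of retrieved context (hypothetical shape)."""
    owner_session_id: Optional[str]  # None means shared, non-user data
    fetched_at: datetime             # when the record was last refreshed (UTC)
    text: str


class ContextRejected(Exception):
    """Raised when context is stale or belongs to another session."""


def validate_context(record: ContextRecord, session_id: str,
                     max_age: timedelta,
                     now: Optional[datetime] = None) -> str:
    """Return the record's text only if it is in scope and fresh."""
    now = now or datetime.now(timezone.utc)
    # Session scope: user-owned data may only surface in its own session.
    if record.owner_session_id is not None and record.owner_session_id != session_id:
        raise ContextRejected("record belongs to a different session")
    # Freshness: refuse to present old data as current fact.
    if now - record.fetched_at > max_age:
        raise ContextRejected("record is older than the freshness limit")
    return record.text
```

A rejected record can then trigger a re-fetch or a handoff to a human instead of a confident wrong answer. The right `max_age` depends on how often the underlying policy or price data changes.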

Rogue Actions: 30.3% of Incidents

Rogue Actions made up only 18% of our original 100-incident sample. The full dataset shows it's essentially tied with Context Blindness.

The McKinsey "Lilli" breach (MI-0010) is the most shocking one. An autonomous AI red-team agent found 22 API endpoints lacking proper authentication, exploited a JSON-key SQL injection that standard scanners missed, and within two hours gained full read/write access to McKinsey's systems — reportedly exposing 46.5 million plaintext chat messages and 728,000 confidential files. [Source: The Register, March 2026]

In more than one case, agents ran destructive database operations using write access they didn't need. In the Amazon Q Developer incident (MI-0063), a supply chain attack added malicious code to the official Amazon Q Developer extension for Microsoft VS Code, which had over 964,000 installs. The code was flagged before it was run. Another pattern was scope creep: agents used tools they weren't supposed to use in order to finish tasks they believed they had to complete, expanding beyond their designed function. The $3.2M AI procurement fraud (MI-0008) shows how much that can cost.

The fix isn't new in a technical sense. Permission checks on every tool call, least-privilege defaults, confirmation gates before destructive operations. This is solved engineering. The agent ecosystem just hasn't adopted it.
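A minimal sketch of those three controls in plain Python might look like the following. `ToolGate`, `PermissionDenied`, and `deny_all` are hypothetical names invented for this example, and the confirmation callback stands in for whatever human-approval step a team actually uses.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Set


class PermissionDenied(Exception):
    """Raised when a tool call is outside the agent's allowed scope."""


def deny_all(name: str, args: Dict[str, Any]) -> bool:
    """Default confirmation gate: never approve (least privilege)."""
    return False


@dataclass
class ToolGate:
    """Checks every tool call against an allowlist (hypothetical helper)."""
    allowed: Set[str]                                     # tools this agent may call
    destructive: Set[str] = field(default_factory=set)    # tools needing sign-off
    confirm: Callable[[str, Dict[str, Any]], bool] = deny_all

    def call(self, name: str, tool: Callable[..., Any], /, **args: Any) -> Any:
        # Permission check on every call, not just at agent startup.
        if name not in self.allowed:
            raise PermissionDenied(f"tool {name!r} is not on the allowlist")
        # Confirmation gate before destructive operations.
        if name in self.destructive and not self.confirm(name, args):
            raise PermissionDenied(f"tool {name!r} needs confirmation")
        return tool(**args)
```

Routing every tool invocation through one gate like this means a new tool is unavailable until someone deliberately adds it to the allowlist.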

Silent Degradation: 24.9% of Incidents

The scariest type, because you can't see it. The outputs keep looking plausible, so traditional metrics such as latency don't register the problem.

Klarna (MI-0137) replaced 700 customer service workers with AI, announced success, then quietly reversed course when customer satisfaction dropped. Pinterest's moderation system (MI-0138) flagged quilting magazines and Minecraft builds as pornography — drift affecting millions (Pinterest attributed the errors to an "internal error," not AI moderation). In healthcare, AI algorithms (MI-0460) encode existing racial disparities — Black patients face 28.7% higher mortality rates, and AI systems scale the gap rather than correcting it.

The clearest example is GPT-4: its math accuracy degraded from 97.6% to 2.4% over three months, and nobody knew until researchers tested it. [Source: Stanford/UC Berkeley, July 2023] The fix: quality monitoring, which means tracking accuracy over time, setting baselines, and alerting on drift. Even a weekly accuracy check against a golden dataset catches this erosion.
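A golden-dataset check of the kind mentioned above can be very small. The sketch below assumes the agent can be called as a plain function from question to answer and that exact string match is a good enough grader. Both are simplifications, and the function names and the `max_drop` threshold are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def accuracy(agent: Callable[[str], str],
             golden: Sequence[Tuple[str, str]]) -> float:
    """Fraction of golden questions the agent answers exactly right."""
    if not golden:
        raise ValueError("golden dataset is empty")
    correct = sum(1 for question, expected in golden
                  if agent(question).strip() == expected)
    return correct / len(golden)


def drifted(current: float, baseline: float, max_drop: float = 0.05) -> bool:
    """True when accuracy fell more than max_drop below the baseline."""
    return (baseline - current) > max_drop
```

Run on a schedule, `accuracy` gives a number to chart over time and `drifted` gives a condition to alert on. A drop from a 0.976 baseline to 0.024 would trip the alert on the first run after the change.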


OWASP Agentic Top 10 Validation

The OWASP Agentic Applications Top 10, published December 2025 by 100+ experts, lines up directly with our five failure modes:

| Clyro Failure Mode | OWASP ASI Code | OWASP Category |
|---|---|---|
| Context Blindness (31.6%) | ASI06 | Memory & Context Poisoning |
| Rogue Actions (30.3%) | ASI02 | Tool Misuse & Exploitation |
| Silent Degradation (24.9%) | ASI09 | Human-Agent Trust Exploitation |
| Memory Corruption (8.1%) | ASI06 + ASI03 | Memory Poisoning + Identity & Privilege Abuse |
| Runaway Execution (5.1%) | ASI08 | Cascading Failures |

Our taxonomy came before the OWASP framework. The alignment validates both: we found the same failure categories in incident data that OWASP found through threat modeling. [Source: OWASP Agentic Applications Top 10]


The 10 Most Expensive Incidents

Table showing the 10 most expensive AI agent incidents with corrected data
| Rank | Incident (MI#) | Year | Primary mode | Impact |
|---|---|---|---|---|
| 1 | McKinsey "Lilli" breach (MI-0010) | 2026 | Rogue Actions | 46.5M chat messages exposed, 728K confidential files |
| 2 | Stability AI cloud bill crisis (MI-0550)⁴ | 2023 | Runaway Execution | $1M underpaid, $7M owed to AWS |
| 3 | CrowdStrike faulty update (MI-0401) | 2024 | Runaway Execution | $5B global outage, ~5,000 flights cancelled on day one¹ |
| 4 | $3.2M AI procurement fraud (MI-0008) | 2026 | Rogue Actions | $3.2M in fraudulent procurement |
| 5 | PGI Global AI trading Ponzi (MI-0139)³ | 2025 | Silent Degradation | $198M, 90,000 victims |
| 6 | Air Canada chatbot tribunal ruling (MI-0171) | 2024 | Context Blindness | Legal precedent: companies liable for agent statements |
| 7 | Klarna AI reversal (MI-0137) | 2025 | Silent Degradation | Replaced 700 workers, dropped customer satisfaction |
| 8 | $47,000 API cost spiral (MI-0112) | 2025 | Runaway Execution | $47K in API charges (11-day loop) |
| 9 | Meta AI mass account deletions (MI-0140) | 2025 | Silent Degradation | 10M accounts removed in automated spam sweep (Meta stated intentional, affected users disputed) |
| 10 | Nvidia $593B market cap wipeout (MI-0141) | 2025 | Market impact² | Single-day $593B market cap loss |

¹ CrowdStrike is a borderline case — the faulty Falcon sensor update was a normal software QA failure, not a pure AI agent failure. We include it because the rapid-release deployment pipeline skipped standard testing, which shows what happens when automated systems don't have enough safety measures. The commonly cited "25,000 flights cancelled" figure likely represents cumulative disruptions; major outlets reported ~5,000 on day one.

² The Nvidia market cap loss was not caused by an AI agent failing, but by competitive repricing after DeepSeek's launch. We include it as a market-impact incident to show the economic stakes.

³ PGI Global marketed itself as "AI-powered trading" but investigations found no evidence of actual AI systems — it was a pure Ponzi scheme using AI as a marketing wrapper. We include it because the $198M loss was enabled by investor trust in "AI agent" claims, which shows how the agent hype cycle creates fraud surface area.

⁴ Stability AI's cloud bill crisis was primarily a business and financial management failure rather than a pure agent runaway. We classify it as Runaway Execution because the uncontrolled compute scaling mirrors the pattern — costs spiraling without bounds — even though the root cause was organizational rather than a single agent loop.

Runaway Execution is the rarest mode (5.1%) but produces the highest per-incident financial damage. Rogue Actions creates the most structural damage — breaches, unauthorized transactions, supply chain compromises. Context Blindness poses the highest legal risk. Silent Degradation is what worries us most: failures operating at scale without triggering alerts.


What the Data Says About Prevention

Matrix showing which prevention controls work for each of the five agent failure modes

The Layer That Isn't There

Every time you make a request to a web app, it checks your identity. Databases keep transactions separate. Container orchestrators limit how many resources can be used. Most agent architectures, by contrast, have a prompt, a retry counter, and a log file. That is the entire governance stack for most agents. It's not even close to enough.

The gap is a straightforward engineering problem with known solutions: step limits, cost ceilings, tool scoping, context validation, accuracy tracking, session isolation, and external dependency management. None of this requires Clyro. Most of it can be put into action in a week.

| Priority | Action | Addresses | Implementation effort |
|---|---|---|---|
| 1 | Audit tool permissions (least-privilege, confirmation gates) | 30.3% of failures (Rogue Actions) | 1-2 days |
| 2 | Validate context freshness and scope data per session | Majority of 31.6% Context Blindness | 2-3 days |
| 3 | Build quality monitoring (accuracy baselines, drift alerts) | 24.9% of failures (Silent Degradation) | 3-5 days |
| 4 | Isolate session state between users | 8.1% of failures (Memory Corruption) | 2-3 days |
| 5 | Add execution bounds (step limits, cost ceilings, loop detection) | 5.1% of failures (Runaway Execution), rare but highest per-incident cost | Hours (three config values) |

The most powerful thing you can do is audit your tool permissions. The fastest is adding execution bounds: if your agent can keep retrying without a limit, a cost spiral like the $47,000 incident will happen to you. It's only a matter of time.

import clyro

wrapped = clyro.wrap(your_agent, config=clyro.ClyroConfig(
    agent_name="data-pipeline-agent",
    controls=clyro.ExecutionControls(
        max_steps=100,              # Stop after 100 tool calls
        max_cost_usd=10.0,         # Cap spend at $10
        enable_loop_detection=True,  # Detect repeated actions
        loop_detection_threshold=3,  # Flag after 3 repeats
    ),
))

Those four controls — applied with a single clyro.wrap() call — would have prevented the $47,000 incident entirely.
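The same bounds can also be sketched without any vendor library. The example below is an assumption-laden illustration, not Clyro's implementation: `ExecutionBounds` and `BoundExceeded` are hypothetical names, the loop check only counts consecutive identical actions, and the caller is expected to supply a cost estimate per call.

```python
from dataclasses import dataclass
from typing import Optional


class BoundExceeded(Exception):
    """Raised when the agent hits a step, cost, or loop limit."""


@dataclass
class ExecutionBounds:
    max_steps: int = 100          # stop after this many tool calls
    max_cost_usd: float = 10.0    # stop once estimated spend passes this
    loop_threshold: int = 3       # stop on this many identical calls in a row
    steps: int = 0
    cost_usd: float = 0.0
    last_action: Optional[str] = None
    repeats: int = 0

    def record(self, action: str, cost_usd: float) -> None:
        """Call once per tool call, before executing it."""
        self.steps += 1
        self.cost_usd += cost_usd
        # Count how many times in a row the same action has been requested.
        self.repeats = self.repeats + 1 if action == self.last_action else 1
        self.last_action = action
        if self.steps > self.max_steps:
            raise BoundExceeded(f"step limit of {self.max_steps} exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise BoundExceeded(f"cost ceiling of ${self.max_cost_usd} exceeded")
        if self.repeats >= self.loop_threshold:
            raise BoundExceeded(f"{action!r} repeated {self.repeats} times in a row")
```

Calling `record` at the top of the agent loop turns a silent multi-day retry loop into an exception that stops the run on the third identical call, well before the bill arrives.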

We need to be honest about what we can and can't do. Runtime governance handles most of the failures in our dataset. It doesn't handle all of them. A meaningful minority are genuinely rooted in model quality. We don't think fixing infrastructure solves all of the industry's problems — we're saying it fixes most of the ones the industry is ignoring while it argues over which foundation model is best.

Building this full stack from scratch takes two to four engineering-weeks. If you'd rather not build it yourself, Clyro's Agent Kernel wraps your existing agent with execution bounds, permission boundaries, and quality monitoring. The Prevention Stack provides these controls as defaults.



How We Did It

| Source tier | Why we trusted it | Approximate share* |
|---|---|---|
| Court filings | Legally verified, hard to dispute | ~4% |
| Published incident reports / news | Named companies, editorial review | ~38% |
| GitHub issues | Reproducible technical evidence | ~27% |
| Community forums (Reddit, Discord, Hacker News) | Pattern confirmation, not standalone | ~31% |

*Approximate proportions based on our initial 100-incident sample. Distribution across the full 591-incident dataset is similar but not identical.

Every incident required: an AI agent with real autonomy (not an API call or simple bot), at least one verifiable source, real harm (financial loss, data destruction, legal consequences, or user injury), and a plausible root cause. We left out hypothetical situations, chatbot mistakes that didn't have any effects, and events that were clearly caused by human deployment errors.

This is not a random sample. The public record is biased toward serious events, so the severity distribution in our dataset is skewed toward high-stakes scenarios and dramatic edge cases, and the real failure rate is probably higher than what gets reported. That skew helps us figure out what goes horribly wrong, but it doesn't help us guess how often regular agents give slightly wrong answers.

We tagged each incident against a five-mode failure taxonomy aligned with our companion article on agent failure modes:

| Failure mode | What it covers | Example |
|---|---|---|
| Context Blindness | Agent works with wrong, missing, or fabricated context | Chatbot invents a refund policy and acts on it |
| Memory Corruption | Agent state is corrupted across sessions or users | Customer A sees Customer B's order history |
| Rogue Actions | Agent takes unauthorized or destructive steps | Agent runs DROP DATABASE on production |
| Runaway Execution | Unbounded loops, retries, or cost spirals | $47,000 in API charges over eleven days |
| Silent Degradation | Quality erodes without triggering alerts | GPT-4 math accuracy dropped from 97.6% to 2.4% over three months |

Each incident got one primary tag. Everything in this analysis uses primary tags only.


Methodology Notes

An infographic showing a four-tier method for gathering and checking 100 curated agent failure incidents

The full dataset contains 591 documented incidents (2023–2026), assembled through four research loops. This article picks 100 incidents based on their source quality, impact, narrative clarity, recency, and uniqueness using a hybrid sampling method (80 proportional + 20 flagship). All percentages come from the full 591-incident dataset. The infrastructure-gap finding (88%) is derived from keyword-based root cause classification, not per-incident expert review — treat it as a strong directional signal rather than a precise measurement.

Complete Research Report

We will publish the complete report shortly.


Frequently Asked Questions

What are the most common AI agent failure modes?

Five modes, ranked by how often they happened in 591 incidents: Context Blindness (31.6%), Rogue Actions (30.3%), Silent Degradation (24.9%), Memory Corruption (8.1%), and Runaway Execution (5.1%). The vast majority trace to infrastructure gaps — missing guardrails, permissions, and monitoring — not model quality. See the breakdown above.

Why do AI agents fail in production?

As noted in our key finding above, the overwhelming majority of classifiable failures came from the systems around the model — missing permission checks, no execution bounds, no context validation, no quality monitoring. Only ~10% trace to model capability. The model usually worked correctly; nobody built the governance layer around it. The OWASP Agentic Top 10 independently validates this finding.

What is the cost of an unmonitored AI agent?

Ranges from a $47K API spiral to a $5B outage (CrowdStrike) to a 46.5M-record breach (McKinsey "Lilli"). Context Blindness makes companies legally responsible for what their agents say, as shown by the Air Canada case. Silent Degradation could be the most expensive in the long run. See the 10 most expensive incidents above.

How do I prevent AI agent failures?

Five actions in priority order: audit tool permissions (30.3%), validate context freshness (31.6%), build quality monitoring with drift alerts (24.9%), isolate session state (8.1%), add execution bounds (5.1%). Most teams can ship the first three within a week. See the full action table above.

Should I build agent guardrails or buy them?

Basic execution bounds ship in a few hours. Permission auditing and context validation take 1-3 days each. The build-vs-buy question starts beyond that: accuracy tracking, drift detection, cross-session isolation, and real-time intervention across multiple agents need purpose-built infrastructure. Building from scratch takes 2-4 engineering-weeks plus ongoing maintenance.

What data sources were used in this agent failure analysis?

Four tiers: court filings (~4%), published incident reports (~38%), GitHub issues (~27%), community forums (~31%). Every incident required an autonomous AI agent, at least one verifiable source, real harm, and a plausible root cause. We left out hypothetical situations and mistakes made by people during deployment. See methodology above.

What is the difference between agent hallucination and agent failure?

Hallucination — making up information — is one type of agent failure, but only ~10% of our dataset traced to model limitations. The rest of the problems happen when the model works correctly but gets bad input or doesn't have any guardrails: stale data, unbounded loops, broad permissions, gradual quality erosion, or state leaking between sessions. Different root causes, different fixes — infrastructure, not prompt engineering.


Sources

[1] S&P Global Market Intelligence — "Voice of the Enterprise: AI & Machine Learning, Use Cases 2025": 42% of businesses dropped most AI projects in 2025, up from 17% in 2024. 1,006 IT and LOB professionals were surveyed between October and November 2024. https://www.spglobal.com/market-intelligence/en/news-insights/research/ai-experiences-rapid-adoption-but-with-mixed-outcomes-highlights-from-vote-ai-machine-learning

[2] Gartner — "Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027" (June 25, 2025). Reasons: rising costs, unclear business value, and poor risk management. https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027

[3] RAND Corporation (>80% of AI projects fail, RR-A2680-1, August 2024) and MIT NANDA Initiative (95% of GenAI pilots show no material P&L impact, August 2025). https://www.rand.org/pubs/research_reports/RRA2680-1.html
