The statistic haunts every enterprise software buyer: 95% of AI pilot programs fail. MIT published this finding, and it’s been weaponized in every procurement meeting since. But after interviewing founders who’ve built AI companies serving millions of users across healthcare, communications, and food tech, I’ve discovered something the research missed.
The companies succeeding aren’t just building better AI. They’re building fundamentally different types of companies—and the distinction determines everything.
Build Co-Pilots, Not Autonomous Agents
Davit Baghdasaryan has a perspective most AI founders lack. As CEO of Krisp, his voice isolation technology powers over a billion minutes per month of voice bot traffic. That gives him unique visibility into what’s actually working across the AI landscape.
His analysis of the 95% failure rate is specific: “I believe that 95% represents—the majority of it is represented in the AI agent space.”
“There are two types of AI products today,” Baghdasaryan explains. “There are AI agents which act autonomously, especially customer-facing ones. And then there are co-pilots or assistants that assist humans.”
The distinction seems simple, but the underlying dynamics are profound. AI agents operate independently—voice bots answering customer calls, chatbots handling support tickets, AI systems scheduling appointments. Co-pilots augment human capabilities—noise cancellation during calls, note-taking during meetings, documentation drafting during patient visits.
“We see it with our customers. The growth curve is not what I would expect with customers who are building voice bots,” Baghdasaryan notes. “Because we haven’t really figured it out because of different technological limitations. They are just not able to kill it yet.”
The technological limitations aren’t about raw capability—modern language models can often match human performance on narrow tasks. The problem is consistency and error recovery. An autonomous agent that’s right 95% of the time still fails catastrophically 5% of the time, and those failures happen in front of customers without human intervention.
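To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch (my own illustration, not drawn from Baghdasaryan's data) of how a flat 5% per-interaction error rate compounds into customer-facing exposure when no human reviews the output. It assumes independent interactions, which real traffic only approximates:

```python
# Illustrative only: compounding exposure for an unsupervised agent.
# Assumes independent interactions and a flat 95% per-call success rate.
per_call_success = 0.95

for calls in (1, 10, 50, 200):
    p_any_failure = 1 - per_call_success ** calls
    print(f"{calls:>4} calls -> {p_any_failure:.0%} chance of at least one "
          "unreviewed, customer-facing failure")
```

By ten interactions, the odds of at least one visible failure are roughly 40%; by fifty, they exceed 90%. A co-pilot with the same raw error rate never exposes those failures to a customer, because a human reviews the output first.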
Meanwhile, co-pilots follow a completely different adoption curve. “That usually doesn’t fail. These pilots don’t fail because there is not much risk associated with this. CIOs don’t take huge risks to deploy this because it’s a productivity story.”
But there’s a deeper reason: “Also it doesn’t ruin your customer conversations because there is a human behind it.”
This reveals the asymmetry. When a co-pilot makes an error, a human catches it before it reaches the customer. The error becomes a minor friction point in an internal workflow. When an agent makes an error, it happens in front of a customer, potentially damaging a relationship that took months or years to build.
Autonomous agents succeed in narrow domains with clear success metrics and low trust barriers—fraud detection, algorithmic trading, spam filtering. These work because errors have limited human consequences and success is easily quantifiable. But for customer-facing applications where judgment matters and mistakes are visible, the error tolerance threshold is fundamentally different.
Baghdasaryan is direct: “I do believe that we haven’t crossed the chasm yet for AI agents and the fastest growing products in AI are the ones that are co-pilots.”
Solutions that make humans demonstrably better get adopted faster than solutions that replace humans entirely—not because of bias against automation, but because the error tolerance requirements are asymmetric.
Go Deep in One Vertical—Because Depth Creates Compounding Network Effects
Mike Ng built Ambience Healthcare to tackle clinical documentation—one of the primary drivers of physician burnout. But his approach reveals something crucial about how competitive advantages form in AI.
“Every specialty has its own job to be done, medicine being delivered, workflow and how they get paid,” Ng explains. “The reality is that when you’re building Ambience, you’re not building one product, you’re building more like 100 different products.”
Ng’s team serves the Cleveland Clinic—an institution housing over 100 different specialties. “In many cases, these are the hardest of the hardest patient cases that other institutions have tried to solve.”
The specificity required is extreme. Take oncology: “70% of visits are follow-ups. And in a follow-up, you actually spend a lot more time looking through past charts, labs, scavenging through past EHR information.”
A horizontal AI solution would just transcribe conversations. But Ambience recognized that “by that time you could have actually written a big part of the note before the patient even comes in.” This required building patient summary tools that digest historical records before appointments begin.
“Unless you actually are there as a part of that workflow to help the clinician, you’ll have a clinician-generated note and you have an ambient listening note. And then you have this challenge of how do you merge the two together,” Ng notes. “Ambient listening in and of itself is an insufficient product.”
Here’s where it gets strategically interesting. Most enterprise software gets harder to scale across verticals—each new industry adds complexity without creating leverage. But Ambience discovered something different: vertical depth creates compounding advantages.
“Each one of these pieces, from Chart Chat to Patient Summary to Ambient listening, works together under a single language model with a single infrastructure that works better when they’re all together.”
The mechanism is subtle but powerful. Each specialty teaches the underlying model something about medical reasoning, documentation patterns, and clinical workflows. Oncology requires understanding treatment progressions and response patterns. Cardiology requires different temporal reasoning about chronic conditions. Pediatrics requires understanding developmental milestones and family communication dynamics.
These don’t fragment the model—they enrich it. The 50th specialty is easier to serve than the 10th because the infrastructure has absorbed more variation in medical reasoning. The depth compounds because the model isn’t learning 100 separate tasks—it’s learning the deep structure of how medicine works across contexts.
While competitors focus on feature parity in one specialty, Ambience builds infrastructure that makes entering the next specialty easier than the last. The depth becomes the moat, not despite its complexity but because of it. A competitor trying to catch up can’t just match features—they need to replicate the entire compounding learning curve.
This only works if you’re solving a problem with deep structure—patterns that transfer across contexts within a domain. Not every market has this property. But when it does, vertical depth creates network effects that horizontal breadth never can.
Make Trust Architectural, Not Procedural
Krish Ramineni built Fireflies into one of the first AI note-takers to achieve mass adoption. His company now serves millions of users with a distributed team of 120 people globally. But early resistance was significant.
“It was definitely a bigger challenge in the beginning when the Fireflies brand was not as well known and they would be like, ‘What’s a note-taker? What’s this bot on a meeting? Is it spying?'”
Most companies respond to trust concerns with policies—privacy statements, compliance certifications, terms of service. Ramineni’s team took a different approach: they made trust legible through architecture.
“Having the note-taker there creates a lot more transparency because there are some tools that are recording silently in the background,” he explains. “I think people are a lot more pissed when they don’t know that there’s something there that is capturing those meetings. At least you can see the participant and kick it out.”
This isn’t just good product design—it’s understanding how trust actually forms in professional contexts. Procedural trust requires reading policies and believing promises. Architectural trust requires only observing how the system works.
The distinction scales differently. Procedural trust breaks down as soon as someone questions the policy or the company’s incentives. “How do I know they’re really deleting my data?” Architectural trust is self-evident: you can see the bot, you can kick it out, the mechanism is transparent.
Ramineni also built institutional trust through similar architectural decisions. “We don’t train on customer data. By default, you own your data, you can have your data wiped, all of those sorts of things. The other part is we offer private storage for the enterprise tier. You can have the data stored on your own servers or in a private storage container.”
Each of these isn’t just a feature—it’s a trust mechanism that doesn’t require believing company promises. Private storage means the data physically never touches Fireflies’ servers. Opt-in training means the system can’t learn from your data unless you explicitly enable it. These are architecturally guaranteed, not procedurally promised.
The transformation in user perception validates this approach. “I’ve had customers that have told me, ‘I was anti-Fireflies note-taker. And now I cannot go to a meeting without having it. It gives me anxiety knowing that I’m in a meeting without a note-taker.'”
This psychological flip—from resistance to dependence—happens because architectural trust enables actual behavior change. Once users trust the system enough to try it, the utility becomes undeniable. But they never get to experience the utility if trust requires reading privacy policies.
The companies that scale fastest build trust into their architecture, not their terms of service. They recognize that in enterprise contexts, skepticism is the default. The product that wins isn’t the one with the best promises—it’s the one where trust is observable in how the system fundamentally works.
Hire for Speed, Not Skills
Alon Chen runs Tastewise, an AI platform for the food and beverage industry serving Fortune 500 brands with 110 people. His company pivoted from a data solution to a workflow solution—the kind of transformation that destroys most teams.
“The title and the new moat, if you like, for success is speed,” Chen says. “For a startup especially, speed is your only moat today.”
But his definition of speed subverts conventional wisdom. “What keeps me up at night is thinking and sensing and analyzing and being reflective on different departments and seeing if we are actually moving fast.”
He’s not measuring velocity of shipping features. He’s measuring organizational restlessness—the capacity to identify stagnation and mobilize change before it calcifies.
“Different people have different skills. And so the people that you build the company with are not necessarily the ones that can stick around and help you grow the company.”
This sounds harsh until you recognize the underlying dynamic. In stable industries, companies optimize for accumulated expertise. In rapidly evolving industries, expertise has a half-life. What you learned about AI two years ago is partially obsolete. What you learned about go-to-market in a pre-AI world doesn’t fully apply now.
Chen operationalizes this in hiring. “We made sure that the people we’re hiring are extremely excited about change. Because today it’s Gen AI, tomorrow is going to be something else. Yesterday was just machine learning. Today is Gen AI and then the world is moving into an agentic world. And I want my team to be in the mindset of change.”
But there’s a tension Chen navigates that’s easy to miss: restlessness without memory becomes chaos. You need people who can adapt quickly, but you also need institutional memory about what’s been tried and why it failed.
During Tastewise’s pivot, Chen didn’t just replace the entire team. Some people adapted; others didn’t. The ones who stayed weren’t just the most skilled—they were the ones who could hold both the old context and the new direction simultaneously. They could say “We tried something similar two years ago and here’s why it failed” while also saying “But here’s what’s different now that might make it work.”
This is the sophisticated version of hiring for restlessness: You want people who are energized by change but disciplined about learning from history.
The principle extends beyond hiring to organizational structure. During the pivot, Chen deliberately disrupted team structures, changed KPIs, and reassigned people. But he also maintained certain stable elements—customer relationships, core technical infrastructure, cultural values. Complete disruption creates chaos. Strategic disruption creates adaptability.
“The change mindset and then the change management and being in a smaller company versus a bigger company with the ability to disrupt” became the core capability.
The companies that survive multiple technological shifts don’t just tolerate change—they metabolize it. They build organizations where adaptation is the baseline expectation, but where institutional learning compounds rather than resets with each change.
Obsess Over Utilization Metrics
Mike Ng’s team at Ambience obsesses over granular usage metrics in ways that seem excessive until you understand why they matter differently for AI than for traditional software.
“We create shared dashboards so that our health system partners can actually see what our weekly active users are, monthly active users and utilization rates—not just at the specialty level, but at the visit type level.”
Traditional software has binary adoption: either people use it or they don’t. AI has continuous adoption: people use it, but with varying levels of trust and integration into their actual workflow.
This visibility reveals problems aggregated data hides. “A while ago we had this challenge where in these pediatric well visits we had a low utilization rate.”
The team investigated and discovered something unexpected: “A lot of times the clinicians were providing care, but they weren’t documenting it and they weren’t getting credit for it.”
The specific issue: “If a mother brings in their son for a well child visit, they may often say, ‘Hey, can you also look at that rash?’ They would take care of that acute complaint but not write a separate piece of documentation because they didn’t have time. But in reality they get credit for what we call a modifier 25.”
This is crucial: the AI was technically working—it was transcribing conversations accurately. But it wasn’t creating value in the way that mattered to clinicians. AI failures often aren’t technical failures. They’re value alignment failures.
Traditional software breaks obviously—features don’t work, systems crash, errors appear. AI degrades subtly—it works technically but doesn’t fit the workflow, or it solves the wrong aspect of the problem, or it creates new friction elsewhere in the system.
Granular utilization metrics catch these problems early. If pediatric well visits have 40% utilization while oncology follow-ups have 85% utilization, something’s wrong—not with the technology necessarily, but with how it fits the workflow.
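As a rough sketch of what visit-type-level tracking can look like in practice, here is a minimal example; the event fields and numbers are hypothetical, not Ambience's actual schema or figures:

```python
# Hypothetical sketch: utilization by specialty and visit type from a usage log.
from collections import defaultdict

# One record per completed visit: (specialty, visit_type, ai_note_used)
visits = [
    ("pediatrics", "well visit", False),
    ("pediatrics", "well visit", True),
    ("pediatrics", "well visit", False),
    ("oncology", "follow-up", True),
    ("oncology", "follow-up", True),
    ("oncology", "follow-up", False),
]

totals = defaultdict(lambda: [0, 0])  # (specialty, visit_type) -> [used, total]
for specialty, visit_type, used in visits:
    totals[(specialty, visit_type)][0] += int(used)
    totals[(specialty, visit_type)][1] += 1

for (specialty, visit_type), (used, total) in sorted(totals.items()):
    print(f"{specialty}/{visit_type}: {used / total:.0%} utilization ({used}/{total})")
```

The point is not the bookkeeping but the grain: aggregating at the product or even specialty level would have hidden the pediatric well-visit gap entirely.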
Krish Ramineni at Fireflies tracks similar behavioral signals. The shift from “I don’t want this bot here” to “I feel anxious without it” doesn’t happen by accident, and it doesn’t show up in traditional adoption metrics. You have to measure not just whether people use your product, but how they feel about using it—and more importantly, how they feel when they can’t.
This level of granularity matters more in AI because the gap between technical capability and actual value is wider than in traditional software. AI can work perfectly from a technical standpoint while creating zero value, or worse, adding friction that offsets its benefits.
The companies that scale AI successfully don’t just track adoption—they track value creation at the most granular level possible. They identify where the AI is genuinely transforming workflows versus where it’s just adding a layer of complexity. And they kill or redesign features based on utilization patterns, not just user feedback.
What Actually Matters
The patterns across these companies reveal a more sophisticated playbook than most AI startup advice suggests:
Build co-pilots first because error tolerance is asymmetric. Autonomous agents fail not because the technology isn’t ready, but because customer-facing errors are exponentially more damaging than internal workflow friction. Start where humans can catch mistakes.
Go deep in verticals that have transferable structure. Not all domains compound—some just fragment. But when deep patterns transfer across contexts within a domain (like medical reasoning across specialties), vertical depth creates network effects that horizontal breadth never can.
Build trust architecturally, not procedurally. Make trust observable through how the system works (visible meeting participants, opt-in training, private storage) rather than what the privacy policy promises. Architectural trust scales; procedural trust doesn’t.
Hire for restlessness, but preserve institutional memory. In fast-moving markets, adaptive capacity matters more than existing skills. But balance disruption with continuity—you need people who learn from history while embracing change.
Track utilization granularly because AI failures are silent. AI can work technically while creating zero value. Measure adoption at the workflow level, not just the product level. Identify where you’re transforming work versus where you’re just adding complexity.
The AI revolution isn’t about having the most sophisticated models. It’s about understanding how organizations adopt new capabilities when trust is uncertain, stakes are high, and the gap between technical capability and actual value is wider than ever before.
The 5% that succeed recognize that operational discipline, psychological insight, and organizational adaptability—not just technical sophistication—create defensible businesses. They’re not just building better AI. They’re building companies structured to survive the continuous disruption that AI itself creates.
I write about AI, performance management and the future of work for Forbes. I’m the founder of Mandala, an AI coaching platform.
