There is a simple question that cuts through most of the noise around enterprise AI. If these tools were delivering the broad, material gains being claimed -- 50 per cent time savings, triple-digit ROI and the transformation of knowledge work -- we would expect those gains to show up clearly and repeatedly in earnings calls and analyst reports. CFOs are required to report material operational improvements. Analysts ask about them directly.

Across the large enterprises most often cited in AI vendor materials and industry reporting, AI now features prominently in earnings call discussions. What is harder to find is clear, quantified attribution of revenue or profit improvement directly tied to AI assistants in those same disclosures.

The most publicised example is a financial technology company that claimed tens of millions of dollars in annualised cost savings from AI-driven customer service. Two years later, the same chief executive said the strategy had gone too far, that cost had become too dominant an evaluation factor, and that service quality had suffered. The company began rehiring the human agents it had replaced. It is one of the most publicly discussed cases of AI delivering measurable financial impact at enterprise scale, and even that case complicates rather than confirms the vendor narrative.

The signal is far weaker than the narrative. But that does not mean the technology does not work. It means the evidence for where it works is more specific, and more conditional, than most of the conversation suggests.

What the evidence actually shows

The clearest signal comes from a study of 5,172 customer service agents tracked over 14 months, first circulated in 2023 and published in the Quarterly Journal of Economics in 2025. Productivity was up roughly 15 per cent on average, and for less experienced workers the figure was 30 to 35 per cent. The mechanism matters more than the headline number. The AI did not replace the agents. It gave every agent access to the knowledge of the best agents, in real time, at the moment they needed it. It raised the performance floor.

That mechanism shows up consistently wherever the independent evidence is strongest. A UK government evaluation of Microsoft Copilot found report summarisation faster and higher quality in small-sample observed task conditions, though the report itself cautions that these findings were supplementary rather than definitive. Field studies across thousands of developers found meaningful throughput gains on defined coding tasks. A study of 758 BCG consultants, led by researchers at Harvard and MIT, found tasks completed 25 per cent faster at 40 per cent higher quality for work within what the researchers called AI's capability frontier. A separate study published in Science, testing professional writing tasks across 453 knowledge workers, found a 40 per cent time reduction and an 18 per cent quality improvement.

Where that mechanism is missing, performance breaks down just as consistently. The same Copilot evaluation found Excel analysis slower and less accurate than the control group. An independent randomised controlled trial by METR, published in July 2025, found experienced developers 19 per cent slower on complex work in their own codebases using early-2025 AI tools, while believing they were 20 per cent faster. A large observational study published in JAMA across 8,581 clinicians at five health systems found measurable reductions in EHR time of roughly 10 per cent, considerably below the 50 per cent figure vendors had promoted.

The pattern is consistent enough to be useful. AI works reliably when the task is bounded and self-contained, the output is quickly verifiable, the cost of error is low, and the system has access to the context it needs. It breaks down when any of those conditions is missing.
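
Expressed as a checklist, that deployment test looks something like the sketch below. The structure follows the four conditions above; the field names, verdict strings and example cases are illustrative, not drawn from any of the studies cited.

```python
from dataclasses import dataclass, fields

@dataclass
class UseCase:
    bounded_task: bool        # self-contained work with clear inputs and outputs
    quickly_verifiable: bool  # a human can check the output in minutes
    low_error_cost: bool      # a wrong output is cheap to catch and correct
    context_available: bool   # the system can reach the data the task depends on

def ai_suitability(case: UseCase) -> str:
    """All four conditions must hold; any single miss predicts breakdown."""
    missing = [f.name for f in fields(case) if not getattr(case, f.name)]
    if not missing:
        return "deploy, with metrics defined in advance"
    return "expect breakdown -- missing: " + ", ".join(missing)

# Report summarisation with full document access passes all four conditions.
print(ai_suitability(UseCase(True, True, True, True)))
# Complex work in a private codebase typically fails the bounded-task and
# verification conditions, consistent with the METR slowdown finding.
print(ai_suitability(UseCase(False, False, False, True)))
```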

The vendors, assessed honestly

Every major technology company is now an AI company, whether by genuine transformation or aggressive repositioning. The independent evidence across platforms points to a consistent finding: the vendor does not determine the outcome. Deployment conditions do.

Microsoft has the largest enterprise productivity footprint in the world, with 450 million commercial Microsoft 365 seats. The independent evidence on Copilot is not encouraging. Recon Analytics, an independent research firm with no commercial relationship with the vendors it studies, surveyed more than 150,000 respondents between July 2025 and January 2026 and found that only 35.8 per cent of employees with Copilot access use it regularly, against 83.1 per cent for ChatGPT. When employees at organisations with access to all three platforms are asked which tool they choose voluntarily, 70 per cent choose ChatGPT, 18 per cent choose Gemini, and 8 per cent choose Copilot. Recon Analytics measured Copilot's accuracy NPS at negative 19.8 in January 2026. Microsoft is responding: it has adopted a multi-model approach within Copilot and launched Work IQ to pipe Dynamics 365 data into the Copilot experience. Whether these changes are sufficient to address the adoption problem is genuinely uncertain.

OpenAI's ChatGPT Enterprise wins on independent adoption evidence. The Recon Analytics survey documents 83.1 per cent regular workplace usage, dramatically higher than any other platform, and consistent user preference across all independent research reviewed. Its structural limitation is the absence of native integration with enterprise productivity suites: it operates as a separate tool, which limits its access to the specific data context that makes AI genuinely useful for role-specific work.

Google's Gemini was bundled into all Workspace plans at no additional cost in January 2025. Recon Analytics found Gemini accuracy satisfaction scores 23 points above Microsoft Copilot in January 2026. The evidence base for Gemini's specific enterprise productivity impact remains thinner than for other platforms: there is no peer-reviewed or independent controlled study of Gemini's productivity effect comparable to the QJE customer service study or the METR developer RCT.

Anthropic's Claude is gaining enterprise share rapidly. Independent payment data from Ramp Economics Lab, covering more than 50,000 US businesses, shows Anthropic capturing over 73 per cent of spending amongst companies buying AI tools for the first time, as of March 2026. Its most concrete public enterprise outcome is self-reported: a publicly-listed company's earnings call in February 2026 cited up to 90 per cent reduction in engineering time on code migrations. There is no peer-reviewed controlled study of Claude's enterprise productivity impact comparable to the Tier 1 evidence for customer service AI or the METR developer RCT.

GitHub Copilot has the strongest controlled evidence of any AI productivity product in the enterprise market. Multiple large-scale field studies across thousands of developers consistently document 20 to 26 per cent more measurable output on defined coding tasks. The METR independent RCT supplies the calibrating constraint: experienced developers on complex, contextually rich problems in their own codebases showed a 19 per cent slowdown. Both findings can be true simultaneously. The task structure determines the outcome more than the model quality or the vendor.

The condition most organisations get wrong

The context condition is where many enterprise AI deployments stall, disappoint, or lose trust. AI is only as useful as the data it can see, and in most service and knowledge organisations a significant proportion of the interactions that matter most -- including client calls, customer conversations, field visits and case discussions -- never reach the system of record in structured form. The AI produces generic responses to specific situations. Teams rightly distrust it. The deployment stalls, and the diagnosis is usually wrong: people assume the technology is the problem when the real constraint is data infrastructure.

This is rarely named clearly before purchase, because it requires an honest audit of what your systems actually contain rather than a product evaluation. The strategic question worth asking before any significant AI deployment is straightforward: what proportion of the interactions that define your customer relationships and your service quality are currently captured in structured form that an AI tool can actually access? In my experience across client organisations, the honest answer is rarely more than half, and that gap is the real investment target.

The framework

Two conditions determine where any AI use case sits. The first is task type: whether the work is primarily retrieval and assembly, or whether it requires judgment and synthesis. The second is context availability: whether the AI has full access to the information it needs, or whether critical context is missing from the systems it can reach.

[Figure: Where AI actually works -- a two-condition framework quadrant]

The most important position is the bottom left, not because it is where organisations should stay, but because it is where most currently are, and because the path forward is clear. Fix the data infrastructure and a bottom-left use case becomes a top-left one. That is not a technology problem. It is an organisational design problem, and most organisations have not yet solved it.
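
For readers without the figure, here is a minimal sketch of the quadrant logic. The axis assignment is my reading of the framework -- columns track task type (left is retrieval and assembly), rows track context availability (top is available) -- and the labels and recommendations are illustrative, not a published taxonomy.

```python
def quadrant(task_type: str, context_available: bool) -> str:
    """Place a use case on the two-condition grid.

    task_type is "retrieval" (retrieval and assembly) or "judgment"
    (judgment and synthesis).
    """
    if task_type == "retrieval":
        if context_available:
            return "top-left: strongest evidence -- deploy and measure"
        return "bottom-left: generic output, stalled trust -- fix data capture first"
    if context_available:
        return "top-right: assistive use, with expert review of every output"
    return "bottom-right: weakest position -- avoid until conditions change"

# Where most organisations currently sit, per the audit question above:
print(quadrant("retrieval", context_available=False))
```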

What to do with this

The research and the framework are only useful if the organisation has resolved one prior condition: someone in leadership has made a clear decision, owns the outcome, and is visibly accountable for it. Three recent large-scale surveys make this point with unusual consistency. A KPMG quarterly pulse survey of executives at companies with more than a billion dollars in revenue found that 65 per cent are struggling to scale AI use cases and 62 per cent cite skills gaps as a primary barrier -- both organisational constraints, not technology ones. A Writer and Workplace Intelligence study of 2,400 knowledge workers found that 75 per cent of executives said their company's AI strategy was more for show than actual internal guidance, and only 35 per cent of employees said their manager is an AI champion. A WalkMe survey across 7,150 executives and employees in 14 countries found a 52-point trust gap between executives and workers on AI for business-critical decisions, and that 93 per cent of AI spending goes to infrastructure, models and tools against 7 per cent invested in the people using them. The consistent finding across all three is not that the technology is failing. It is that leadership is not yet doing what only leadership can do: making the decision, owning the result, and redesigning how work happens around the tool.

In the near term, the most reliably evidenced starting point across all the research is meeting and communication summarisation. It requires no data infrastructure work to begin, and the UK government evaluation found summarisation faster and higher quality under observed task conditions, albeit in a small sample. Pick one team, define the metrics in advance, and evaluate honestly at 60 days rather than relying on user sentiment.

The medium-term priority is the data infrastructure question. Map what proportion of your key interactions are currently captured in structured form your AI tools can access. That audit will likely reveal a significant bottom-left opportunity, and addressing it is what unlocks the use case with the strongest evidence: customer-facing AI that can surface the knowledge of your best people at the moment a colleague needs it.
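
The audit itself is simple arithmetic once the counts exist. A minimal sketch, with every channel name and figure hypothetical:

```python
# channel: (total interactions last quarter, of which captured in structured,
# AI-accessible form). All counts below are hypothetical.
interactions = {
    "client calls":     (1500, 100),
    "customer emails":  (2500, 2300),  # already lands in the CRM
    "field visits":     (800, 50),
    "case discussions": (1200, 150),
}

total = sum(t for t, _ in interactions.values())
captured = sum(c for _, c in interactions.values())
print(f"structured capture rate: {captured / total:.0%}")  # 43% in this example
for channel, (t, c) in interactions.items():
    print(f"  {channel}: {c / t:.0%} captured, {t - c} interactions invisible to AI")
```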

By year-end, evaluate with actual measured data rather than reported experience. The 39-point gap between perceived and measured productivity in the METR July 2025 developer study is a standing caution against treating positive sentiment as evidence of value. The organisations building genuine AI capability are the ones that measure carefully and expand based on what they find.
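
In practice that means comparing measured task times against reported sentiment within the same pilot. A minimal sketch of the calculation, with entirely hypothetical numbers:

```python
from statistics import mean

# Hours per task, measured before and during the pilot. Hypothetical data.
baseline_hours = [6.0, 5.5, 7.2, 6.8]
with_ai_hours = [6.6, 6.1, 7.9, 7.1]
perceived_speedup = 0.20  # what the team reports in the pulse survey

measured_change = 1 - mean(with_ai_hours) / mean(baseline_hours)
gap_points = (perceived_speedup - measured_change) * 100
print(f"measured change: {measured_change:+.1%}")  # negative means slower
print(f"perception gap:  {gap_points:.0f} points") # METR's analogue was 39
```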

A note on people

The research is consistent on something most AI commentary avoids: AI benefits less experienced workers more than experienced ones. The QJE study's 15 per cent average masked a wide distribution, with less experienced workers improving by 30 to 35 per cent and the gain declining with expertise. The independent developer RCT found experienced practitioners slower on their most complex work. The BCG study found that the consultants who performed best without AI saw the smallest improvements with it.

AI raises the performance floor. It compresses the gap between your best people and the rest. It does not raise the ceiling for experienced practitioners on complex, contextually rich work. This is not a reason to reduce investment in great people. It is a reason to take knowledge capture seriously as an infrastructure investment, because the productivity mechanism works only when that knowledge exists in a form the AI can access and surface.

Is this a bubble?

The infrastructure investment is real. Microsoft, Google and Amazon are each committing tens of billions of dollars annually to AI and cloud infrastructure, at a scale with few historical parallels in the technology industry. Whether this produces returns commensurate with its scale is the most consequential unresolved question in enterprise technology right now.

The evidence reviewed in this note points to a clear answer, even if it is not the binary one most commentary reaches for. The technology works in the specific conditions documented here. The customer service productivity gain is real, independently peer-reviewed, and structurally explained. The coding productivity improvements on defined tasks are directionally real across multiple field studies. The clinical documentation improvements are real at the independently measured 10 per cent in the JAMA study across 8,581 clinicians, not the 50 per cent vendors promoted. What is not real, or not yet real, is broad productivity transformation across enterprise knowledge work. The gap between what the independent evidence supports and what vendor marketing claims has been the consistent finding across every platform reviewed here. Bloomberg reported in early 2026 that Microsoft's share price fell sharply in the first quarter, reflecting market repricing of the timeline and scope of AI monetisation rather than abandonment of the underlying thesis.

The organisations building lasting AI advantage are not the ones deploying the most licences. They are the ones measuring most carefully, expanding based on evidence, and building the data infrastructure that makes the mechanism actually work.

The vendor pressure will intensify before it eases. The interest of any serious organisation is in deploying AI where there is independent evidence it works, measuring results with sufficient rigour to know whether it is working in their specific context, and building the institutional capability to evaluate the next generation of products from a position of informed experience rather than hopeful naivety. That is not a conservative position. It is the most commercially sound one available.

On the evidence

This note prioritises independently published and peer-reviewed evidence. Market data and vendor performance indicators draw on external reporting and company disclosures, and are identified in the text. Some conclusions represent synthesis from the evidence reviewed and are offered as my reading rather than established findings.

| Study | Tier | Key finding |
| --- | --- | --- |
| Brynjolfsson, Li & Raymond. "Generative AI at Work." QJE, 2025 | 1 | 5,172 agents. 15% avg productivity gain; 30–35% for less experienced workers. |
| Noy & Zhang. "Productivity Effects of Generative AI." Science, 2023 | 1 | 453 knowledge workers. 40% less time on professional writing tasks; 18% quality improvement. |
| METR. "Impact of Early-2025 AI on Experienced Developer Productivity." arXiv:2507.09089, July 2025 | 1 | 16 experienced developers, 246 tasks. Developers 19% slower with AI; believed they were 20% faster. |
| Rotenstein et al. "Ambient AI Scribe on Clinician Documentation Burden." JAMA, 2026 | 1 | 8,581 clinicians, 5 health systems. EHR time reductions considerably below vendor claims of 50%. |
| Dell'Acqua et al. "Navigating the Jagged Technological Frontier." HBS Working Paper, 2023 | 2 | 758 BCG consultants. 25% faster, 40% higher quality within AI's capability frontier; 19% worse outside it. |
| UK Department for Business and Trade. "Evaluation of the M365 Copilot Pilot." March 2025 | 2 | Report summarisation faster in observed tasks. Excel analysis slower and worse. No organisational productivity improvement found. |
| Peng et al. "Impact of AI on Developer Productivity: GitHub Copilot." MIT/Microsoft Research, 2023 | 2* | ~55% faster on defined coding task. Small sample, commercially co-produced. Treat as directional. |
| Recon Analytics. US AI Workplace Survey. July 2025–January 2026. 150,000+ respondents | 4 | Copilot regular usage 35.8% vs ChatGPT 83.1%. Copilot accuracy NPS: negative 19.8. |
| Ramp Economics Lab. Independent payment data, 50,000+ US businesses. March 2026 | 4 | Anthropic capturing 73%+ of spending amongst first-time AI tool buyers. |
| KPMG Quarterly Pulse Survey, Q1 2026. Executives at companies >$1bn revenue | 3 | 65% struggling to scale use cases; 62% cite skills gaps. Both organisational constraints. |
| Writer and Workplace Intelligence. 2,400 knowledge workers, US and Europe, 2026 | 3 | 75% of executives said AI strategy was more for show than actual guidance. Only 35% said manager is an AI champion. |
| WalkMe (SAP). State of Digital Adoption. 7,150 executives and employees, 14 countries, 2026 | 3 | 52-point trust gap between executives and workers. 93% of AI spending on infrastructure; 7% on people. |
| McKinsey Global AI Survey, 2025 | 3 | 88% of companies using AI; majority report limited or no measurable EBIT impact. |
| Klarna press release and Bloomberg/CEO reporting, 2024–25 | 4 | Company claimed annualised savings; CEO later said cost focus had damaged quality. Began rehiring human agents. |

Tier 1: peer-reviewed or independent RCT  ·  Tier 2: structured evaluation with disclosed methodology  ·  Tier 3: consulting survey  ·  Tier 4: company disclosures and journalism  ·  *Vendor involvement noted

Paul Mason is an independent adviser working with leadership teams on AI strategy, commercial growth and organisational change. paulmasonconsult.com