Beyond the LLM Monolith: How Enterprises Will Actually Capture AI Value
Executive summary
GenAI adoption is now near universal, but measurable value lags badly. McKinsey reports that 88% of organizations deploy AI somewhere, yet just ~39% see any EBIT lift and a mere ~6% qualify as “AI high performers.”[1] MIT’s latest pulse check is harsher: ~95% of generative AI pilots show no P&L impact at all, because they stop at UX experiments and never wire into transactional workflows.[2] Meanwhile, AI sprawl keeps rising—72% of enterprises now juggle overlapping AI tools that add cost, governance burden, and integration risk.[3]
The through-line is simple: we treated LLMs as a one-size-fits-all engine for enterprise value. LLMs are phenomenal for language-native tasks—reasoning in natural language, powering interfaces, orchestrating workflows. But the bulk of enterprise value still lives inside structured, relational data (transactions, customers, risk events), and LLMs are neither optimal nor reliable there.[4] Forward-looking teams are therefore splitting the stack:
LLMs provide the reasoning and UX layer. Relational foundation models (RFMs) and other predictive engines provide the “truth layer” over enterprise data.[5]
1. Hype vs. outcomes: the GenAI adoption gap
Adoption is broad; impact is narrow. McKinsey’s 2025 survey shows AI is now everywhere, yet only about one-third of companies have scaled it beyond pockets, and fewer than half can attach it to EBIT outcomes.[1] MIT’s failure data echoes what I see in boardrooms: most deployments are thin wrappers over foundation models. They rarely touch the systems that actually move cash—billing, credit, claims, supply-demand balancing—so they cannot prove impact.[2] Investors, operators, and boards need to stop blaming LLMs and start blaming shallow integration.
2. Why LLMs fail on core enterprise data
Enterprises store durable value inside warehouses and lakehouses, not PDF repositories. Orders, invoices, customer journeys, fraud events, exposure matrices—they are all relational tables with temporal semantics. Yet most GenAI pilots serialize tables into text, stuff them into context windows, and hope the model “reasons” like a BI analyst. Three structural issues make that fragile:
- LLMs lag classic ML on tabular data. Surveys covering dozens of benchmarks show gradient-boosted trees and purpose-built tabular models still beat LLMs on accuracy, calibration, and training efficiency.[4]
- Serialization destroys relational structure. Microsoft’s “Table Meets LLM” benchmark highlights how brittle LLM reasoning becomes when foreign keys, joins, and time windows are flattened into prose; even basic consistency checks degrade once tables exceed context limits.[6]
- Predictive workloads demand calibrated probabilities, not eloquent prose. Fraud detection, churn prediction, or underwriting require supervised learning with reproducible metrics, explainability, and governance. Free-form generation is the wrong substrate.
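The calibration point above is concrete enough to sketch. The snippet below uses purely hypothetical scores for a churn-style task and shows the two checks a free-form generator cannot offer: a reproducible error metric (the Brier score) and a reliability check that asks whether high-confidence predictions actually come true at the stated rate.

```python
# Illustrative sketch (hypothetical numbers): why predictive workloads need
# calibrated probabilities with reproducible metrics, not free-form text.
preds = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]   # model-estimated P(churn) per customer
labels = [1, 1, 0, 0, 0, 0]              # observed outcomes (1 = churned)

# Brier score: mean squared error of the probabilities.
# Lower is better; a constant 0.5 guess always scores 0.25.
brier = sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(preds)
print(f"Brier score: {brier:.3f}")

# Simple reliability check: among high-confidence predictions (p >= 0.7),
# what fraction actually churned? A well-calibrated model keeps this close
# to the average predicted probability in the bucket.
high = [(p, y) for p, y in zip(preds, labels) if p >= 0.7]
observed = sum(y for _, y in high) / len(high)
print(f"Observed churn rate in p>=0.7 bucket: {observed:.2f}")
```

This is the governance substrate underwriting and fraud teams actually audit: numbers you can backtest, not narratives you have to trust.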
Jure Leskovec summarizes it well: forcing LLMs to operate directly on relational data is architecturally misaligned with how enterprises encode their most valuable information.[9]
3. The rise of relational foundation models
Relational foundation models (RFMs) address that misalignment. KumoRFM is a useful reference point:[5]
- It maps enterprise databases into temporal heterogeneous graphs (tables → node types, foreign keys → edges).
- A relational graph transformer attends over multi-table, multi-modal inputs (numbers, categories, timestamps, embeddings).
- Users express tasks via Predictive Query Language (PQL)—a declarative way to describe the prediction target while the model handles feature learning.[7]
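The table-to-graph mapping in the first bullet can be sketched in a few lines. The schema below (customers, orders) is hypothetical, and real RFM systems do this at warehouse scale; the toy version just shows the shape: rows become typed nodes, foreign keys become edges, and timestamps are preserved so training never peeks past a prediction cutoff.

```python
# Hedged sketch of mapping relational tables to a temporal heterogeneous graph.
# Hypothetical two-table schema; node ids are (table_name, primary_key).
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "ts": "2025-01-05", "amount": 40.0},
    {"order_id": 11, "customer_id": 1, "ts": "2025-02-01", "amount": 15.0},
    {"order_id": 12, "customer_id": 2, "ts": "2025-02-10", "amount": 99.0},
]

# Tables become node types; each row becomes one typed node.
nodes = {("customers", c["customer_id"]): c for c in customers}
nodes.update({("orders", o["order_id"]): o for o in orders})

# Foreign keys become typed, time-stamped edges between node types.
edges = [
    (("orders", o["order_id"]), ("customers", o["customer_id"]), o["ts"])
    for o in orders
]
print(len(nodes), "nodes,", len(edges), "edges")

# Temporal neighborhood: the only history a leakage-free model may attend to
# when predicting for customer 1 as of 2025-02-01 (ISO dates sort as strings).
hist = [o for o in orders if o["customer_id"] == 1 and o["ts"] < "2025-02-01"]
print("customer 1 history before cutoff:", [o["order_id"] for o in hist])
```

The temporal filter at the end is the crucial part: it is what lets a single pretrained transformer answer many predictive queries over the same graph without hand-built, leakage-prone feature pipelines.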
In Kumo’s RelBench results, the RFM beats expert-built baselines by 2–8% on average, and fine-tuning pushes improvements into the double digits.[5] Case studies cite a 5.4× boost in lead-scoring accuracy at Databricks, Reddit’s “largest ads accuracy lift ever,” and a Fortune 500 marketplace projecting ~$100M GMV uplift plus an 80% serving-cost drop.[8]
None of that replaces LLMs. Instead, it pairs them: LLMs become reasoning, explanation, and orchestration layers; RFMs (plus classic ML) become the predictive ground truth.
4. A warehouse-native reference architecture
A resilient enterprise AI stack now gravitates toward six layers:
- Data platform. Snowflake, BigQuery, Databricks, or Redshift holds the transactional and event backbone.
- Predictive intelligence. RFMs and established ML models connect directly to warehouse tables to produce calibrated scores and forecasts.[4]
- Knowledge & retrieval. Vector databases + embeddings for documents, policies, specs—fuel for RAG systems.
- LLM / agentic layer. General and domain-specific LLMs for natural-language querying, explanation, and workflow orchestration. McKinsey notes that 62% of organizations experiment with AI agents, but only 23% have scaled them; reliable backends are the missing piece.[1]
- Application & workflow. Internal tools and customer-facing products embedding both predictive outputs and GenAI UX, with human-in-the-loop controls.
- Governance. Central risk, compliance, and measurement capabilities to tame AI sprawl.[3]
When these layers talk to each other, LLM copilots stop hallucinating metrics because they query warehouse-trusted predictions; predictive systems gain a natural-language interface without sacrificing rigor.
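The contract between those layers can be made explicit. In the sketch below, `fetch_churn_score` and `narrate` are hypothetical stand-ins for a warehouse-backed scoring service and an LLM call, not real APIs; the point is the division of labor: the copilot reads governed numbers and only ever narrates them.

```python
# Hedged sketch of the "LLMs narrate, predictive systems decide" contract.
SCORES = {"cust_42": 0.87}  # stand-in for scores served from the warehouse


def fetch_churn_score(customer_id: str) -> float:
    """The copilot reads from the governed prediction store; it never invents."""
    return SCORES[customer_id]


def narrate(customer_id: str, score: float) -> str:
    """Where an LLM would turn the trusted number into a grounded explanation."""
    risk = "high" if score >= 0.7 else "low"
    return f"{customer_id}: churn risk {score:.0%} ({risk}) per warehouse model."


score = fetch_churn_score("cust_42")
print(narrate("cust_42", score))
```

Because every number in the copilot’s answer traces back to the prediction store, the “hallucinated metric” failure mode is designed out rather than prompted away.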
5. Investor implications
The commodity curve is catching up with generic “ChatGPT-for-X” UX. Sustainable differentiation is shifting toward data-native infrastructure:
- Ask where the model sits relative to the warehouse. If the answer is “in a separate chatbot that never writes back,” expect shallow value.
- Measure P&L adjacency. How much revenue, margin, or risk exposure do the predictions influence? Can customers run A/Bs and show calibrated lifts?
- Probe time-to-first-meaningful-prediction. Platforms that connect to data and emit validated scores within days will outcompete bespoke consulting-heavy builds.
6. Guidance for founders and technical leaders
- Lead with the problem, not the model. Anchor on churn, fraud, underwriting, demand planning—whatever drives the P&L—and work backward from the tables that matter.
- Use LLMs where they shine. Natural-language to PQL translators, narrative explanations of predictions, auto-generated briefs, and orchestration of downstream tasks.
- Treat RFMs and classic ML as numerical truth. Don’t ask the LLM “who will churn?”—ask it how to act, given a calibrated churn score.
- Design for governance on day zero. Role-based access, audit logs, and unified observability reduce AI sprawl and accelerate procurement.[3]
- Invest in compounding moats. Proprietary data normalization, deep ERP/CRM/core-system integrations, and privacy-preserving cross-customer learning loops are harder to copy than UX polish.
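The “ask it how to act, given a calibrated churn score” principle above implies a thin, auditable policy layer between score and workflow. A minimal sketch, with entirely hypothetical thresholds and action names: the numeric trigger stays deterministic and reviewable, while the LLM’s job is downstream (drafting the outreach, explaining the decision).

```python
# Hedged sketch: a transparent decision policy over a calibrated churn score.
# Thresholds and action names are hypothetical, for illustration only.
def next_action(churn_score: float, ltv: float) -> str:
    """Map a calibrated churn score plus customer value to a concrete play."""
    if churn_score >= 0.8 and ltv >= 10_000:
        return "escalate_to_account_manager"   # high risk, high value
    if churn_score >= 0.5:
        return "send_retention_offer"          # moderate risk
    return "no_action"


print(next_action(0.85, 25_000))
print(next_action(0.60, 500))
print(next_action(0.20, 500))
```

Keeping this logic out of the prompt is what makes it governable: thresholds live in version control, get A/B-tested against lift, and survive model swaps on either side.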
7. Closing thought
The first wave of enterprise GenAI was LLM-centric and interface-driven; it reimagined knowledge work but rarely rewired financial engines. The next wave is data-native and architecture-driven: LLMs for language, reasoning, and agents; relational and domain-specific foundation models for predictions; and a governed, warehouse-centric stack that ties both to verifiable P&L outcomes. That is where durable enterprise AI value will accrue.
References
- [1] McKinsey & Company, “The state of AI in 2025: Agents, innovation, and transformation,” 2025.
- [2] Times of India, “MIT study finds 95% of generative AI projects are failing,” 2025.
- [3] TechRadar Pro, “Tackling AI sprawl in the modern enterprise,” 2025.
- [4] Liu et al., “Large Language Models on Tabular Data – A Survey,” 2024.
- [5] Kumo AI, “Introducing KumoRFM: A Foundation Model for In-Context Learning on Relational Data,” 2025.
- [6] Sui et al., “Table Meets LLM: Can Large Language Models Understand Structured Table Data?” WSDM 2024.
- [7] Kumo AI Documentation, “Understanding Predictive Query (PQL),” 2025.
- [8] Skywork, “Kumo: The Ultimate Guide to Predictive AI on Relational Data,” 2025.
- [9] Jure Leskovec, “Why LLMs Aren’t a One-Size-Fits-All Solution for Enterprises,” Towards Data Science, 2025.
Updated November 25, 2025