Beyond the LLM Monolith: How Enterprises Will Actually Capture AI Value

I've been in enough boardrooms and architecture reviews over the past two years to notice a pattern. The AI deployments that get celebrated internally are almost always the ones built on top of document stores and email threads — the knowledge work, the summarization, the Q&A over PDFs. The ones nobody talks about at all-hands meetings are the ones that were supposed to wire into the real financial machinery: billing, credit, churn, risk. And in most organizations, those are either still in "proof of concept" or quietly dead.

The data on this is uncomfortable. McKinsey's 2025 survey shows that roughly 88% of organizations have deployed AI somewhere, but only about 39% have attached it to measurable EBIT outcomes, and a mere 6% qualify as what they call "AI high performers."[1] MIT's research is harsher: around 95% of generative AI pilots show no P&L impact because they stop at interface experiments and never connect to transactional workflows.[2] I believe these numbers. I've seen them up close.

The problem isn't the models. The problem is that we've been using LLMs as if they were a universal solvent — capable of dissolving every enterprise data problem if you just prompt them right. They're not. And the organizations that figure this out early are building something more durable than a chatbot wrapper.

Why LLMs Fail on the Data That Actually Matters

Here's the thing I keep running into when reviewing enterprise AI architectures: the data that drives the most important decisions isn't text. It's tables. Orders, invoices, customer journeys, fraud events, exposure matrices — all of it lives in relational databases with temporal semantics, foreign keys, and joins that encode relationships no single row captures on its own.

The most common response to this has been: serialize the tables into text and feed them to the LLM. Ask it to reason like a data analyst. And for small tables and simple questions, this works. But it falls apart at scale and under pressure, for structural reasons.

Research surveying dozens of benchmarks consistently shows that gradient-boosted trees and purpose-built tabular models still beat LLMs on accuracy, calibration, and training efficiency for structured prediction tasks.[4] Microsoft's research into LLM reasoning on tables found that even basic consistency checks degrade once tables exceed context limits — which happens fast when you're dealing with real enterprise data volumes.[6] The relational structure that makes data meaningful — the foreign keys, the time windows, the multi-table joins — is destroyed when you flatten it into prose.

More fundamentally: fraud detection, churn prediction, underwriting — these need calibrated probabilities, not eloquent prose. They need reproducible metrics, explainability, and governance chains that regulators can audit. An LLM generating text is the wrong substrate for that, no matter how well you prompt it. Jure Leskovec put it plainly: forcing LLMs to operate directly on relational data is architecturally misaligned with how enterprises encode their most valuable information.[9]

The Architecture That's Actually Starting to Work

The pattern I'm seeing in the deployments that do show P&L impact is a split stack. LLMs handle language-native tasks — reasoning in natural language, powering conversational interfaces, orchestrating workflows, generating explanations. A separate layer handles predictions over structured data, with dedicated models that understand relational semantics.

Relational foundation models (RFMs) are one emerging approach to that second layer. KumoRFM is a useful reference point for understanding the concept.[5] The idea is to map enterprise databases into temporal heterogeneous graphs — tables become node types, foreign keys become edges — and train a model that attends over multi-table, multi-modal inputs natively, without serializing everything to text first. Users express prediction tasks through something like a declarative query language; the model handles the feature learning.
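To make the table-to-graph mapping concrete, here's a minimal sketch of the idea — the schema and helper names are my own illustrations, not any vendor's actual API. Tables become node types, foreign-key columns become typed edges, and multi-table joins become paths in the resulting graph.

```python
# Minimal sketch of mapping a relational schema to a heterogeneous graph.
# Schema and function names are illustrative, not a real RFM API.

def schema_to_graph(tables):
    """tables: {table_name: {"pk": str, "fks": {column: referenced_table}}}"""
    node_types = list(tables)
    edge_types = [
        (src, col, ref)  # e.g. ("orders", "customer_id", "customers")
        for src, spec in tables.items()
        for col, ref in spec["fks"].items()
    ]
    return node_types, edge_types

schema = {
    "customers": {"pk": "customer_id", "fks": {}},
    "orders":    {"pk": "order_id",    "fks": {"customer_id": "customers"}},
    "items":     {"pk": "item_id",     "fks": {"order_id": "orders",
                                               "product_id": "products"}},
    "products":  {"pk": "product_id",  "fks": {}},
}

# Each row of a table becomes a node of that type; each foreign-key value
# becomes an edge instance, so joins are never serialized away into prose.
nodes, edges = schema_to_graph(schema)
```

The point of the sketch is the contrast with text serialization: nothing here is flattened, so the customer→order→item relationships remain first-class structure the model can attend over.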

The reported results are interesting. On RelBench benchmarks, the RFM approach beats expert-built baselines by several percentage points on average, with fine-tuning pushing results further. The case studies — lead scoring accuracy improvements at Databricks, ads accuracy lifts at Reddit, a Fortune 500 marketplace projecting significant GMV upside with lower serving costs — are the kind of numbers that get executives' attention.[8] I'd treat specific projections like the "$100M GMV uplift" figure with appropriate skepticism until you've seen the methodology, but the directional claim — that models designed for relational data outperform text-native models on relational prediction tasks — feels right to me and matches what I've seen in practice.

None of this replaces LLMs. What it does is give them better data to reason about. The LLM copilot stops hallucinating revenue figures because it's querying a calibrated churn score from a model designed for that problem, not trying to estimate churn from serialized spreadsheets.

What a More Complete Stack Looks Like

The resilient enterprise AI architecture I'm working toward with clients these days has roughly six layers, and the important insight is how they connect.

The data platform is the foundation — Snowflake, BigQuery, Databricks, Redshift, whatever holds the transactional and event backbone. This is where truth lives. Everything above it is either reading from it or writing back to it.

The predictive intelligence layer sits directly on top of the warehouse: RFMs, gradient-boosted models, purpose-built classifiers. These produce calibrated scores and forecasts that the rest of the system can trust. They don't speak natural language; they speak probabilities and confidence intervals.
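What "calibrated" means operationally: among all events the model scores around probability p, roughly a fraction p should actually occur. A minimal reliability check — pure Python on toy data, illustrative rather than a production metric — looks like this:

```python
# Toy reliability check: do predicted probabilities match observed frequencies?
# Illustrative only; real pipelines use proper calibration curves and Brier scores.

def reliability_bins(scores, outcomes, n_bins=5):
    """Bin (score, outcome) pairs; compare mean score to observed event rate."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, outcomes):
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append((s, y))
    report = []
    for b in bins:
        if b:
            mean_score = sum(s for s, _ in b) / len(b)
            event_rate = sum(y for _, y in b) / len(b)
            report.append((round(mean_score, 2), round(event_rate, 2), len(b)))
    return report

# A well-calibrated model's mean score tracks the event rate in every bin.
scores   = [0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9]
outcomes = [0,   0,   0,   0,   1,   1,   1,   1,   1,   0  ]
print(reliability_bins(scores, outcomes))  # → [(0.1, 0.2, 5), (0.9, 0.8, 5)]
```

This is the property downstream systems depend on: a 0.9 churn score has to mean roughly nine-in-ten, not just "higher than 0.5," or the LLM layer will confidently explain numbers that don't hold up.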

Then there's a knowledge and retrieval layer — vector databases, embeddings, document stores — for the genuinely text-native knowledge that lives in policies, contracts, specs, and support conversations. This fuels RAG systems and gives the LLM layer something to cite.

The LLM and agentic layer sits on top of both: general and domain-specific models handling natural-language querying, explanation generation, and workflow orchestration. McKinsey notes that about 62% of organizations are experimenting with AI agents, but only around 23% have scaled them.[1] The gap is usually the absence of reliable backends. Agents that have to hallucinate the answer to "what's our current churn rate?" don't survive contact with real operations.

The application and workflow layer is where this surfaces to users — internal tools, customer-facing products, dashboards embedding both predictive outputs and conversational UX, with appropriate human-in-the-loop controls for high-stakes decisions.

Governance is the layer that makes everything else sustainable: centralized risk, compliance, and measurement capabilities. Some 72% of enterprises now manage overlapping AI tools that create cost, governance, and integration risk.[3] Without governance, AI sprawl becomes its own kind of technical debt.

What This Means in Practice

The single most useful question I've learned to ask when reviewing an AI deployment is: where does the model sit relative to the warehouse? If the answer is "in a separate chatbot that never reads or writes back to core systems," expect shallow value regardless of how impressive the demo looks. The model can only be as useful as the data it can access and act on.

For founders and technical leaders building AI products: lead with the problem, not the model. Anchor on churn, fraud, underwriting, demand forecasting — whatever actually drives the P&L — and work backward from the tables that matter. Then use LLMs where they genuinely shine: natural-language interfaces, explanation generation, workflow orchestration. Don't ask the LLM "who will churn?" Ask it how to communicate to a customer, given a calibrated churn score from a model designed for that problem.
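That division of labor can be sketched in a few lines. Everything here is a hypothetical stand-in — `fetch_churn_score` for a warehouse read of the predictive layer's output, the prompt string for whatever LLM call you use — but the shape is the point: the model supplies the number, the LLM supplies the language.

```python
# Illustrative split: predictive model supplies the score, LLM supplies the words.
# fetch_churn_score and the prompt are hypothetical, not a specific vendor API.

def fetch_churn_score(customer_id, warehouse_scores):
    """Stand-in for reading a calibrated score the predictive layer wrote back."""
    return warehouse_scores[customer_id]

def build_retention_prompt(customer_id, warehouse_scores):
    score = fetch_churn_score(customer_id, warehouse_scores)
    risk = "high" if score >= 0.7 else "moderate" if score >= 0.3 else "low"
    # The LLM never estimates churn; it only explains and drafts outreach.
    return (
        f"Customer {customer_id} has a churn probability of {score:.2f} "
        f"({risk} risk) per our churn model. Draft a retention email "
        f"appropriate to that risk level."
    )

warehouse_scores = {"C-1042": 0.81}  # toy stand-in for a warehouse table
prompt = build_retention_prompt("C-1042", warehouse_scores)
```

Note what the LLM is never asked to do: infer the probability. It receives a calibrated number as context and works purely in its native medium.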

Governance on day zero is not premature optimization. Role-based access, audit logs, unified observability — these reduce AI sprawl and actually accelerate procurement with enterprise customers who've been burned before.
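Even a thin audit layer pays for itself early. A minimal sketch — in-memory log and illustrative names, where production would write to an append-only store — of wrapping every scoring call with an audit record:

```python
# Minimal audit-log wrapper: every scored decision leaves a queryable trail.
# AUDIT_LOG is an in-memory stand-in for an append-only audit store.
import functools
import time

AUDIT_LOG = []

def audited(model_name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            AUDIT_LOG.append({
                "model": model_name,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "ts": time.time(),
            })
            return result
        return wrapper
    return decorator

@audited("churn_v3")
def score_churn(customer_id):
    return 0.42  # stand-in for a real model call

score_churn("C-7")
```

Starting with this shape on day zero means the governance chain regulators ask about is a query, not an archaeology project.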

The first wave of enterprise GenAI was LLM-centric and interface-driven. It reimagined knowledge work in genuinely useful ways. The next wave is data-native: LLMs for language and orchestration, relational and domain-specific models for predictions, all tied together in a governed stack that connects to verifiable business outcomes. That's where durable value will accrue — and it requires treating the LLM as one component of a larger architecture rather than the whole thing.

References

  1. McKinsey & Company, "The state of AI in 2025: Agents, innovation, and transformation," 2025.
  2. Times of India, "MIT study finds 95% of generative AI projects are failing," 2025.
  3. TechRadar Pro, "Tackling AI sprawl in the modern enterprise," 2025.
  4. Liu et al., "Large Language Models on Tabular Data – A Survey," 2024.
  5. Kumo AI, "Introducing KumoRFM: A Foundation Model for In-Context Learning on Relational Data," 2025.
  6. Sui et al., "Table Meets LLM: Can Large Language Models Understand Structured Table Data?" WSDM 2024.
  7. Kumo AI Documentation, "Understanding Predictive Query (PQL)," 2025.
  8. Skywork, "Kumo: The Ultimate Guide to Predictive AI on Relational Data," 2025.
  9. Jure Leskovec, "Why LLMs Aren't a One-Size-Fits-All Solution for Enterprises," Towards Data Science, 2025.