The Enterprise AI Bottleneck Is Not the Model
Every enterprise AI project I've worked on in the past two years has started the same way. A senior leader sees a demo. The demo is impressive. The leader asks, "How quickly can we build this?" A team is assembled. A model is selected. A proof-of-concept is delivered in four weeks. Everyone is excited.
Then the project stalls. Not for weeks. For months.
It doesn't stall because the model isn't good enough. The models are extraordinary — any of the major ones can handle the core intelligence task for most enterprise use cases. The model is the easiest part. It's the five percent of the project that gets ninety-five percent of the attention.
The project stalls because of everything around the model. Data pipelines that don't exist. Integrations that take three times longer than estimated. Organizational ambiguity about who owns outcomes. Evaluation infrastructure that nobody thought to build. The hundred edge cases that the demo conveniently avoided.
I've led teams through six enterprise AI deployments over the past two years — internal knowledge assistants, customer-facing document processing systems, compliance automation. Four are in production. One was shelved after we realized the data quality problem was unsolvable within the client's constraints. One I walked away from entirely, which I'll come back to. Here's what I've learned about where these projects actually get stuck.
The Data Problem Everyone Expects and Nobody Prepares For
I tell every stakeholder the same thing at project kickoff: your AI project is a data engineering project that happens to use a language model. Most of them nod politely and then ignore this until they discover it's true, usually about six weeks in.
The pattern is consistent. We ask for the data. The client says, "We have lots of data." We ask where it lives. The answer is: in a SQL Server database from 2014, in SharePoint folders organized by someone who left the company three years ago, in email threads, in PDFs on a shared drive, in a proprietary system with no API, and in the heads of three subject matter experts who are too busy to talk to us.
This is not an exaggeration. On one project, we spent the first two months just building a data inventory. The client — a mid-size financial services company — believed they had a centralized knowledge base. What they actually had was 47 SharePoint sites, 12 shared drives, three Confluence spaces (two abandoned), a legacy wiki nobody could log into anymore, and approximately 180,000 documents in formats ranging from modern DOCX to scanned PDFs from the 1990s with no OCR. When we showed them the inventory, there was a long silence on the call.
Getting that data into a format an LLM can use is where the majority of project time goes. In our experience, data pipeline work consumes somewhere between 60 and 70% of total project effort. Discovery and cataloging alone take weeks on a typical engagement. Extraction and normalization — pulling data out of source systems, converting formats, handling encoding issues, dealing with scanned documents where OCR quality varies wildly — is another major chunk. And then cleaning: on one project, 30% of the documents in the "approved policies" folder were actually drafts that had never been ratified. Nobody knew.
Keeping the data current is the part people forget entirely. Documents get updated, new ones are created, old ones are deprecated. Building a pipeline that stays in sync with source systems is an ongoing engineering challenge, not a one-time task. We now require a two-week "data readiness assessment" before committing to any project timeline. We've turned down projects based on that assessment — not because the AI part was hard, but because the data part was impossible within the client's budget and timeline.
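That "stays in sync" problem reduces to a recurring diff between the source of truth and whatever the AI system has already indexed. Here is a minimal sketch of the planning step, assuming each source record carries a stable ID and a modification timestamp (all names are hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class SourceDoc:
    doc_id: str
    modified: datetime

def plan_sync(source_docs, indexed):
    """Diff source-of-truth documents against the index.

    `indexed` maps doc_id -> modification time at last indexing. Returns
    (to_upsert, to_delete): new or changed docs, and index entries whose
    source document has been removed or deprecated.
    """
    source_ids = {d.doc_id for d in source_docs}
    to_upsert = [d for d in source_docs
                 if d.doc_id not in indexed or d.modified > indexed[d.doc_id]]
    to_delete = [doc_id for doc_id in indexed if doc_id not in source_ids]
    return to_upsert, to_delete

# Example: one updated doc, one new doc, one deprecated doc.
now = datetime(2026, 3, 1, tzinfo=timezone.utc)
earlier = datetime(2026, 1, 1, tzinfo=timezone.utc)
source = [SourceDoc("policy-a", now), SourceDoc("policy-c", now)]
index = {"policy-a": earlier, "policy-b": earlier}
upserts, deletes = plan_sync(source, index)
```

The hard part in practice isn't this diff; it's getting reliable `modified` timestamps out of systems like SharePoint and shared drives in the first place.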
Integration Debt Is a Full Project Inside Your Project
Enterprise systems weren't built for AI. They were built for humans clicking buttons in web interfaces, and sometimes for other systems exchanging structured data through carefully negotiated API contracts. The flexible, high-throughput, low-latency integration that AI applications need is fundamentally different from what most enterprise architectures were designed to support.
On a recent project, we needed to integrate with seven enterprise systems. Here's what we found, roughly in order of how bad it got:
- The CRM had a REST API but was rate-limited to 100 requests per minute with no bulk operations. Syncing 50,000 customer records took eight hours.
- The ERP had a SOAP API documented in a 400-page PDF from 2016 — half the endpoints no longer existed, and the ones that did required a VPN connection and a client certificate that took three weeks to provision.
- The document management system had no API at all, just a "web services" interface that was actually a screen-scraping layer someone built in 2018, which broke every time the vendor updated their UI.
- The ticketing system had a modern GraphQL API, which was wonderful, except we needed real-time updates and the API only supported polling.
- The HR platform required SFTP file transfers on a nightly batch schedule, and getting access to the SFTP server required approvals from IT security, HR, and legal — six weeks.
- The email system had good APIs but strict data governance policies that prohibited storing email content outside the platform, so we had to build a real-time processing pipeline that read, processed, and discarded emails without persisting them.
- The custom internal application was built by a team that no longer existed, had no API, no documentation, and a database schema that could best be described as "creative."
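The CRM's rate limit alone explains that eight-hour sync: 50,000 records at 100 requests per minute, one record per request, is 500 minutes of wall-clock time no matter how fast your code runs. A minimal client-side throttle, sketched with an injectable clock so the behavior is observable without real sleeping (names are hypothetical; a real client would also handle retries and pagination):

```python
import time

class RateLimiter:
    """Client-side throttle: at most `max_calls` calls per `period` seconds."""

    def __init__(self, max_calls, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock          # injectable for testing
        self.sleep = sleep
        self.calls = []             # timestamps of recent calls

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            self.sleep(self.calls[0] + self.period - now)
        self.calls.append(self.clock())

class FakeClock:
    """Deterministic clock so the demo runs instantly."""
    def __init__(self):
        self.t = 0.0
    def __call__(self):
        return self.t
    def advance(self, dt):
        self.t += dt

# 50,000 records at 100 requests/minute, one record per request:
sync_minutes = 50_000 / 100          # 500 minutes, roughly eight hours

clock = FakeClock()
limiter = RateLimiter(max_calls=2, period=60.0, clock=clock, sleep=clock.advance)
for _ in range(3):
    limiter.wait()                   # the third call waits out the full period
```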
Every one of these integrations was a mini-project. Together they consumed about 40% of total engineering effort and were the primary driver of schedule delays. The AI model itself took roughly two weeks to configure and deploy. The integrations took four months.
We now build an integration abstraction layer early in every project — standardized interfaces that our AI application talks to, with adapters for each source system behind them. When a source system changes (and they always change), we update the adapter, not the AI application. We also budget integration effort explicitly and separately from AI development. Our rule of thumb: take your initial integration estimate and multiply by 2.5. After six projects, that multiplier has been remarkably consistent. The unknowns in enterprise integration aren't technical — they're organizational. Getting access, getting approvals, finding the person who understands the legacy system: that's what takes time.
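Here is a sketch of what that abstraction layer looks like, with hypothetical adapters standing in for real source systems. The application code calls one interface; each adapter hides how its system actually delivers data:

```python
from abc import ABC, abstractmethod
from typing import Iterator

class Record(dict):
    """Normalized record shape; the only thing the AI application consumes."""

class SourceAdapter(ABC):
    """One adapter per source system, all exposing the same interface."""
    @abstractmethod
    def fetch_changed_since(self, cursor: str) -> Iterator[Record]: ...

class CrmAdapter(SourceAdapter):
    """Hypothetical: would wrap the rate-limited REST API."""
    def fetch_changed_since(self, cursor):
        yield Record(source="crm", id="42", cursor=cursor)   # stubbed record

class HrBatchAdapter(SourceAdapter):
    """Hypothetical: would read the nightly SFTP drop instead of a live API."""
    def fetch_changed_since(self, cursor):
        yield Record(source="hr", id="7", cursor=cursor)     # stubbed record

ADAPTERS = {"crm": CrmAdapter(), "hr": HrBatchAdapter()}

def pull_all(cursor):
    """Application-side code: identical no matter what sits behind each adapter."""
    return [rec for adapter in ADAPTERS.values()
            for rec in adapter.fetch_changed_since(cursor)]

records = pull_all("2026-01-01T00:00:00Z")
```

When the document management vendor breaks the screen-scraping layer again, the fix is confined to one adapter; `pull_all` and everything above it don't change.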
Nobody Owns the Thing
This is the bottleneck nobody wants to talk about because it's not a technical problem. It's a people problem. And it's the single most common reason enterprise AI projects fail to deliver value even when they succeed technically.
The pattern: an AI project gets sponsored by a senior leader. A team is formed. The team builds something that works. But nobody has defined who owns the AI system's outputs. Nobody has decided how AI-generated content fits into existing approval workflows. Nobody has updated job descriptions to reflect new AI-augmented processes. Nobody has trained the people who are supposed to use the system. And critically, nobody has the authority to make binding decisions when competing teams have conflicting requirements.
On one project, we built a document processing system that could extract, classify, and summarize incoming correspondence at about 94% accuracy. Technically, a success. But when we tried to deploy it into the operational workflow, we found that the operations team wanted 99% accuracy before they'd trust it, the legal team required human sign-off on every AI-generated summary (eliminating most of the efficiency gains), the product team wanted to expose the summaries to customers while the compliance team vetoed it, and the IT team wanted on-premises hosting which tripled infrastructure costs. Nobody had the authority to resolve these conflicts. Each team optimized for their own constraints, and the result was a system that was technically capable but operationally paralyzed.
The project sat in limbo for three months. Then a VP was given explicit ownership and authority to make tradeoff decisions. Within two weeks, she had resolved every conflict. The system went live a month later. Nothing technically changed. The organizational structure changed.
We won't start a project now without a named owner who has decision-making authority over scope, quality targets, deployment model, and operational processes. Not a "sponsor" who checks in monthly. An owner who's accountable for the outcome and can make decisions on the spot when tradeoffs arise.
The Project I Walked Away From
I mentioned earlier that I walked away from one project entirely. It was a manufacturing company that wanted an AI system to analyze production line data and surface quality anomalies. The business case was compelling. The AI part was straightforward.
The problem was the data. It lived in seven different SCADA systems from five different vendors, none of which had APIs, two of which were running on operating systems Microsoft had stopped supporting, and one of which was physically air-gapped for security reasons. The client's IT team had a six-month backlog. The operational team didn't trust IT. Nobody had budget authority to accelerate the integration work. And the business unit that wanted the AI system had no leverage over the teams that controlled the data.
I spent four weeks trying to find a path forward. I couldn't. We walked away. It was the right call — the project would have consumed a year of engineering effort and delivered nothing, because the prerequisite work was stuck in organizational quicksand that we had no way to move. I'm still not entirely sure I made the right call — maybe someone with more patience could have threaded that needle — but eighteen months later, I haven't seen evidence that the client ever solved the underlying problem either.
If You Can't Measure It, You Can't Improve It
Here's a question I ask every team that wants to deploy an AI system: how will you know if it's working? The most common answer is some version of "we'll try it and see." That's not evaluation. That's hope.
On one project, we were asked to build a system that would "improve customer response quality." We asked how response quality was currently measured. It wasn't. There was a customer satisfaction survey with a 4% response rate, making it statistically useless. No internal quality rubrics, no baseline measurements, no historical data on response accuracy or completeness.
We spent three weeks building an evaluation framework before writing a single line of AI code — defining a quality rubric, scoring historical responses against it using independent human reviewers, establishing inter-rater reliability, building an automated evaluation pipeline, and setting up an A/B testing framework. The client initially saw this as overhead. But without it, we would have had no way to prove the AI system was better than what it replaced. And in fact, our first model configuration performed worse than human responses on two of five quality dimensions. We only knew because we could measure it. We iterated, and by the third configuration the AI outperformed on four of five dimensions. We couldn't have gotten there without the evaluation infrastructure.
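The core of that framework is unglamorous: rubric scores per response, averaged per dimension, then compared against the human baseline. Here is a toy sketch with hypothetical dimension names and made-up scores, just to show the shape of the comparison that surfaces a "worse on two of five dimensions" finding:

```python
from statistics import mean

# Hypothetical rubric dimensions; real ones get defined with the client.
DIMENSIONS = ["accuracy", "completeness", "tone", "clarity", "compliance"]

def dimension_means(scored):
    """scored: list of per-response dicts mapping dimension -> 1-5 rubric score."""
    return {d: mean(s[d] for s in scored) for d in DIMENSIONS}

def compare(candidate, baseline, margin=0.0):
    """For each rubric dimension, does the candidate beat the baseline?"""
    c, b = dimension_means(candidate), dimension_means(baseline)
    return {d: c[d] - b[d] > margin for d in DIMENSIONS}

# Made-up scores: this configuration wins three dimensions and loses two.
human_baseline = [{"accuracy": 4, "completeness": 4, "tone": 3, "clarity": 3, "compliance": 4}]
ai_candidate = [{"accuracy": 3, "completeness": 3, "tone": 4, "clarity": 4, "compliance": 5}]
wins = compare(ai_candidate, human_baseline)
```

The point of the per-dimension breakdown is that an aggregate score hides exactly the failures you need to see: a system can look "better on average" while quietly losing on accuracy.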
Teams that ship AI without evaluation infrastructure are making decisions based on vibes. The queries people try during testing are not representative of production traffic — they tend to be the easy cases, the ones that make the demo look good. Model performance drifts over time as data distributions shift and user behavior evolves. Without continuous measurement, you won't notice degradation until users start complaining.
The Last Mile Is Bigger Than You Think
The demo works. The proof-of-concept impresses stakeholders. The evaluation metrics look good. Now you need to put it in production, and everything changes.
In demo, you control the inputs. In production, users do whatever they want. They paste entire documents into a chat field, ask questions in languages you didn't test for, submit empty queries, find workflows you never imagined. On one deployment, we discovered that about 12% of production queries were in a language we hadn't tested for. Not because we were careless — the client had told us their users were English-speaking. They were, mostly. But 12% of the time, they pasted content in other languages and asked questions about it. Our system handled this gracefully for some languages and catastrophically for others. Building robust multilingual support added six weeks to the project.
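A cheap guard for this failure mode is a script-level check on incoming text. It can't tell French from English, so it's no substitute for proper language identification, but it flags the heavily non-Latin inputs that failed worst for us. A hypothetical sketch using only the standard library:

```python
import unicodedata

def non_latin_fraction(text):
    """Fraction of alphabetic characters whose Unicode name isn't Latin-script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    non_latin = sum(1 for ch in letters
                    if not unicodedata.name(ch, "").startswith("LATIN"))
    return non_latin / len(letters)

def needs_multilingual_path(query, threshold=0.3):
    """Route to fallback handling when a query is mostly non-Latin text.
    The 0.3 threshold is an arbitrary starting point for this sketch."""
    return non_latin_fraction(query) >= threshold
```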
Cost management is the other production surprise. In demo, nobody cares about cost. Our most expensive deployment processes around 15,000 queries per day. At roughly 3,500 input tokens and 800 output tokens per query, with retrieval augmentation adding an embedding operation per query, daily inference cost is somewhere around $180 — call it $5K-$6K per month, which is manageable but not trivial. The challenge is that costs aren't linear. Complex queries require more tokens. Safety guardrails add classifier inference costs. We had one month where a new feature drove a 3x increase in query volume and our inference costs tripled before we could implement caching and optimization to bring them back down. We now build cost monitoring and alerting into every deployment, with automated throttling that kicks in if costs exceed budget.
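The arithmetic behind that estimate is worth making explicit, because it's also the core of the budget guard. The token counts below come from the deployment described above; the per-million-token rates are assumptions for illustration, since actual provider pricing varies by model and changes over time:

```python
# Assumed rates for illustration only; not any specific provider's pricing.
RATE_IN = 2.00       # $ per 1M input tokens (assumption)
RATE_OUT = 6.00      # $ per 1M output tokens (assumption)
RATE_EMBED = 0.10    # $ per 1M embedding tokens (assumption)

def daily_cost(queries, tok_in, tok_out, tok_embed=500):
    """Estimated daily inference spend for a retrieval-augmented app."""
    per_query = tok_in * RATE_IN + tok_out * RATE_OUT + tok_embed * RATE_EMBED
    return queries * per_query / 1_000_000

class BudgetGuard:
    """Accumulates spend and trips a throttle once the daily budget is hit."""
    def __init__(self, daily_budget):
        self.daily_budget = daily_budget
        self.spent = 0.0
    def record(self, cost):
        self.spent += cost
    @property
    def throttled(self):
        return self.spent >= self.daily_budget

est = daily_cost(15_000, 3_500, 800)   # the volumes described above

guard = BudgetGuard(daily_budget=250.0)
guard.record(est)
normal_day = guard.throttled           # within budget
guard.record(2 * est)                  # simulate a volume spike
spike_day = guard.throttled            # throttle kicks in
```

In production the guard sits in front of the model call, degrading to cached or queued responses rather than failing outright when it trips.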
And quality degrades over time in ways that are subtle and hard to detect without systematic monitoring. Source documents get updated. Business processes evolve. The model provider updates their model, sometimes without notice. We experienced this firsthand when a provider pushed an update that slightly changed the output format of their API. Our parsing logic broke silently — the system continued to operate, but about 15% of responses were being truncated. Our evaluation monitoring caught the quality drop within 24 hours. If we'd been relying on user complaints, it might have taken weeks.
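A check that catches silent truncation can be crude and still effective: flag responses that are suspiciously short or don't end in sentence-final punctuation, and alert when the rate jumps against a historical baseline. A hypothetical sketch with made-up thresholds:

```python
def truncation_rate(responses, min_chars=200, terminal=".!?"):
    """Fraction of responses that look cut off. A crude proxy: suspiciously
    short, or not ending in sentence-final punctuation. Tune per application."""
    def looks_truncated(resp):
        resp = resp.strip()
        return len(resp) < min_chars or resp[-1] not in terminal
    if not responses:
        return 0.0
    return sum(looks_truncated(r) for r in responses) / len(responses)

def degraded(current_rate, baseline_rate, tolerance=0.05):
    """Alert when truncation exceeds the baseline by more than `tolerance`
    (an arbitrary threshold for this sketch)."""
    return current_rate - baseline_rate > tolerance
```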
The Readiness Checklist (Shortened)
Before we start any enterprise AI project now, I walk through five questions with the client. The model selection isn't one of them — that comes last, if at all.
- Do you know where all relevant data lives, can you access it programmatically, and is someone responsible for keeping it clean?
- Is there a named owner with actual decision-making authority over the AI initiative — not a sponsor who checks in monthly, but someone accountable for outcomes?
- How will you measure whether the system is working, and do you have a baseline for the current process?
- Do you have a plan for ongoing operations — who monitors it in production, who handles edge cases, who responds when it fails?
- Have legal and security reviewed the data flows, including where data can be processed or stored?
If a client can't answer those five questions with confidence, we're not ready to select a model. We've watched teams spend months evaluating GPT-4 versus Claude versus Gemini while their data sat in incompatible silos and nobody had defined what success looked like. The model comparison is a comfortable technical exercise that lets everyone avoid the harder organizational and infrastructure work.
The Projects That Worked Were Worth It
I'm not writing this to be discouraging. The enterprise AI projects we've delivered are genuinely transformative. One reduced document processing time from 45 minutes to 3 minutes per case. Another improved first-response accuracy by roughly a third while cutting average response time by more than half. These are real, measurable improvements that justify the investment.
But every one of those successes required fighting through all of the above. The data pipeline work. The integration challenges. The organizational alignment. The evaluation infrastructure. The last-mile production hardening. In each case, the model was the easy part.
If your enterprise AI strategy starts with "which model should we use," you're starting in the wrong place. Start with your data. Start with your integrations. Start with your organizational readiness. Start with your evaluation plan. Get those right, and the model will almost take care of itself.
The hard parts of enterprise AI aren't the parts anyone talks about at conferences. But they're the parts that determine whether your AI project delivers value or becomes another expensive proof-of-concept that never made it to production.
Updated March 4, 2026