AI Diligence Is Becoming Technical Diligence
A good AI demo compresses uncertainty. It hides the dull but important questions: where the data comes from, how quality is measured, what happens when the model is wrong, who pays for inference at volume, whether the system can leak private data, and who owns the incident when the agent does something ambitious at 2:17 a.m.
Diligence has to do the opposite. It has to expand the demo back into a system. Not to kill the excitement, but to find out whether the excitement is attached to something durable.
That is why AI diligence is becoming technical diligence again. The question is no longer "which model are you using?" That question still matters, but it is rarely the center. The better question is: can this company repeatedly turn model capability into reliable product outcomes?
Evaluate The AI System, Not The Model
When I look at an AI product, I do not start with a leaderboard. I start with the workflow. What job is the product changing? Where does the user trust the answer? What evidence says the product is better than the previous process? What happens when the output is incomplete, unsafe, late, expensive, or confidently wrong?
The model is one component in a larger operating system: data inputs, retrieval, orchestration, tools, human review, policy controls, telemetry, cost management, deployment process, customer support, and product feedback loops. A startup can have a modest model choice and a strong system. It can also have a frontier model and a fragile business.
The Minimum Serious Diligence Grid
The diligence grid I use has six domains.
Workflow fit. Does the AI product own a painful, repeated job, or does it merely decorate an existing interface? Strong evidence includes customer workflows, adoption traces, retention, time saved, error reduced, or support load changed. Weak evidence is a magical demo with no production behavior behind it.
Evaluation. Can the team measure output quality in a way that survives contact with real users? I would want to see regression tests, adversarial examples, sampling process, human-review rubric, failure taxonomy, and production-quality monitoring. "We tried it and it looked good" is not evaluation. It is a vibe wearing a blazer.
Security and misuse. LLM applications introduce failure modes that normal web-app security reviews do not fully cover. OWASP's LLM Top 10 is useful because it asks about prompt injection, insecure output handling, sensitive information disclosure, excessive agency, supply chain risk, and overreliance. Diligence should ask what the system can read, reveal, decide, trigger, and change.
Economics. AI cost is not just input tokens and output tokens. It includes retrieval, embeddings, caching, retries, latency budgets, monitoring, human review, support, compliance overhead, and failed attempts. A product can look attractive at demo volume and unattractive at customer volume.
Governance. Who owns AI behavior after launch? NIST's AI RMF language is helpful here because it forces teams to think in terms of govern, map, measure, and manage. Mature teams have named owners, escalation paths, release gates, and incident learning loops. Immature teams have a committee and a slide titled "Responsible AI."
Defensibility. The interesting question is not whether the company uses AI. It is whether usage makes the product better. Does the company have workflow depth, proprietary data rights, distribution, domain expertise, integration advantage, switching cost, or a feedback loop that compounds?
What Good Evidence Looks Like
Good AI diligence should move up an evidence ladder. At the bottom are demos and curated examples. They are useful for orientation, but they should not carry the deal. Higher up are regression evaluation suites, red-team reports, security control evidence, cost traces, customer deployment artifacts, support-ticket deltas, incident reports, and monitored production outcomes.
Founders should not read this as "build enterprise bureaucracy before seed." Stage matters. But the best early teams can still answer cleanly: what do we measure, what have we learned, what breaks, what is expensive, what is unsafe, what is improving, and what are we not claiming yet?
The Better Diligence Conversation
A weak diligence conversation tries to prove that the AI is impressive. A strong one tries to understand where the system creates leverage and where it creates risk.
For investors, this changes the question from "is this AI real?" to "what evidence says this AI product can survive customers, security review, usage growth, regulation, and margin pressure?"
For operators, it changes preparation. Do not walk into diligence with only a product deck. Bring the evals, cost model, architecture diagram, security notes, data-rights explanation, workflow evidence, and the list of things you are deliberately not automating yet.
The best AI diligence does not make both sides more theatrical. It makes both sides smarter about the system.