The Agentic Shift: Building AI That Does, Not Just Suggests

The first wave of enterprise GenAI gave us copilots. They drafted emails. They summarized meetings that probably should have been emails. And they were genuinely useful in a narrow, low-stakes way. The second wave is structurally different — models that don't just suggest but act. That shift sounds incremental. It isn't.

From what I've seen, the gap isn't the model. About two-thirds of large companies are piloting autonomous agents right now, but only somewhere around a quarter have actually scaled them. The rest are stuck — not because the AI can't perform, but because the trust infrastructure, the integrations, and the governance don't exist yet. A McKinsey report from last year confirmed what I'd been experiencing firsthand: most of these pilots are islands.[1]

There's also the uncomfortable finding from MIT that roughly 95% of GenAI initiatives with shallow integrations created zero P&L impact.[2] That number is alarming but not surprising. An agent that can't write to your CRM, can't send an email, can't trigger a workflow — it's just a very expensive recommendation engine. Expensive and ultimately forgettable.

Three modes of autonomy (not two)

Most people frame this as "AI assists humans" vs. "AI replaces humans," which is a false binary that leads to bad decisions. I think about it in three layers that map more cleanly to where teams actually get stuck:

Suggestion mode — the agent drafts, humans act. Low risk because the human is still the effector. Low leverage for the same reason. Most enterprise deployments live here permanently because nobody pushed them further.

Supervised autonomy — the agent executes, but only within pre-defined "safe action classes." Anything outside those guardrails triggers a human approval. This is where most mature teams sit right now. It's not the final destination, but it's a real and useful middle ground. The key investment here is defining what "safe" means precisely enough that the agent isn't asking for approval on every tiny step.

Full autonomy — humans set objectives, review telemetry, and handle exceptions. The agent owns outcomes, not outputs. I've seen this work reliably for support ticket routing at scale, for CRM hygiene, for some financial data reconciliation tasks. I haven't seen it work reliably for anything with a large blast radius or low reversibility. Maybe I'm being too conservative, but I'd rather be wrong in that direction.

Moving up this ladder is not primarily about swapping in a better model. It's about building trust infrastructure: telemetry, playbooks, and explicit decision rights about who owns what when something goes wrong.
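The supervised-autonomy layer above can be sketched as a simple gate: action classes explicitly marked safe execute directly, and everything else queues for human approval. This is a minimal illustration, not any particular framework's API; the `SAFE_ACTIONS` values and the `Action` type are hypothetical.

```python
from dataclasses import dataclass

# Action classes the agent may execute without approval. Defining this set
# precisely is the real investment; these values are illustrative.
SAFE_ACTIONS = {"ticket.categorize", "ticket.assign_queue", "crm.enrich_field"}

@dataclass
class Action:
    kind: str       # e.g. "ticket.categorize"
    payload: dict   # arguments for the underlying tool call

def route(action: Action) -> str:
    """Return 'execute' for pre-approved action classes, 'approve' otherwise."""
    return "execute" if action.kind in SAFE_ACTIONS else "approve"
```

The point of routing on action *class* rather than per-action confidence is that the gate stays auditable: you can enumerate exactly what the agent is allowed to do without approval.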

A real example — the one that taught me most of this

A colleague and I were building an autonomous agent for a B2B support workflow. The idea was straightforward: the agent would triage incoming tickets, resolve the ones it could, escalate the ones it couldn't, and update the CRM at each step. We had a clean spec. We'd run it in shadow mode for three weeks and the agreement rate with human agents was over 90%. We felt good about it.

The third day in production, the upstream ticketing API started returning intermittent 503s. Nothing unusual — it happens. But the agent didn't have a retry budget. It didn't have a circuit breaker. So it retried indefinitely on a handful of tickets, generating duplicate updates in the CRM, which triggered auto-replies to customers, which generated new tickets. By the time someone noticed, we had about 200 phantom tickets in the queue and two genuinely confused enterprise customers who'd received five emails each about an issue that had been closed two days ago.

We fixed it in about a day. But the lesson was durable: an agent acting in the world is not like a script acting in the world. Scripts fail loudly. Agents fail productively, in the sense that they keep trying to accomplish goals through any path available. You have to run a blast-radius calculation on every tool call, every API write, every state mutation, not just the happy path.
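A minimal sketch of the two mechanisms that incident was missing: a bounded retry budget per call, and a circuit breaker that stops hammering an unhealthy upstream. The class names, thresholds, and the choice of `IOError` as the failure signal are all illustrative assumptions.

```python
import time

class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; callers stop hitting
    the upstream until `cooldown` seconds have passed."""
    def __init__(self, max_failures: int = 5, cooldown: float = 60.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.cooldown

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

def call_with_budget(fn, breaker: CircuitBreaker, retries: int = 3):
    """At most `retries` attempts, and none at all while the breaker is open."""
    for _ in range(retries):
        if not breaker.allow():
            raise RuntimeError("circuit open: upstream unhealthy, not retrying")
        try:
            result = fn()
            breaker.record(ok=True)
            return result
        except IOError:
            breaker.record(ok=False)
    raise RuntimeError("retry budget exhausted")
```

With this in place, the 503 storm becomes a handful of failed attempts and a loud error, instead of an unbounded feedback loop through the CRM.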

When to actually let the model act

After that incident, I started gating autonomy behind four questions rather than a vague confidence threshold:

Can we undo this in under five minutes? If not, there needs to be a human in the loop. Not because the model is untrustworthy, but because even a 98% success rate produces incidents at scale, and you need the ability to recover quickly.

How many customers or systems does a failure touch? I call this blast radius, borrowing from SRE. I start agents on internal tooling — expense categorization, CRM enrichment, Slack summarization — where a bad output is annoying but not customer-facing. External-facing actions come later, after the model has earned some trust on lower-stakes work.

What does human verification actually cost? This one surprises people. If a human reviewing an agent's output takes roughly the same time as a human doing the task from scratch, the autonomy isn't buying you much. The ROI math only works when verification is dramatically faster than execution.

How quickly does feedback arrive? A support response has a feedback loop measured in hours. A quarterly financial close has a feedback loop measured in weeks. Slow feedback means you can't catch drift early. I'm much more willing to grant autonomy in fast-feedback domains.
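The four questions above can be encoded as an explicit gate rather than a vibe. The thresholds below (five-minute undo, 2x verification speedup, 24-hour feedback) are illustrative defaults, not universal rules; the `TaskProfile` type is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    undo_minutes: float      # time to reverse a bad action
    blast_radius: int        # customers/systems a failure touches
    verify_minutes: float    # human time to check the agent's output
    execute_minutes: float   # human time to do the task from scratch
    feedback_hours: float    # how long until errors surface

def grant_autonomy(t: TaskProfile) -> bool:
    """All four gates must pass; thresholds here are illustrative."""
    reversible = t.undo_minutes <= 5
    contained = t.blast_radius <= 10                      # internal-tooling scale
    worth_it = t.verify_minutes < 0.5 * t.execute_minutes  # verification must be cheap
    fast_loop = t.feedback_hours <= 24
    return reversible and contained and worth_it and fast_loop
```

The useful property is that a rejected task tells you *which* gate failed, which points directly at what infrastructure (undo paths, probes, verification tooling) would unlock autonomy.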

The failure modes that keep biting people

Beyond the incident I described above, here are the ones I see most often:

Confident hallucinations combined with write access are genuinely dangerous. A model that hallucinates a customer record and then emails that customer with personalized details drawn from that hallucination is not a minor inconvenience. Schema validation and database existence checks need to happen before any side effect fires. This sounds obvious and yet I've seen it skipped on more projects than I'd like to admit.
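A sketch of the check that gets skipped: validate the record against a schema and confirm it exists in the system of record before any side effect fires. The schema and the in-memory `CUSTOMERS` store are stand-ins for a real CRM lookup.

```python
# Gate the side effect (an email) behind schema validation and an existence
# check against the system of record. CUSTOMERS is an in-memory stand-in.
REQUIRED_FIELDS = {"customer_id", "email", "issue_id"}
CUSTOMERS = {"c-1001": {"email": "ops@example.com"}}

def safe_to_email(record: dict) -> bool:
    if not REQUIRED_FIELDS <= record.keys():
        return False                                  # schema violation
    customer = CUSTOMERS.get(record["customer_id"])
    if customer is None:
        return False                                  # hallucinated customer
    return customer["email"] == record["email"]       # details must match the DB
```

A model-generated record that passes the schema but names a customer the database has never heard of never reaches the send step.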

Scope creep from vague goals is subtle but devastating. "Help close tickets faster" is not an instruction. It's an invitation for the agent to discover creative solutions you never intended. Treat agent instructions like API contracts — explicit verbs, explicit constraints, explicit definitions of success. "Categorize tickets by urgency score 1–5 and assign to the matching queue; do not modify ticket content or contact customers directly" is an instruction.

Permission bloat compounds over time. Agents accumulate credentials. A bot that needed read access six months ago got write access for a specific task and the write access was never revoked. Then a new agent is deployed using the same service account because it was convenient. I've seen this produce genuinely frightening permission trees in large organizations — overlapping bots with escalated rights and no clear ownership.[3] Centralize access management. Expire secrets aggressively. Treat agents like external contractors: least privilege, time-limited, audited.

Silent failures are the worst kind. The agent reports success because no exception was thrown. Meanwhile, the email went to a spam folder, the record update silently failed validation, the payment hit a pending state that nobody monitors. Outcome probes — separate checks that verify the real-world effect actually happened — are not optional. They're the whole game.
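An outcome probe in its simplest form: after the action, do an independent read-back of the system of record and check the effect actually landed, instead of trusting the absence of exceptions. The in-memory `crm` dict stands in for the real store; names are illustrative.

```python
# Verify the real-world effect with a separate read, not the write's return.
crm = {"ticket-42": {"status": "open"}}

def close_ticket(ticket_id: str) -> None:
    # Imagine this write sometimes silently failing validation upstream.
    crm.setdefault(ticket_id, {})["status"] = "closed"

def probe_closed(ticket_id: str) -> bool:
    """Independent read-back: did the write actually take effect?"""
    return crm.get(ticket_id, {}).get("status") == "closed"

close_ticket("ticket-42")
assert probe_closed("ticket-42")  # success is verified, not assumed
```

The discipline is that the probe reads through a different path than the write, so a silently failed write cannot report its own success.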

Building the trust infrastructure

The playbook I've landed on has four phases, and the discipline is in not skipping them:

Shadow mode first. Agents run in parallel alongside humans, logging what they would have done. You measure agreement rate, false positive rate, and the distribution of confidence scores. You don't just look at the aggregate — you look at the edge cases, because that's where the failures are hiding. I typically run shadow mode for at least three weeks, longer for high-stakes domains.

Supervised mode next. Agents act, humans approve each action class. Yes, this creates latency. Yes, the humans sometimes get tired of approving and start rubber-stamping. That rubber-stamping is data — it tells you which action classes are ready for full autonomy. Capture the reasons for rejection carefully; they become your guardrail specification.

Autonomous mode for the action classes that earned it. Not the whole agent — specific categories of action. The agent executes, humans review metrics daily. Any anomaly in the metrics triggers a rollback to supervised mode automatically, not as a manual decision. This is important: you don't want to rely on someone noticing something feels off.

Outcome mode is the target state for mature deployments. Humans set objectives — "close 200 tickets per day at a customer satisfaction score above threshold" — and the agent manages the playbook end-to-end. I've seen this work. It's not magic; it's months of careful instrumentation and trust-building. But it works.
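The four phases, with the automatic rollback from the autonomous tiers, can be modeled as a small state machine per action class. Phase names mirror the playbook above; the error-rate threshold is an illustrative assumption.

```python
PHASES = ["shadow", "supervised", "autonomous", "outcome"]

class ActionClassPhase:
    """Tracks one action class through the playbook. An anomaly in the daily
    metrics rolls it back automatically, never as a manual decision."""
    def __init__(self):
        self.phase = "shadow"

    def promote(self) -> None:
        i = PHASES.index(self.phase)
        if i < len(PHASES) - 1:
            self.phase = PHASES[i + 1]

    def on_daily_metrics(self, error_rate: float, threshold: float = 0.02) -> None:
        if error_rate > threshold and self.phase in ("autonomous", "outcome"):
            self.phase = "supervised"  # rollback is automatic, no human in the path
```

Modeling this per action class, not per agent, is what lets one category regress after an anomaly while the rest of the agent keeps running autonomously.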

The architecture underneath

A few things that I've found matter more than model choice:

Wrap every API and database in a tool abstraction layer — signed, rate-limited functions with deterministic schemas. Not because the model needs it, but because you need it for auditing. When an incident happens, you want to be able to replay exactly what the agent called and why. Microsoft's AutoGen and similar frameworks help reduce the bespoke glue code, but the underlying principle is the same regardless of framework.[4]
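A minimal sketch of such a wrapper: exact-schema argument checking, a sliding-window rate limit, and an HMAC-signed audit entry per call so incidents can be replayed. The secret handling and in-memory audit log are deliberate simplifications; in practice both live in managed infrastructure.

```python
import hashlib
import hmac
import json
import time

SECRET = b"rotate-me"   # illustrative; use per-tool secrets from a vault
AUDIT_LOG = []          # illustrative; use append-only storage in practice

class Tool:
    """Wraps one API call: schema-checked args, rate limit, signed audit entry."""
    def __init__(self, name, fn, schema, max_calls_per_min=30):
        self.name, self.fn, self.schema = name, fn, schema
        self.max_calls = max_calls_per_min
        self.calls = []

    def __call__(self, **kwargs):
        if set(kwargs) != self.schema:
            raise ValueError(f"{self.name}: args must be exactly {self.schema}")
        now = time.monotonic()
        self.calls = [t for t in self.calls if now - t < 60]
        if len(self.calls) >= self.max_calls:
            raise RuntimeError(f"{self.name}: rate limit exceeded")
        self.calls.append(now)
        entry = json.dumps({"tool": self.name, "args": kwargs}, sort_keys=True)
        sig = hmac.new(SECRET, entry.encode(), hashlib.sha256).hexdigest()
        AUDIT_LOG.append((entry, sig))   # replayable, tamper-evident record
        return self.fn(**kwargs)
```

The deterministic schema check matters more than it looks: it turns "the model passed a weird argument" from a latent production bug into an immediate, logged refusal.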

Persist plan state so agents survive restarts. I've been storing intent, intermediate outputs, and context in Redis plus Postgres on recent projects. Auditors can replay decisions. Engineers can debug failures without guessing what the agent was thinking when it called an API at 2am.
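A sketch of what that persistence looks like: checkpoint the intent and intermediate output at every step, keyed by run and step, so a restarted agent resumes and an auditor replays. Here `sqlite3` stands in for the Redis/Postgres pair, and the schema is illustrative.

```python
import json
import sqlite3

# Checkpoint intent and intermediate outputs so an agent survives restarts
# and decisions can be replayed. sqlite3 is a stand-in for the real stores.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE plan_state (
    run_id TEXT, step INTEGER, intent TEXT, output TEXT,
    PRIMARY KEY (run_id, step))""")

def checkpoint(run_id: str, step: int, intent: str, output: dict) -> None:
    db.execute("INSERT OR REPLACE INTO plan_state VALUES (?, ?, ?, ?)",
               (run_id, step, intent, json.dumps(output)))
    db.commit()

def replay(run_id: str) -> list:
    """Reconstruct, step by step, what the agent was thinking and produced."""
    rows = db.execute("SELECT step, intent, output FROM plan_state "
                      "WHERE run_id = ? ORDER BY step", (run_id,)).fetchall()
    return [(s, i, json.loads(o)) for s, i, o in rows]
```

The `(run_id, step)` primary key makes checkpointing idempotent, so a crashed-and-restarted agent re-running a step overwrites its checkpoint instead of duplicating it.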

Every agent gets a kill switch. This is non-negotiable. New behaviors always launch in a synthetic environment seeded with realistic data before they touch production. Chaos drills are scheduled, not improvised.

The organizational changes nobody talks about enough

This is where most agent programs actually fail, and it's not a technical problem.

You need goal owners — not just model owners. Someone in product or operations who owns the outcome metrics and curates what action catalog the agent has access to. Without this, the agent's goals drift with whoever happened to write the last system prompt update.

You need guardrail engineers. People whose job is to encode policies, design detection rules, and review incidents. This is a real specialty and it's different from ML engineering or software engineering. I've seen teams underinvest here badly — one person trying to do model selection, evaluation, guardrails, and on-call simultaneously, burning out inside six months.

You need an exception desk. A small human team that handles escalations and feeds learnings back into the agent. Not a support team. A feedback loop with accountability for closing the loop.

And you need evidence reviews — something like SRE postmortems, but weekly, where telemetry and business impact are actually inspected by people with authority to change things. Without this, small failures accumulate silently until they become a large failure that surprises everyone.

A rough roadmap

If I were starting a new agent program today, I'd structure it roughly like this: spend the first quarter building evaluation harnesses and running shadow pilots. Don't ship anything autonomous yet. Graduate two or three reversible workflows to supervised autonomy in the second quarter — CRM hygiene, expense categorization, something internal and low-stakes — and instrument absolutely everything. In the third quarter, expand to revenue-adjacent work: renewal prep, incident triage. Introduce outcome-based SLAs. Then, maybe, by the end of the year, aim for objective-level automation in the workflows where you have months of evidence and working telemetry.

That timeline feels slow to a lot of stakeholders. My experience is that it's actually fast when you compare it to teams that skip steps and spend months unwinding bad deployments.

The shift to agentic AI isn't about replacing people. It's about redeploying them — away from repetitive execution and toward designing guardrails, diagnosing edge cases, and building the trust infrastructure that makes the whole thing actually work. The teams I've seen succeed treat that infrastructure as the product. Everything else is implementation detail.

References

  1. McKinsey & Company, "The state of AI in 2025: Agents, innovation, and transformation," 2025.
  2. Times of India, "MIT study finds 95% of generative AI projects are failing," 2025.
  3. TechRadar Pro, "Tackling AI sprawl in the modern enterprise," 2025.
  4. Microsoft Research, "AutoGen: Enabling next-generation large language model applications," 2024.