AI Safety Is Not a Checkbox: Building Guardrails That Actually Hold

Last quarter, one of our LLM-powered applications generated a response that included a customer's home address. Not because the model was trained on that data — because an upstream system had leaked PII into the context window through a poorly sanitized database field, and our prompt assembly pipeline dutifully passed it along. The model, doing exactly what it was asked, surfaced it in a customer-facing summary.

Nobody was harmed. Our output guardrails caught it before it reached the end user. The whole thing — generation, detection, suppression — took 340 milliseconds. But when I reviewed the logs that evening, I sat with the weight of what could have happened if we'd treated safety as a compliance exercise instead of an engineering discipline.

I've sat through dozens of safety reviews. The pattern is almost always the same: someone produces a document listing the risks, someone else signs off on it, the team moves on to shipping features. The document lives in Confluence. Nobody reads it again until an auditor asks. This is compliance theater — it satisfies the organizational need to feel like safety has been addressed without actually building systems that enforce it.

I get why it happens. Safety engineering is expensive, adds latency, creates friction with product teams, and the ROI is invisible until something goes wrong. Every dollar on safety is a dollar not on features, and features drive revenue. Safety prevents losses nobody can quantify until the loss actually occurs.

But if your safety strategy is a PDF document and a quarterly review meeting, you don't have a safety strategy. You have a liability document. Real safety is runtime code that executes on every request, catches problems in milliseconds, and degrades gracefully when it encounters something it wasn't designed for.

What We Built and Why

Over the past 18 months, my team built what I'd call a serious production safety system. It's not perfect — we still find gaps. But it has caught over 14,000 problematic outputs in production, and I want to walk through how we built it, what it actually costs, and why the tradeoffs are worth it.

The architecture runs in three main phases: before the prompt reaches the model, at the model level, and on the output side. I'm going to be honest that this took much longer to build well than we expected, and we got some of the sequencing wrong along the way.

Before the Model Sees Anything

This is the layer most teams skip entirely, and it's arguably the most important one. Our input guardrail system has three components: content classification, PII detection, and injection detection — all running in parallel to minimize latency impact.

The content classifier categorizes incoming requests into risk tiers. Tier 1 is benign: summarize this document, extract these fields. Tier 2 is sensitive: anything involving personal data, financial information, health records. Tier 3 is high-risk: regulatory content, legal language, anything that could constitute advice. The risk tier determines which downstream controls activate.
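The tier-to-controls mapping can be sketched as a simple lookup. This is illustrative only: the tier names follow the text, but the specific control names (e.g. a human-review queue) are assumptions, not our production taxonomy.

```python
from enum import IntEnum

class RiskTier(IntEnum):
    BENIGN = 1      # summarization, field extraction
    SENSITIVE = 2   # personal, financial, or health data
    HIGH_RISK = 3   # regulatory, legal, advice-adjacent content

# Which downstream controls activate for each tier. Higher tiers
# include everything the lower tiers run, plus stricter checks.
TIER_CONTROLS = {
    RiskTier.BENIGN:    {"pii_scan", "toxicity_filter"},
    RiskTier.SENSITIVE: {"pii_scan", "toxicity_filter", "grounding_check"},
    RiskTier.HIGH_RISK: {"pii_scan", "toxicity_filter", "grounding_check",
                         "low_temperature", "human_review_queue"},
}

def controls_for(tier: RiskTier) -> set[str]:
    """Look up the set of downstream controls for a request's tier."""
    return TIER_CONTROLS[tier]
```

The point of making this a data structure rather than scattered if-statements is that adding a control to a tier is a one-line, reviewable change.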

Our classifier runs at around 12ms P95 latency and has a false positive rate of about 3%. That 3% matters — it means roughly 1 in 30 benign requests gets elevated unnecessarily, adding latency and cost. We got that number down from around 9% when we first deployed it. The key was building an evaluation dataset from our own production traffic, not relying on generic benchmarks that don't reflect how our users actually interact with the system.

PII detection runs on all inputs. It catches cases where upstream systems or user inputs contain personal information that should not enter the model's context. We use regex patterns for structured PII and a named entity recognition model for unstructured PII. When PII is detected, we don't simply block the request — we redact it with placeholder tokens, process the request, and then fail the output if the model attempts to reconstruct or reference the redacted information. This lets us handle legitimate use cases while preventing the model from surfacing personal information in ways it shouldn't.
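The redact-then-check flow for structured PII can be sketched with regexes and placeholder tokens. The patterns below are illustrative, covering only a few structured forms; the unstructured cases (names, addresses) need the NER model mentioned above.

```python
import re

# Illustrative patterns for structured PII only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace structured PII with placeholder tokens.

    Returns the redacted text plus a token->original map, so the
    output-side check can detect any attempt by the model to
    reconstruct or echo the original values."""
    replacements: dict[str, str] = {}
    for label, pattern in PII_PATTERNS.items():
        for i, match in enumerate(pattern.findall(text)):
            token = f"[{label}_{i}]"
            replacements[token] = match
            text = text.replace(match, token)
    return text, replacements
```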

Prompt injection is the vulnerability that keeps me up at night. Not because our models are uniquely vulnerable, but because the attack surface grows with every new integration point and data source. Our injection detection uses a two-stage approach: a heuristic scanner for known patterns, then a small classifier specifically trained on adversarial prompts. The classifier caught several injection attempts the heuristic scanner missed, including one that used Unicode bidirectional override characters to hide instructions within seemingly normal text. (I didn't know that attack existed before we found it in the wild.)
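A minimal sketch of the first (heuristic) stage, including a check for the bidirectional-override trick described above. The pattern list is illustrative and deliberately small; the second stage, the trained classifier, is where the harder cases get caught.

```python
import re

# Unicode bidirectional embedding/override controls that can hide
# instructions inside visually normal text.
BIDI_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
                 "\u2066", "\u2067", "\u2068", "\u2069"}

# A few well-known injection phrasings; real lists are much longer.
KNOWN_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"system prompt", re.I),
]

def scan_for_injection(text: str) -> list[str]:
    """Return heuristic findings; empty means 'pass to stage two'."""
    findings = []
    if any(ch in BIDI_CONTROLS for ch in text):
        findings.append("bidi_override_characters")
    for pattern in KNOWN_PATTERNS:
        if pattern.search(text):
            findings.append(f"known_pattern:{pattern.pattern}")
    return findings
```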

Total input guardrail latency: roughly 35ms at P95 with all three running in parallel. That's acceptable for our use cases where end-to-end response times are typically 2-4 seconds.
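Running the three checks in parallel is what keeps the total near the slowest check rather than the sum. A minimal asyncio sketch, with the check bodies as stubs and the timings purely illustrative:

```python
import asyncio

async def classify_content(text: str) -> tuple[str, bool]:
    await asyncio.sleep(0.012)          # stand-in for ~12ms classifier
    return "content", True

async def detect_pii(text: str) -> tuple[str, bool]:
    await asyncio.sleep(0.010)          # toy stand-in for real detection
    return "pii", "@" not in text

async def detect_injection(text: str) -> tuple[str, bool]:
    await asyncio.sleep(0.015)
    return "injection", "ignore previous" not in text.lower()

async def input_guardrails(text: str) -> dict[str, bool]:
    """Run all three checks concurrently: wall-clock cost is the
    slowest check, not the sum of the three."""
    results = await asyncio.gather(
        classify_content(text), detect_pii(text), detect_injection(text)
    )
    return dict(results)
```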

At the Model Level

I've reviewed system prompts across a dozen enterprise deployments. Most are variations of "You are a helpful assistant. Do not say anything harmful." This is approximately as effective as putting a "Please Do Not Rob This Store" sign on a bank vault.

Effective system prompts are specific, structured, and tested. Ours include explicit boundary definitions: what the model can and cannot discuss, what format outputs must follow, what information sources it should reference, and what to do when it encounters ambiguity. We version-control our system prompts and treat changes to them with the same rigor as code changes. They go through review, they have tests, they deploy through our standard CI/CD pipeline.
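Treating prompts like code means they get CI checks. A sketch of the kind of structural test a prompt change might run through; the required section headers and forbidden catch-all phrases are illustrative, not our actual rules.

```python
# Hypothetical structural checks for a version-controlled system prompt.
REQUIRED_SECTIONS = ["Scope:", "Output format:", "When unsure:"]
FORBIDDEN_PHRASES = ["do not say anything harmful"]  # vague catch-alls

def validate_system_prompt(prompt: str) -> list[str]:
    """Return a list of violations; an empty list means the prompt
    passes the structural checks and can proceed to behavioral tests."""
    violations = []
    for section in REQUIRED_SECTIONS:
        if section not in prompt:
            violations.append(f"missing:{section}")
    for phrase in FORBIDDEN_PHRASES:
        if phrase in prompt.lower():
            violations.append(f"vague:{phrase}")
    return violations
```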

One pattern that's worked well is what I call "structured refusal." Instead of telling the model not to discuss certain topics, we give it explicit instructions for how to handle out-of-scope requests: acknowledge the question, explain why it can't help, redirect to an appropriate resource. This produces much better user experiences than the model simply refusing or generating a canned "I can't help with that."
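A sketch of what a structured-refusal instruction block might look like; the wording is illustrative, not our production prompt.

```python
# Hypothetical refusal instructions appended to every system prompt.
STRUCTURED_REFUSAL = """\
When a request is outside your scope:
1. Acknowledge the question in one sentence.
2. State plainly that you cannot help with it, and why, in one sentence.
3. Redirect: name the team, document, or channel that can help.
Never answer out-of-scope questions partially, and never invent a
resource to redirect to."""

def render_system_prompt(base: str) -> str:
    """Compose the task-specific prompt with the shared refusal block."""
    return f"{base}\n\n{STRUCTURED_REFUSAL}"
```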

We also vary temperature and sampling parameters based on risk tier. Tier 3 requests use very low temperature, which constrains the model to its highest-confidence outputs. It's a blunt instrument — lower temperature doesn't mean "safer" — but combined with strict system prompts, it significantly reduces the surface area for problematic outputs in sensitive contexts.
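The tier-dependent sampling can be expressed as a small mapping. The exact values here are assumptions for illustration, not our production settings; the pattern is simply that the risk tier, not the application, picks the parameters.

```python
# Illustrative sampling parameters keyed by risk tier.
SAMPLING_BY_TIER = {
    1: {"temperature": 0.7, "top_p": 0.95},  # benign
    2: {"temperature": 0.3, "top_p": 0.9},   # sensitive
    3: {"temperature": 0.0, "top_p": 1.0},   # high-risk: near-greedy decoding
}

def sampling_params(tier: int) -> dict[str, float]:
    # Unknown tiers fall back to the most conservative settings.
    return SAMPLING_BY_TIER.get(tier, SAMPLING_BY_TIER[3])
```

Failing closed on unknown tiers is the design choice worth noting: a classifier bug should produce an over-constrained response, not an under-constrained one.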

After the Model Responds

Every model output passes through a filtering pipeline: toxicity check, PII leakage check, hallucinated citation check, and format validation. The toxicity classifier is the most expensive component at roughly 25ms per request. We debated using a simpler keyword-based approach, but keyword lists alone missed a substantial portion of the problematic content the classifier caught. Subtle toxicity, implied bias, context-dependent harm — none of that is detectable by string matching.

For applications where factual accuracy matters — which in enterprise is almost always — we verify model outputs against grounding documents. We chunk the model's output into claims, embed each claim, and compare against embeddings of the source documents. Claims that fall below a similarity threshold get flagged. If more than about 20% of claims in a response are flagged, the entire response is suppressed and we fall back to returning relevant source excerpts directly.
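The grounding check reduces to a similarity comparison plus a suppression rule. A sketch over precomputed embedding vectors; the 20% suppression ratio follows the text, while the 0.75 similarity threshold is an assumption for illustration (real values need tuning against your own data).

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def grounding_check(claim_vecs, source_vecs,
                    sim_threshold=0.75, suppress_ratio=0.20):
    """Flag claims whose best match against any source chunk falls
    below the similarity threshold; suppress the whole response when
    the flagged fraction exceeds suppress_ratio."""
    flagged = [
        i for i, claim in enumerate(claim_vecs)
        if max(cosine(claim, src) for src in source_vecs) < sim_threshold
    ]
    suppress = len(flagged) > suppress_ratio * len(claim_vecs)
    return flagged, suppress
```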

This system catches hallucinations well, though it misses cases where the model's fabrication is semantically similar to real content but factually wrong in a subtle way — getting a date or a number slightly off, citing a regulation that almost exists. I'm still not sure we've fully solved that one.

Our output guardrails add roughly 80-120ms to response time depending on output complexity. Combined with input guardrails, total safety overhead is about 130-160ms. On a typical 2.5-second response, that's a 5-6% latency increase.

Product teams notice. They've asked us to make safety checks optional, run them asynchronously, create a "fast path" for trusted users. We've said no to all of these. Safety that can be bypassed is not safety. And retroactive detection means the problematic content has already reached the user — at that point you're doing damage control, not prevention.

What It Costs

Our safety infrastructure adds roughly 15% to total inference costs. Classifiers, embedding operations, additional API calls for multi-model checks — it adds up. On our current volume, that's somewhere around $3,000-$3,500 per month in additional infrastructure costs.

I present this number to leadership every quarter, alongside the number of incidents prevented. So far the math has always worked in our favor. When you put a dollar figure on what it would cost to handle a PII exposure incident — regulatory scrutiny, customer churn, legal liability, reputational damage — the $3K/month looks very reasonable. But you have to actually do that calculation, because "safety prevents something bad from happening" is the hardest possible thing to take credit for organizationally.

What Good Monitoring Looks Like

Guardrails are only as good as your ability to see them working. We built a real-time monitoring system that tracks safety metrics across all layers. Two metrics get special attention in our weekly engineering sync: the guardrail trigger rate trend and the user retry rate. A sudden spike in trigger rates could mean a new attack pattern, a change in user behavior, or a regression in our classifiers. A rising user retry rate means our guardrails are blocking legitimate requests, which degrades the product experience.

We run statistical anomaly detection on the distribution of model outputs — this catches subtle shifts that individual guardrails might miss. We get a few alerts per week, of which roughly one per week turns out to be a genuine issue requiring investigation.
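One simple form of this, sketched over a sliding window of hourly guardrail trigger rates. The z-score approach and the 3-sigma threshold are illustrative assumptions; our production detector is more involved, but the shape of the check is the same.

```python
import statistics

def is_anomalous(history: list[float], current: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag the current trigger rate if it sits more than z_threshold
    standard deviations from the recent window's mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_threshold
```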

In six months of operation, we've had two high-severity alerts. One was a false alarm caused by a partner system sending malformed data. The other was a genuine incident where a prompt template change inadvertently weakened a safety constraint, causing a spike in off-topic outputs. We caught it within about 12 minutes and rolled back the change within 25 minutes. No problematic content reached end users. I attribute that to the monitoring, not to luck.

The Organizational Tension

Product teams have deadlines, quarterly targets, competitive pressure. Every safety check is a potential source of friction, latency, and blocked launches. I've had a product manager tell me, with genuine frustration, that our safety requirements added two weeks to their launch timeline.

They weren't wrong. Safety is expensive and slow, and its payoff is invisible: the reward for doing it well is that nothing bad happens.

What's worked for us is making safety metrics visible and concrete. We don't talk about safety in abstract terms. We talk about the 14,000+ problematic outputs our guardrails caught last year. We talk about the three specific incidents where our system prevented PII exposure. We talk about what those incidents would have cost if they'd reached users.

We've also embedded safety review into our sprint process rather than treating it as a separate gate. Safety requirements get defined alongside functional requirements, implemented alongside features, tested alongside everything else. This reduces the "safety is blocking my launch" friction because safety work is part of the launch, not a hurdle in front of it.

If I Were Starting Over

Three things would change.

I'd invest in evaluation infrastructure before building guardrails. We built guardrails first, then scrambled to measure whether they were working. That's backwards. You need to be able to measure safety before you can improve it.

I'd build a unified safety API from day one instead of the collection of microservices we ended up with. Our current architecture works, but it's operationally complex. A single safety service with configurable policies would be simpler to operate, easier to update, and faster to deploy across new applications.

And I'd hire a dedicated safety engineer earlier. We built this system with generalist engineers who learned as they went. They did excellent work. But a specialist would have helped us avoid several months of trial and error on classifier architectures, threshold tuning, and adversarial testing approaches. I'm still not sure we have the classifier thresholds exactly right, honestly.

AI safety is not a checkbox on a compliance form. It's a multi-layered engineering system that runs in production, on every request, in real time. It costs money and adds latency and creates organizational friction. And it's absolutely worth it — because the first time a production LLM leaks customer data or generates a hallucinated fact that someone acts on, the cost of building guardrails after the fact will be orders of magnitude higher than building them in from the start.
