Red-Teaming Our Own AI: An Enterprise Safety Playbook

About six months ago, one of my engineers found a way to make our enterprise document processing system output fabricated compliance data. It took her roughly forty minutes. The system had been in production for three months at that point — it had passed functional testing, load testing, user acceptance testing, and a third-party security audit. Nobody had ever actually tried to make it lie.

That was the moment I decided we needed a real red-teaming practice. Not a one-off exercise. Not an annual checkbox. Something continuous and methodical — a habit of attacking our own systems before someone else does.

I want to share three specific things we found, because they're the kind of findings that don't show up in the academic papers. They're humbling. And they probably live in your system too.

How We Actually Do It

We run a full red-team cycle quarterly, with lighter testing in between. Four engineers, one dedicated week, trying to break things. The methodology is adapted from traditional penetration testing but modified for LLMs — we start with a threat model, build a structured attack plan, execute it in a mirrored production environment, document everything, and then fix what matters. That's two sentences describing what's actually several weeks of work per quarter. The point isn't the process, though — it's what you find when you take it seriously.
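A cycle like this lives or dies on documentation, so it helps to give each planned attack a fixed shape. Here's a minimal sketch of what a structured attack-plan entry might look like; the schema, field names, and sample entries are illustrative, not our actual tooling:

```python
from dataclasses import dataclass, field

@dataclass
class AttackCase:
    """One entry in a structured attack plan (illustrative schema)."""
    threat: str                   # what could go wrong, from the threat model
    technique: str                # how we try to trigger it
    environment: str = "prod-mirror"  # mirrored production, never live
    findings: list[str] = field(default_factory=list)  # documented as we go
    status: str = "planned"       # planned -> executed -> fixed / accepted

# A quarter's plan is just an ordered list of cases, highest expected impact first.
plan = [
    AttackCase(
        threat="model outputs fabricated compliance data",
        technique="adversarial source document with conflicting figures",
    ),
    AttackCase(
        threat="safety filters miss non-English attacks",
        technique="replay the English attack corpus translated into Hindi and Marathi",
    ),
]
```

The value of a schema this boring is that "document everything" stops being an aspiration: every case either has findings attached or it doesn't.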

Finding One: Our Safety Filters Spoke English

Our system supports English, Hindi, and a few other Indian languages. The safety filters were solid in English. In Hindi, they caught maybe 80% of the same attack patterns. In Marathi, it was closer to 60%.

The problem wasn't the underlying model — it had decent multilingual capability. The problem was our filter layer, which we'd built and tested almost entirely in English. We'd assumed it would generalize. It didn't.

The fix wasn't just "write more Hindi test cases." We had to restructure the filter pipeline to operate on semantic representations rather than surface-level text matching. Detection at the semantic level is language-agnostic by design. It was a two-sprint effort, but it closed an entire class of vulnerability in one go rather than playing whack-a-mole across languages.
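The semantic approach can be sketched as follows. Everything below is a toy stand-in: the hashed bag-of-words embedder substitutes for a real multilingual sentence-embedding model, and the exemplar list and threshold are invented for illustration. The structure is the point: match against meaning-level vectors, not surface strings.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedding: hashed bag-of-words, L2-normalized. A production
    pipeline would use a multilingual sentence-embedding model so that
    semantically similar text in any language lands in the same region."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class SemanticFilter:
    """Flags input whose embedding is close to any known attack exemplar,
    regardless of the language the input is written in."""

    def __init__(self, exemplars: list[str], threshold: float = 0.8):
        self.exemplar_vecs = [embed(e) for e in exemplars]
        self.threshold = threshold

    def is_blocked(self, text: str) -> bool:
        v = embed(text)
        # Cosine similarity (vectors are unit-normalized, so dot product suffices).
        return any(float(v @ e) >= self.threshold for e in self.exemplar_vecs)
```

With a genuinely multilingual embedder, a Marathi paraphrase of an English attack exemplar scores high against it even though no surface string matches, which is what makes the detection language-agnostic.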

I'm still not sure we'd have caught this without the red-team exercise. It's not the kind of gap that shows up in normal functional testing.

Finding Two: "Conversational Boiling"

This one kept me up for a night. One of my engineers — and she deserves real credit for the creativity here — discovered she could gradually shift the model's behavior over a 15-to-20 turn conversation. Each individual turn was completely innocuous. The cumulative effect was that by turn 18 or so, the model would comply with requests it would flatly refuse if you asked cold.

It was social engineering, but against a language model. She called it ‘conversational boiling.’


She called it "conversational boiling." Like the frog that doesn't notice the water getting hotter. Each turn slightly redefined the context, introduced a small assumption, established a precedent the model would reference later. It was social engineering, but against a language model.

What made it scary wasn't that the technique was sophisticated — it was that it worked reliably, and none of our existing safety checks were looking at the conversation trajectory. They were looking at individual turns.

Our fix was a conversation-level safety check that periodically re-evaluates the full conversation arc against our safety criteria, not just the latest message. It adds latency — roughly 200 milliseconds every few turns — but it catches the gradual drift that turn-level filters miss entirely.
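The trajectory check can be sketched like this. The five-turn cadence and the keyword-counting heuristic inside `evaluate_arc` are assumptions for illustration; in a real system that function would send the full transcript to a classifier or judge model, which is where the latency cost comes from.

```python
from dataclasses import dataclass, field

CHECK_EVERY = 5  # assumed cadence: re-evaluate the full arc every N turns

def evaluate_arc(turns: list[str]) -> bool:
    """Judge the WHOLE conversation, not the latest message. This toy
    version counts turns containing drift-signal phrases; a real check
    would be a model call over the concatenated transcript."""
    watchlist = ("bypass", "override", "as we agreed")
    hits = sum(any(w in t.lower() for w in watchlist) for t in turns)
    return hits < 3  # True = arc still looks safe

@dataclass
class Conversation:
    turns: list[str] = field(default_factory=list)

    def add_turn(self, text: str) -> bool:
        """Append a user turn. Every CHECK_EVERY turns, re-evaluate the
        entire arc; returns False when the cumulative drift fails it."""
        self.turns.append(text)
        if len(self.turns) % CHECK_EVERY == 0:
            return evaluate_arc(self.turns)
        return True
```

The key property: each turn can pass a per-turn filter individually while the sequence as a whole fails, which is exactly the gap the boiling technique exploited.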

Finding Three: Our System Prompts Were Not a Secret

In our first red-team cycle, three out of four engineers independently extracted the full system prompt within the first hour. We were using a fairly standard instruction-following pattern, and the model was happy to repeat its instructions back when you asked the right way.

This matters more than it sounds. System prompts often contain business logic, safety rules, and operational details you'd rather not expose. In one case, ours included the specific criteria for flagging a document as "high risk" — which an adversarial user could then use to craft documents that deliberately dodge those criteria. Knowing the rules of the filter is the first step to bypassing it.

We moved sensitive logic out of the prompt and into application code. We also added prompt extraction detection that identifies common extraction patterns and deflects them — but we're explicit internally that this isn't the primary defense. Security through obscurity is not a safety strategy on its own. We'd rather have a system that's safe even if the system prompt leaks.
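The detection layer is easy to sketch. The regexes below are deliberately simple stand-ins, and `respond` is a hypothetical callable representing whatever normally answers the user; real attackers paraphrase endlessly, which is exactly why this is a speed bump rather than the primary defense.

```python
import re

# Illustrative extraction phrasings, not an exhaustive (or production) list.
EXTRACTION_PATTERNS = [
    r"\brepeat\b.*\b(instructions|prompt)\b",
    r"\bwhat (is|was) your system prompt\b",
    r"\bverbatim\b.*\b(instructions|prompt)\b",
    r"\bignore\b.*\b(reveal|print|output)\b.*\b(instructions|prompt)\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in EXTRACTION_PATTERNS]

def looks_like_extraction(message: str) -> bool:
    """True if the message matches a known prompt-extraction pattern."""
    return any(p.search(message) for p in _COMPILED)

def handle(message: str, respond) -> str:
    """Deflect likely extraction attempts instead of answering them;
    the deflection text is illustrative."""
    if looks_like_extraction(message):
        return "I can't share my configuration, but I can help with your documents."
    return respond(message)
```

Because the sensitive logic now lives in application code, a successful bypass of this detector leaks a prompt that no longer contains anything worth stealing.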

The Part Nobody Wants to Hear

The technical work of red-teaming is hard. The organizational work is harder.

Getting four engineers off feature work for a full week every quarter is expensive. The objection I hear from other engineering leads is always some version of: "We're shipping features. We don't have time for that." My response is always the same: you don't have time for a production incident where your enterprise AI generates fabricated financial data for a regulated client. The reputational and legal cost of that incident will dwarf what you spend on red-teaming by several orders of magnitude.

We've also learned that a red-team cycle that finds nothing is a failed cycle. It means we weren't creative enough. We've never had a zero-finding cycle — which I take as evidence that we're being honest with ourselves.

If you're building enterprise AI and don't have something like this, I'd say: start small. Two engineers, three days, the OWASP Top 10 as your attack plan. Document what you find. Fix what matters. Then do it again next quarter, slightly bigger. The goal isn't perfection — it's finding your own problems before your users do, before your clients do, before a regulator does.

It's worth it. I say that having lived through the moment that convinced me.

References

  1. NIST. (2023). AI Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology. nist.gov
  2. OWASP. (2025). OWASP Top 10 for Large Language Model Applications. owasp.org
  3. Anthropic. (2024). Responsible Scaling Policy. anthropic.com
  4. Perez, E., et al. (2022). "Red Teaming Language Models with Language Models." arXiv preprint arXiv:2202.03286.
  5. Ganguli, D., et al. (2022). "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned." arXiv preprint arXiv:2209.07858.
  6. Greshake, K., et al. (2023). "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." arXiv preprint arXiv:2302.12173.
  7. Microsoft. (2024). AI Red Team Playbook. Microsoft Security Engineering.