Observability for LLM-Powered Systems
Traditional observability tells you what your system is doing: requests per second, error rates, latency. LLM observability has to also tell you what your system is thinking — and whether that thinking is any good.
That gap is where most teams get into trouble. They instrument their APIs carefully. They monitor uptime. And then they treat the model itself as a black box and hope nothing weird happens. For a while, nothing does. And then something does, and they have logs but no insight, and debugging becomes archaeology.
I learned this the expensive way.
The Incident That Changed How I Think About This
We had a summarization pipeline for a client — internal document review, nothing customer-facing, but the outputs fed into real decisions. The system ran quietly for several weeks after launch. Good feedback, no complaints, metrics looked fine.
Then we started getting escalations. Documents were being summarized with confident-sounding claims that weren't in the source material. Not obvious hallucinations — plausible-sounding additions that only showed up as wrong when someone cross-referenced the original. It had been happening for roughly ten days before anyone noticed.
We dug into the logs and found the problem: we'd been logging the final output but not the full prompt context. The model had silently started receiving malformed retrieved content — a bug quietly introduced into our chunking logic during a dependency update — and we had no way to reconstruct what the model had actually seen when it produced those outputs. We knew something was wrong. We couldn't prove what, when it started, or which summaries were affected. The client had to treat roughly 300 documents as unverified and redo a significant chunk of the review.
That incident cost us real time and real trust. And it was entirely a logging problem. The model itself was doing what models do — producing fluent, confident text based on whatever context it received. We'd just stopped watching what context it was getting.
What to Log (Actually)
For every LLM call, I now capture the complete picture. Timestamp, user ID, session ID, request ID, environment, model identifier — the standard stuff. But more importantly: the full prompt, including the system message and all retrieved context. Not a summary. Not a hash. The actual text the model saw. Redact PII, but preserve structure.
Capture the complete model response, including any tool calls or structured outputs. Log token counts — input, output, total — because these drive cost tracking and quota management and will matter when you're trying to understand why spending spiked. Log latency broken down: time to first token, total generation time, retrieval latency if applicable. Log model parameters — temperature, max tokens, anything that affects output.
The one that teams most often skip: outcome signal. Did the user accept the response? Did they regenerate? Did they edit heavily? This is your actual ground truth about whether the system is working. Everything else is just technical telemetry. The outcome signal is the business signal.
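To make that concrete, here's a minimal sketch of the record shape I'm describing, written as a Python dataclass. The field names are illustrative rather than any standard schema, and the outcome field gets filled in asynchronously once the user acts on the response.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class LLMCallRecord:
    # Standard request metadata
    timestamp: float
    request_id: str
    session_id: str            # hashed, never the raw identifier
    user_id: str               # hashed
    environment: str           # "prod", "staging", ...
    model: str                 # exact model identifier, including version
    # The part teams skip: the full context the model actually saw
    system_prompt: str
    retrieved_context: list[str]
    user_prompt: str
    # Complete response, including tool calls / structured output
    response_text: str
    tool_calls: list[dict] = field(default_factory=list)
    # Cost and latency
    input_tokens: int = 0
    output_tokens: int = 0
    time_to_first_token_s: float = 0.0
    total_latency_s: float = 0.0
    retrieval_latency_s: float = 0.0
    # Parameters that affect output
    temperature: float = 0.0
    max_tokens: int = 0
    # Outcome signal, filled in once the user acts
    outcome: str = "unknown"   # "accepted" | "regenerated" | "edited" | "abandoned"

def log_record(record: LLMCallRecord) -> None:
    # One JSON object per line, ready to ship to Elasticsearch / Loki
    print(json.dumps(asdict(record)))
```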
Storage costs for full prompt logging are real. Compress aggressively, sample if volume is enormous, but genuinely do not fly blind. The incident I described above had a six-figure remediation cost. The storage for complete logging would have been a few hundred dollars a month.
Cost Tracking That Actually Helps
LLM costs can spiral without warning. This is not theoretical — I've seen monthly bills double in three weeks because a single new feature with a verbose prompt structure got more adoption than expected.
The dashboard I maintain answers a few core questions. What's our daily, weekly, and monthly spend, tracked against budget with alerts at meaningful thresholds? Where is the spend actually going — broken down by feature, user segment, and prompt type? One verbose feature can dominate the cost profile in ways that aren't visible at the aggregate level.
Most importantly: what's the cost per successful outcome, not cost per request. Cost per request tells you about volume. Cost per successful outcome tells you about efficiency. A feature with high cost per request but excellent completion rates might be perfectly acceptable. A feature with low cost per request but 60% abandonment is burning money without delivering value.
I also track what I think of as waste signals: regeneration rate (user asked for a new response, implying the first one wasn't useful), abandonment rate (user didn't use the response at all), error rate. High waste is where your optimization work should focus. It means you're burning tokens without delivering value.
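Here's a sketch of how those numbers fall out of the log records above. The per-1k-token prices are placeholders; substitute your provider's actual rates, and note that "accepted" is whatever your outcome signal defines as success.

```python
def cost_usd(r: LLMCallRecord,
             price_per_1k_input: float = 0.003,    # placeholder rate
             price_per_1k_output: float = 0.015) -> float:  # placeholder rate
    return (r.input_tokens / 1000) * price_per_1k_input \
         + (r.output_tokens / 1000) * price_per_1k_output

def cost_and_waste(records: list[LLMCallRecord]) -> dict:
    n = max(len(records), 1)
    total_cost = sum(cost_usd(r) for r in records)
    accepted = sum(r.outcome == "accepted" for r in records)
    return {
        "cost_per_request": total_cost / n,
        # The efficiency number: what it actually costs to get one useful result
        "cost_per_successful_outcome": total_cost / max(accepted, 1),
        # Waste signals: tokens burned without delivering value
        "regeneration_rate": sum(r.outcome == "regenerated" for r in records) / n,
        "abandonment_rate": sum(r.outcome == "abandoned" for r in records) / n,
    }
```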
Detecting Model Drift
Models change. Providers update weights, adjust safety filters, modify behavior — often without announcement, sometimes without changelog entries. I've seen production accuracy drop noticeably overnight due to a silent model update, and it took us longer than it should have to realize the model was the cause rather than something in our own stack.
The mechanism I rely on most: golden test sets. A fixed collection of inputs with known-good outputs, run daily, with alerts on divergence. This catches gross regressions quickly. It won't catch subtle behavioral shifts, but it catches the ones that actually affect user outcomes.
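A daily golden-set run doesn't need much machinery. A rough sketch: `call_model(model_id, prompt)` stands in for whatever client you actually use, and the naive text similarity is a stand-in for task-specific checks.

```python
import difflib

# call_model(model_id, prompt) -> str is a placeholder for your actual client call

def run_golden_set(golden: list[dict], model_id: str, threshold: float = 0.8) -> list[dict]:
    """golden: [{"id": ..., "prompt": ..., "reference": ...}, ...], run on a daily schedule."""
    failures = []
    for case in golden:
        output = call_model(model_id, case["prompt"])
        # Naive similarity; swap in task-specific checks (required fields, key claims, etc.)
        score = difflib.SequenceMatcher(None, output, case["reference"]).ratio()
        if score < threshold:
            failures.append({"id": case["id"], "score": round(score, 3)})
    return failures  # a non-empty list is what fires the alert
```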
Distribution monitoring is the next layer. Track output characteristics over time — average response length, refusal rate, formatting consistency. Significant shifts in these distributions warrant investigation even if the golden tests pass. The refusal rate one has caught real issues for me; if it jumps by 20% overnight, something changed, and you want to know what before users start calling.
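A sketch of what that layer can look like, assuming you aggregate responses into daily batches. The refusal markers are deliberately crude placeholders; tune them to how your model actually phrases refusals.

```python
# Markers are placeholders; adjust to your model's actual refusal phrasing
REFUSAL_MARKERS = ("i can't help", "i'm unable to", "i cannot assist")

def refusal_rate(responses: list[str]) -> float:
    n = max(len(responses), 1)
    return sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses) / n

def _avg_len(responses: list[str]) -> float:
    return sum(map(len, responses)) / max(len(responses), 1)

def drift_alerts(today: list[str], baseline: list[str]) -> list[str]:
    alerts = []
    # A shift in average response length is often the first visible symptom of a model change
    if abs(_avg_len(today) - _avg_len(baseline)) / max(_avg_len(baseline), 1) > 0.25:
        alerts.append(f"avg response length shifted: {_avg_len(baseline):.0f} -> {_avg_len(today):.0f} chars")
    if refusal_rate(today) > refusal_rate(baseline) * 1.2:
        alerts.append("refusal rate up more than 20% vs baseline")
    return alerts
```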
If you can afford it: run a sample of production traffic against a pinned model version in shadow mode. Compare outputs. Divergence indicates drift. This is operationally more expensive but gives you the most direct signal.
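Sketched below, again with `call_model` as a placeholder client and a simple similarity ratio standing in for whatever divergence measure fits your outputs.

```python
import difflib
import random

def shadow_divergence(prod_records: list[LLMCallRecord], pinned_model: str,
                      sample_rate: float = 0.02, threshold: float = 0.7) -> float:
    """Re-run a sample of production prompts against a pinned model version
    and report the fraction of outputs that diverge noticeably."""
    sample = [r for r in prod_records if random.random() < sample_rate]
    diverged = 0
    for r in sample:
        shadow_out = call_model(pinned_model, r.user_prompt)  # placeholder client call
        if difflib.SequenceMatcher(None, r.response_text, shadow_out).ratio() < threshold:
            diverged += 1
    return diverged / max(len(sample), 1)
```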
And watch user satisfaction metrics for unexplained changes. If feedback scores drop without any code changes on your end, suspect the model before you suspect your product.
Hallucination Detection
Hallucinations are the defining failure mode of LLMs in production, and they're harder to detect systematically than most teams expect.
For RAG-based systems where outputs should be grounded in retrieved documents: verify that factual claims appear in the source material. Flag outputs that introduce information not present in the context. This doesn't catch all hallucinations, but it catches the most common class — the model filling in gaps with plausible but unsupported facts.
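Here's a deliberately naive sketch of that grounding check based on word overlap. It's a starting point, not a production detector; teams usually graduate to an entailment or NLI model for this.

```python
import re

def ungrounded_sentences(output: str, context: str, min_overlap: float = 0.5) -> list[str]:
    """Flag output sentences whose content words barely appear in the retrieved context."""
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", output):
        words = [w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if len(w) > 3]
        if not words:
            continue
        support = sum(w in context_words for w in words) / len(words)
        if support < min_overlap:
            flagged.append(sentence)  # candidate hallucination: route to sampling / review
    return flagged
```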
Consistency checks help: ask the same question multiple ways and flag contradictions. It's computationally expensive so you sample rather than run it on everything, but it catches a real category of problem.
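A sketch of the idea, using the same `call_model` placeholder; pairwise similarity is a blunt proxy for a real contradiction check, which is part of why this only runs on a sample.

```python
import difflib

def consistency_flag(question: str, paraphrases: list[str], model_id: str,
                     threshold: float = 0.6) -> bool:
    """Ask the same question several ways; return True if any pair of answers looks inconsistent."""
    answers = [call_model(model_id, q) for q in [question, *paraphrases]]  # placeholder client call
    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            if difflib.SequenceMatcher(None, answers[i], answers[j]).ratio() < threshold:
                return True  # answers disagree enough to warrant review
    return False
```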
Human review sampling — randomly reviewing a small percentage of outputs with domain experts — is expensive and doesn't scale, but it provides ground truth that lets you calibrate your automated detectors. Without it, you don't know how well your automated detection is actually working.
Tag suspected hallucinations in logs and track rates over time. A rising hallucination rate is an early warning sign that something in the system needs attention, whether that's a model change, a retrieval problem, or a prompt that's started producing unreliable outputs.
The Stack
What I actually use: structured JSON logs with a consistent schema, shipped to a central store (I've used Elasticsearch and Loki depending on the client's existing infrastructure). Prometheus and Grafana for time-series metrics — token counts, latency, costs, quality signals. OpenTelemetry for distributed tracing when LLM calls are part of larger workflows with retrieval, preprocessing, post-processing. Langfuse for prompt/completion-level visibility when the team needs to iterate on prompts without digging through raw logs.
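As one small example of the Prometheus side, here is roughly how token, latency, and outcome metrics can be exposed with prometheus_client. The metric names and the `feature` label are my own convention, not anything standard, and it assumes the record sketch from earlier.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Metric and label names are illustrative; labeling by feature makes per-feature
# cost and latency queryable in Grafana
LLM_TOKENS = Counter("llm_tokens_total", "Tokens consumed", ["feature", "direction"])
LLM_LATENCY = Histogram("llm_request_seconds", "End-to-end LLM call latency", ["feature"])
LLM_OUTCOMES = Counter("llm_outcomes_total", "User outcome per response", ["feature", "outcome"])

def record_call_metrics(record: LLMCallRecord, feature: str) -> None:
    LLM_TOKENS.labels(feature=feature, direction="input").inc(record.input_tokens)
    LLM_TOKENS.labels(feature=feature, direction="output").inc(record.output_tokens)
    LLM_LATENCY.labels(feature=feature).observe(record.total_latency_s)
    LLM_OUTCOMES.labels(feature=feature, outcome=record.outcome).inc()

start_http_server(9100)  # expose /metrics for Prometheus to scrape
```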
Alerting for: cost spikes, latency degradation, error rate increases, and the drift detection signals mentioned above. PagerDuty or equivalent depending on severity thresholds.
The key principle: use observability tooling that understands prompt/completion semantics, not just generic logging infrastructure. Generic logs will tell you a request took 3.2 seconds. LLM-aware observability will tell you the prompt was 8,000 tokens, the retrieval stage contributed 1.4 seconds, and the user abandoned the output after 30 seconds — all of which matter for different reasons.
Privacy and the Tension It Creates
Full prompt logging creates real privacy obligations, and the tension between debuggability and data minimization doesn't resolve cleanly. I want to be honest about that.
PII redaction before logging is mandatory. Names, emails, phone numbers, account numbers — strip them or hash them. If you need to correlate sessions, use deterministic hashing rather than storing the raw identifier. Define retention policies explicitly; most teams don't, and then discover they've been storing sensitive conversation logs for years with no justification. Implement role-based access — not everyone who can read application logs should be able to read user prompts.
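A minimal sketch of both pieces: regex-based redaction and deterministic hashing for correlation. The patterns are illustrative and nowhere near exhaustive; real redaction needs rules specific to your data and jurisdiction.

```python
import hashlib
import re

# Patterns are illustrative, not exhaustive
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "account": re.compile(r"\b\d{8,16}\b"),
}

def redact(text: str) -> str:
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{name.upper()}]", text)  # keep structure, drop the value
    return text

def hash_id(raw_id: str, salt: str) -> str:
    # Deterministic: the same user or session always maps to the same token,
    # so sessions stay correlatable without storing the raw identifier
    return hashlib.sha256((salt + raw_id).encode()).hexdigest()[:16]
```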
Be transparent with users about what you log. If opt-out is offered, honor it and ensure that the opt-out actually works in your pipeline, not just in your privacy policy.
I don't have a clean answer to the tension between "log everything for full debuggability" and "minimize data retention for privacy." The right balance depends on regulatory context, user expectations, and what your audit requirements actually are. What I do know is that you need to make this decision explicitly rather than defaulting to "log everything and figure it out later."
The Loop
Observability isn't just for debugging individual incidents. It's the feedback mechanism that lets you improve the system over time. You identify failures through monitoring, diagnose root causes with detailed logs and traces, implement fixes, verify improvements through controlled testing, and repeat.
The teams I've seen run reliable LLM systems in production treat observability as infrastructure, not as something you set up during incident response. Without it, you're guessing. The difference between guessing and engineering is what shows up in production six months after launch.