Observability for LLM-Powered Systems
Traditional observability tells you what your system is doing: requests per second, error rates, latency. LLM observability has to also tell you what your system is thinking — and whether that thinking is any good.
That gap is where most teams get into trouble. They instrument their APIs carefully. They monitor uptime. And then they treat the model itself as a black box and hope nothing weird happens. For a while, nothing does. And then something does, and they have logs but no insight, and debugging becomes archaeology.
I learned this the expensive way.
The Incident That Changed How I Think About This
We had a summarization pipeline for a client — internal document review, nothing customer-facing, but the outputs fed into real decisions. The system ran quietly for several weeks after launch. Good feedback, no complaints, metrics looked fine.
Then we started getting escalations. Documents were being summarized with confident-sounding claims that weren't in the source material. Not obvious hallucinations — plausible-sounding additions that only showed up as wrong when someone cross-referenced the original. It had been happening for roughly ten days before anyone noticed.
We dug into the logs and found the problem: we'd been logging the final output but not the full prompt context. The model had silently started receiving malformed retrieved content — a bug quietly introduced into our chunking logic during a dependency update — and we had no way to reconstruct what the model had actually seen when it produced those outputs. We knew something was wrong. We couldn't prove what, when it started, or which summaries were affected. The client had to treat roughly 300 documents as unverified and redo a significant chunk of the review.
That incident cost us real time and real trust. And it was entirely a logging problem. The model itself was doing what models do — producing fluent, confident text based on whatever context it received. We'd just stopped watching what context it was getting.
What to Log (Actually)
For every LLM call, I now capture the complete picture. Timestamp, user ID, session ID, request ID, environment, model identifier — the standard stuff. But more importantly: the full prompt, including the system message and all retrieved context. Not a summary. Not a hash. The actual text the model saw. Redact PII, but preserve structure.
Capture the complete model response, including any tool calls or structured outputs. Log token counts — input, output, total — because these drive cost tracking and quota management and will matter when you're trying to understand why spending spiked. Log latency broken down: time to first token, total generation time, retrieval latency if applicable. Log model parameters — temperature, max tokens, anything that affects output.
The one that teams most often skip: outcome signal. Did the user accept the response? Did they regenerate? Did they edit heavily? This is your actual ground truth about whether the system is working. Everything else is just technical telemetry. The outcome signal is the business signal.
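To make that concrete, here's a minimal sketch of the record shape I'm describing, written as a Python dataclass. The field names are illustrative rather than any standard schema, and the outcome field gets filled in asynchronously once the user acts on the response.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class LLMCallRecord:
    # Standard request metadata
    timestamp: float
    request_id: str
    session_id: str            # hashed, never the raw identifier
    user_id: str               # hashed
    environment: str           # "prod", "staging", ...
    model: str                 # exact model identifier, including version
    # The part teams skip: the full context the model actually saw
    system_prompt: str
    retrieved_context: list[str]
    user_prompt: str
    # Complete response, including tool calls / structured output
    response_text: str
    tool_calls: list[dict] = field(default_factory=list)
    # Cost and latency
    input_tokens: int = 0
    output_tokens: int = 0
    time_to_first_token_s: float = 0.0
    total_latency_s: float = 0.0
    retrieval_latency_s: float = 0.0
    # Parameters that affect output
    temperature: float = 0.0
    max_tokens: int = 0
    # Outcome signal, filled in once the user acts
    outcome: str = "unknown"   # "accepted" | "regenerated" | "edited" | "abandoned"

def log_record(record: LLMCallRecord) -> None:
    # One JSON object per line, ready to ship to Elasticsearch / Loki
    print(json.dumps(asdict(record)))
```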
Storage costs for full prompt logging are real. Compress aggressively, sample if volume is enormous, but genuinely do not fly blind. The incident I described above had a six-figure remediation cost. The storage for complete logging would have been a few hundred dollars a month.
Cost Tracking That Actually Helps
LLM costs can spiral without warning. This is not theoretical — I've seen monthly bills double in three weeks because a single new feature with a verbose prompt structure got more adoption than expected.
The dashboard I maintain answers a few core questions. What's our daily, weekly, and monthly spend, tracked against budget with alerts at meaningful thresholds? Where is the spend actually going — broken down by feature, user segment, and prompt type? One verbose feature can dominate the cost profile in ways that aren't visible at the aggregate level.
Most importantly: what's the cost per successful outcome, not cost per request. Cost per request tells you about volume. Cost per successful outcome tells you about efficiency. A feature with high cost per request but excellent completion rates might be perfectly acceptable. A feature with low cost per request but 60% abandonment is burning money without delivering value.
I also track what I think of as waste signals: regeneration rate (user asked for a new response, implying the first one wasn't useful), abandonment rate (user didn't use the response at all), error rate. High waste is where your optimization work should focus. It means you're burning tokens without delivering value.
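Here's a sketch of how those numbers fall out of the log records above. The per-1k-token prices are placeholders; substitute your provider's actual rates, and note that "accepted" is whatever your outcome signal defines as success.

```python
def cost_usd(r: LLMCallRecord,
             price_per_1k_input: float = 0.003,    # placeholder rate
             price_per_1k_output: float = 0.015) -> float:  # placeholder rate
    return (r.input_tokens / 1000) * price_per_1k_input \
         + (r.output_tokens / 1000) * price_per_1k_output

def cost_and_waste(records: list[LLMCallRecord]) -> dict:
    n = max(len(records), 1)
    total_cost = sum(cost_usd(r) for r in records)
    accepted = sum(r.outcome == "accepted" for r in records)
    return {
        "cost_per_request": total_cost / n,
        # The efficiency number: what it actually costs to get one useful result
        "cost_per_successful_outcome": total_cost / max(accepted, 1),
        # Waste signals: tokens burned without delivering value
        "regeneration_rate": sum(r.outcome == "regenerated" for r in records) / n,
        "abandonment_rate": sum(r.outcome == "abandoned" for r in records) / n,
    }
```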
Detecting Model Drift
Models change. Providers update weights, adjust safety filters, modify behavior — often without announcement, sometimes without changelog entries. I've seen production accuracy drop noticeably overnight due to a silent model update, and it took us longer than it should have to realize the model was the cause rather than something in our own stack.
The mechanism I rely on most: golden test sets. A fixed collection of inputs with known-good outputs, run daily, with alerts on divergence. This catches gross regressions quickly. It won't catch subtle behavioral shifts, but it catches the ones that actually affect user outcomes.
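A daily golden-set run doesn't need much machinery. A rough sketch: `call_model(model_id, prompt)` stands in for whatever client you actually use, and the naive text similarity is a stand-in for task-specific checks.

```python
import difflib

# call_model(model_id, prompt) -> str is a placeholder for your actual client call

def run_golden_set(golden: list[dict], model_id: str, threshold: float = 0.8) -> list[dict]:
    """golden: [{"id": ..., "prompt": ..., "reference": ...}, ...], run on a daily schedule."""
    failures = []
    for case in golden:
        output = call_model(model_id, case["prompt"])
        # Naive similarity; swap in task-specific checks (required fields, key claims, etc.)
        score = difflib.SequenceMatcher(None, output, case["reference"]).ratio()
        if score < threshold:
            failures.append({"id": case["id"], "score": round(score, 3)})
    return failures  # a non-empty list is what fires the alert
```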
Distribution monitoring is the next layer. Track output characteristics over time — average response length, refusal rate, formatting consistency. Significant shifts in these distributions warrant investigation even if the golden tests pass. The refusal rate one has caught real issues for me; if it jumps by 20% overnight, something changed, and you want to know what before users start calling.
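A sketch of what that layer can look like, assuming you aggregate responses into daily batches. The refusal markers are deliberately crude placeholders; tune them to how your model actually phrases refusals.

```python
# Markers are placeholders; adjust to your model's actual refusal phrasing
REFUSAL_MARKERS = ("i can't help", "i'm unable to", "i cannot assist")

def refusal_rate(responses: list[str]) -> float:
    n = max(len(responses), 1)
    return sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses) / n

def _avg_len(responses: list[str]) -> float:
    return sum(map(len, responses)) / max(len(responses), 1)

def drift_alerts(today: list[str], baseline: list[str]) -> list[str]:
    alerts = []
    # A shift in average response length is often the first visible symptom of a model change
    if abs(_avg_len(today) - _avg_len(baseline)) / max(_avg_len(baseline), 1) > 0.25:
        alerts.append(f"avg response length shifted: {_avg_len(baseline):.0f} -> {_avg_len(today):.0f} chars")
    if refusal_rate(today) > refusal_rate(baseline) * 1.2:
        alerts.append("refusal rate up more than 20% vs baseline")
    return alerts
```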
If you can afford it: run a sample of production traffic against a pinned model version in shadow mode. Compare outputs. Divergence indicates drift. This is operationally more expensive but gives you the most direct signal.
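Sketched below, again with `call_model` as a placeholder client and a simple similarity ratio standing in for whatever divergence measure fits your outputs.

```python
import difflib
import random

def shadow_divergence(prod_records: list[LLMCallRecord], pinned_model: str,
                      sample_rate: float = 0.02, threshold: float = 0.7) -> float:
    """Re-run a sample of production prompts against a pinned model version
    and report the fraction of outputs that diverge noticeably."""
    sample = [r for r in prod_records if random.random() < sample_rate]
    diverged = 0
    for r in sample:
        shadow_out = call_model(pinned_model, r.user_prompt)  # placeholder client call
        if difflib.SequenceMatcher(None, r.response_text, shadow_out).ratio() < threshold:
            diverged += 1
    return diverged / max(len(sample), 1)
```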
And watch user satisfaction metrics for unexplained changes. If feedback scores drop without any code changes on your end, suspect the model before you suspect your product.
Hallucination Detection
Hallucinations are the defining failure mode of LLMs in production, and they're harder to detect systematically than most teams expect.
For RAG-based systems where outputs should be grounded in retrieved documents: verify that factual claims appear in the source material. Flag outputs that introduce information not present in the context. This doesn't catch all hallucinations, but it catches the most common class — the model filling in gaps with plausible but unsupported facts.
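Here's a deliberately naive sketch of that grounding check based on word overlap. It's a starting point, not a production detector; teams usually graduate to an entailment or NLI model for this.

```python
import re

def ungrounded_sentences(output: str, context: str, min_overlap: float = 0.5) -> list[str]:
    """Flag output sentences whose content words barely appear in the retrieved context."""
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", output):
        words = [w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if len(w) > 3]
        if not words:
            continue
        support = sum(w in context_words for w in words) / len(words)
        if support < min_overlap:
            flagged.append(sentence)  # candidate hallucination: route to sampling / review
    return flagged
```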
Consistency checks help: ask the same question multiple ways and flag contradictions. It's computationally expensive so you sample rather than run it on everything, but it catches a real category of problem.
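A sketch of the idea, using the same `call_model` placeholder; pairwise similarity is a blunt proxy for a real contradiction check, which is part of why this only runs on a sample.

```python
import difflib

def consistency_flag(question: str, paraphrases: list[str], model_id: str,
                     threshold: float = 0.6) -> bool:
    """Ask the same question several ways; return True if any pair of answers looks inconsistent."""
    answers = [call_model(model_id, q) for q in [question, *paraphrases]]  # placeholder client call
    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            if difflib.SequenceMatcher(None, answers[i], answers[j]).ratio() < threshold:
                return True  # answers disagree enough to warrant review
    return False
```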
Human review sampling — randomly reviewing a small percentage of outputs with domain experts — is expensive and doesn't scale, but it provides ground truth that lets you calibrate your automated detectors. Without it, you don't know how well your automated detection is actually working.
Tag suspected hallucinations in logs and track rates over time. A rising hallucination rate is an early warning sign that something in the system needs attention, whether that's a model change, a retrieval problem, or a prompt that's started producing unreliable outputs.
The Stack
What I actually use: structured JSON logs with a consistent schema, shipped to a central store (I've used Elasticsearch and Loki depending on the client's existing infrastructure). Prometheus and Grafana for time-series metrics — token counts, latency, costs, quality signals. OpenTelemetry for distributed tracing when LLM calls are part of larger workflows with retrieval, preprocessing, post-processing. Langfuse for prompt/completion-level visibility when the team needs to iterate on prompts without digging through raw logs.
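As one small example of the Prometheus side, here is roughly how token, latency, and outcome metrics can be exposed with prometheus_client. The metric names and the `feature` label are my own convention, not anything standard, and it assumes the record sketch from earlier.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Metric and label names are illustrative; labeling by feature makes per-feature
# cost and latency queryable in Grafana
LLM_TOKENS = Counter("llm_tokens_total", "Tokens consumed", ["feature", "direction"])
LLM_LATENCY = Histogram("llm_request_seconds", "End-to-end LLM call latency", ["feature"])
LLM_OUTCOMES = Counter("llm_outcomes_total", "User outcome per response", ["feature", "outcome"])

def record_call_metrics(record: LLMCallRecord, feature: str) -> None:
    LLM_TOKENS.labels(feature=feature, direction="input").inc(record.input_tokens)
    LLM_TOKENS.labels(feature=feature, direction="output").inc(record.output_tokens)
    LLM_LATENCY.labels(feature=feature).observe(record.total_latency_s)
    LLM_OUTCOMES.labels(feature=feature, outcome=record.outcome).inc()

start_http_server(9100)  # expose /metrics for Prometheus to scrape
```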
Alerting for: cost spikes, latency degradation, error rate increases, and the drift detection signals mentioned above. PagerDuty or equivalent depending on severity thresholds.
The key principle: use observability tooling that understands prompt/completion semantics, not just generic logging infrastructure. Generic logs will tell you a request took 3.2 seconds. LLM-aware observability will tell you the prompt was 8,000 tokens, the retrieval stage contributed 1.4 seconds, and the user abandoned the output after 30 seconds — all of which matter for different reasons.
Privacy and the Tension It Creates
Full prompt logging creates real privacy obligations, and the tension between debuggability and data minimization doesn't resolve cleanly. I want to be honest about that.
PII redaction before logging is mandatory. Names, emails, phone numbers, account numbers — strip them or hash them. If you need to correlate sessions, use deterministic hashing rather than storing the raw identifier. Define retention policies explicitly; most teams don't, and then discover they've been storing sensitive conversation logs for years with no justification. Implement role-based access — not everyone who can read application logs should be able to read user prompts.
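A minimal sketch of both pieces: regex-based redaction and deterministic hashing for correlation. The patterns are illustrative and nowhere near exhaustive; real redaction needs rules specific to your data and jurisdiction.

```python
import hashlib
import re

# Patterns are illustrative, not exhaustive
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "account": re.compile(r"\b\d{8,16}\b"),
}

def redact(text: str) -> str:
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{name.upper()}]", text)  # keep structure, drop the value
    return text

def hash_id(raw_id: str, salt: str) -> str:
    # Deterministic: the same user or session always maps to the same token,
    # so sessions stay correlatable without storing the raw identifier
    return hashlib.sha256((salt + raw_id).encode()).hexdigest()[:16]
```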
Be transparent with users about what you log. If opt-out is offered, honor it and ensure that the opt-out actually works in your pipeline, not just in your privacy policy.
I don't have a clean answer to the tension between "log everything for full debuggability" and "minimize data retention for privacy." The right balance depends on regulatory context, user expectations, and what your audit requirements actually are. What I do know is that you need to make this decision explicitly rather than defaulting to "log everything and figure it out later."
The Loop
Observability isn't just for debugging individual incidents. It's the feedback mechanism that lets you improve the system over time. You identify failures through monitoring, diagnose root causes with detailed logs and traces, implement fixes, verify improvements through controlled testing, and repeat.
The teams I've seen run reliable LLM systems in production treat observability as infrastructure, not as something you set up during incident response. Without it, you're guessing. The difference between guessing and engineering is what shows up in production six months after launch.