Category Guide

LLM Monitoring: Track Latency, Costs, and Errors in Production

Continuous observation of LLM-powered features in production — latency, error rates, token usage, and the metric most teams miss: cost per request.

What is LLM Monitoring?

LLM monitoring is the practice of continuously observing the behavior, performance, and cost of large language model API calls in production. Every time your application sends a request to OpenAI, Anthropic, Google, or any other LLM provider, that request has measurable properties — how long it took, whether it succeeded, how many tokens it consumed, and what it cost. LLM monitoring captures these signals in real time and surfaces them as dashboards, trends, and alerts.

Traditional application monitoring tracks request latency, HTTP status codes, and throughput. LLM monitoring adds a layer specific to AI workloads: token-based cost calculation, model-specific performance tracking, and per-customer cost attribution. A standard API call that returns 200 OK tells your APM tool everything is fine. LLM monitoring tells you that same call consumed 4,000 output tokens on a frontier model and cost $0.12 — information that is invisible to traditional monitoring tools.

The goal is operational awareness. You want to know, at any given moment, whether your LLM-powered features are working, how fast they are responding, how much they are costing, and whether anything has changed. Without monitoring, the first sign of a problem is usually the monthly bill — or a customer complaint about slow responses.

Key Metrics for LLM Monitoring

Latency (p50 / p95 / p99) measures how long each LLM call takes to return a response. The median (p50) tells you the typical experience. The p95 and p99 tell you about the tail — the slowest requests that affect your most patient users. LLM latency is highly variable: a short classification prompt might return in 200ms, while a long document analysis with streaming disabled might take 15 seconds. Monitoring percentiles separately prevents slow outliers from hiding behind a healthy average.
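As a sketch of why percentiles matter, the helper below computes p50/p95/p99 from a window of recorded latencies using only the Python standard library (function and variable names are illustrative, not part of any particular monitoring SDK):

```python
from statistics import quantiles

def latency_percentiles(samples_ms):
    """Return p50/p95/p99 from a list of per-request latencies (ms)."""
    # quantiles(n=100) yields the 99 percentile cut points p1..p99
    cuts = quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# A healthy-looking average can hide a slow tail:
samples = [200] * 95 + [15000] * 5   # 95 fast calls, 5 slow ones
stats = latency_percentiles(samples)
# mean is 940 ms, but p99 is 15,000 ms — the average hides the tail
```

Here the mean (940 ms) looks fine while one request in twenty takes 15 seconds, which is exactly the outlier-hiding effect described above.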

Error rates include HTTP failures, provider timeouts, rate limit hits (429s), and content filter rejections. LLM APIs are less reliable than most SaaS APIs — provider outages, rate limits, and model deprecations all cause errors that your application needs to handle. Monitoring error rates by model and by provider lets you spot degradation before it affects users.
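One illustrative way to bucket responses for per-model error-rate tracking is shown below. The categories are assumptions — status-code conventions vary by provider (content-filter rejections, for example, surface differently across APIs), so treat this as a sketch rather than a canonical taxonomy:

```python
from collections import Counter

def classify(status_code, timed_out=False):
    """Bucket an LLM API response for error-rate tracking."""
    if timed_out:
        return "timeout"
    if status_code == 429:
        return "rate_limited"
    if status_code >= 500:
        return "provider_error"
    if status_code >= 400:
        return "client_error"   # where content-filter rejections often land
    return "ok"

# Count outcomes per (provider, model) so degradation is visible per model
counts = Counter()
for provider, model, status in [
    ("openai", "gpt-4o", 200),
    ("openai", "gpt-4o", 429),
    ("anthropic", "claude-sonnet", 200),
]:
    counts[(provider, model, classify(status))] += 1
```

Keying the counter on (provider, model) is the point: an aggregate error rate can look flat while one model's 429s climb.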

Token usage per request drives cost directly. Input tokens and output tokens are priced differently — output tokens are typically 3x to 5x more expensive. Monitoring token usage by feature and by customer reveals where your spend is concentrated. A feature that generates long responses (code generation, document drafting) has a fundamentally different cost profile than one that generates short responses (classification, extraction), even if both make the same number of API calls.
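The token-to-cost arithmetic is straightforward; the sketch below shows the shape of it. The per-million-token prices are illustrative placeholders — real prices vary by provider and change over time, so do not treat these numbers as a pricing reference:

```python
# Illustrative per-million-token prices (placeholders, not real price data).
PRICES_PER_MTOK = {
    "small-model":    {"input": 0.15, "output": 0.60},
    "frontier-model": {"input": 3.00, "output": 15.00},
}

def request_cost(model, input_tokens, output_tokens):
    """Cost of one request, with input and output tokens priced separately."""
    p = PRICES_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Output tokens dominate for long generations:
cost = request_cost("frontier-model", input_tokens=1_000, output_tokens=4_000)
# 1,000 × $3/M + 4,000 × $15/M = $0.003 + $0.060 = $0.063
```

Note how the 4,000 output tokens account for 95% of the request's cost — this is why long-response features have a fundamentally different cost profile.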

Cost per request is the metric most teams overlook. A request can succeed with low latency and still cost 40x more than necessary because the wrong model was selected. Monitoring cost per request — broken down by model, feature, and customer — is what separates LLM monitoring from general application monitoring. It is the one metric that directly affects your margin.
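Breaking cost down by dimension is just an aggregation over logged requests. The record schema below is hypothetical — the point is that each logged request must carry the dimensions (customer, feature, model) you later want to slice by:

```python
from collections import defaultdict

# Hypothetical request log: each entry carries the dimensions to slice by.
requests = [
    {"customer": "acme", "feature": "chat",      "cost": 0.063},
    {"customer": "acme", "feature": "summarize", "cost": 0.004},
    {"customer": "bob",  "feature": "chat",      "cost": 0.021},
]

def cost_by(dimension, rows):
    """Sum cost grouped by one dimension of the request log."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[dimension]] += r["cost"]
    return dict(totals)

by_customer = cost_by("customer", requests)
by_feature  = cost_by("feature", requests)
```

If a dimension isn't attached at logging time, it cannot be recovered later — which is why cost attribution has to be designed into the instrumentation, not bolted on.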

Model drift refers to changes in model behavior over time. Providers update models, deprecate versions, and adjust rate limits without notice. Monitoring for sudden changes in latency, token usage, or error rates after a provider-side update helps you detect drift before it impacts quality or cost. If your average output token count jumps 30% overnight on the same prompts, something changed on the provider side.
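A naive baseline comparison is enough to catch a jump like that. The threshold and field names below are illustrative choices, not a prescribed drift-detection method:

```python
def drift_alert(baseline_avg, recent_avg, threshold=0.30):
    """Flag when a tracked metric (e.g. avg output tokens) shifts past threshold."""
    change = (recent_avg - baseline_avg) / baseline_avg
    return change, abs(change) > threshold

# Same prompts, but average output tokens moved from 800 to 1,100:
change, alert = drift_alert(baseline_avg=800, recent_avg=1100)
# change = 0.375 → the alert fires on a 37.5% jump
```

In practice the baseline would be a rolling window rather than a fixed number, but the comparison is the same: a sudden shift on unchanged prompts points at a provider-side change.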

LLM Monitoring Approaches Compared

| Capability | APM Tools (Datadog, New Relic) | LLM Observability (Helicone, Langfuse) | Cost Monitoring (MarginDash) |
| --- | --- | --- | --- |
| Latency tracking | Yes | Yes | No |
| Error rate monitoring | Yes | Yes | No |
| Token-based cost calculation | No | Yes | Yes |
| Per-customer cost attribution | No | No | Yes |
| Revenue / margin tracking | No | No | Yes |
| Prompt tracing / debugging | No | Yes | No |
| Cost simulator | No | No | Yes |
| Budget alerts | Custom | Yes | Yes |
| Pricing database | No | No | Yes |

These approaches are complementary, not mutually exclusive. APM tools handle infrastructure-level monitoring. LLM observability tools handle prompt-level debugging. Cost monitoring tools handle unit economics. Most production deployments benefit from at least two.

Why Cost is the Most Overlooked Monitoring Metric

Most monitoring setups track latency and errors but ignore cost entirely. With traditional APIs, the cost per request is effectively zero. With LLM APIs, every request has a variable cost determined by the model, the token count, and the provider's pricing. Cost can spike without any errors, without any latency increase, and without any alerts firing.

The most common cost spikes happen when a feature sends longer prompts than expected, when retry logic re-sends expensive requests on transient failures, or when a customer triggers high-token responses through normal usage. None of these produce errors. The API returns 200 OK. Your existing monitoring says everything is healthy. Your bill says otherwise.

For teams that resell AI features to customers, cost monitoring becomes a unit economics problem. You need to know cost per customer — not just aggregate cost. Without per-customer cost monitoring, you cannot identify which customers are profitable and which are underwater. MarginDash's cost simulator reprices your actual token usage against every model in the pricing database and ranks alternatives by intelligence-per-dollar.

Setting Up Alerts and Budgets

Monitoring without alerts is just a dashboard you check when something already broke. The value of LLM monitoring is catching problems before they become expensive. Budget alerts let you define spending thresholds and get notified before they are exceeded — not after.

Per-customer budgets protect you from individual customers consuming disproportionate resources. If you charge a flat $49/month and one customer is burning $40 in API calls, you want to know immediately — not at the end of the billing cycle. Set a threshold at, say, 60% of their subscription price and get alerted when cost approaches the break-even point.

Per-feature budgets help you track which AI-powered features are the most expensive. If your chat feature costs 5x more than your summarization feature per request, that information affects product decisions — pricing tiers, usage limits, model selection. Feature-level budgets surface these patterns automatically.

Organization-wide budgets are the safety net. Set a monthly spending cap across all customers and features. If total LLM spend exceeds a threshold — or is trending toward exceeding it — you get notified with time to react. This is especially important early in a product launch when usage patterns are unpredictable.
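The three budget levels above reduce to simple threshold checks. The 60% warn ratio and the linear month-end projection below are illustrative choices for the sketch, not MarginDash's actual alerting logic:

```python
def check_customer_budget(monthly_cost, subscription_price, warn_ratio=0.60):
    """Alert when a customer's API cost approaches their subscription price."""
    ratio = monthly_cost / subscription_price
    if ratio >= 1.0:
        return "underwater"   # API cost exceeds what the customer pays
    if ratio >= warn_ratio:
        return "warn"         # e.g. $40 of API spend on a $49/month plan
    return "ok"

def projected_month_spend(spend_to_date, day_of_month, days_in_month=30):
    """Naive linear projection for 'trending toward the cap' org-wide alerts."""
    return spend_to_date / day_of_month * days_in_month

status = check_customer_budget(monthly_cost=40.0, subscription_price=49.0)
projection = projected_month_spend(500.0, day_of_month=10)  # $1,500 projected
```

The $49 customer burning $40 in API calls from the example above lands in the "warn" band, and a $500 spend ten days into the month projects to $1,500 — both catch the problem before the billing cycle closes.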

Monitoring vs. Observability: Different Problems, Different Tools

Monitoring answers: is it working? It tracks quantitative metrics — latency, error rates, cost, throughput — and alerts you when something deviates from normal. Monitoring is about detection. You define what "healthy" looks like and get told when the system drifts from that baseline.

Observability answers: why isn't it working? It provides the tools to investigate — prompt traces, input/output logging, evaluation scoring, chain visualization. Observability is about diagnosis. When monitoring tells you that latency spiked at 3pm, observability tools let you drill into the specific requests that caused it and understand why.

In practice, most teams need both. A monitoring tool tells you that cost per request increased 40% this week. An observability tool tells you that a prompt template change caused longer outputs. A monitoring tool tells you that a specific customer is unprofitable. An analytics tool tells you which features they use most and what model swap would fix the margin.

The overlap between these categories is growing. Some observability tools now include basic cost tracking. Some monitoring tools are adding trace-level detail. But the core distinction remains: monitoring is ongoing and automated (dashboards, alerts, trends), while observability is ad-hoc and investigative (traces, logs, evaluations). Choose your tools based on which problem you face most often.

Monitor LLM Costs in Production

MarginDash tracks cost per request, cost per customer, and margin across 400+ models from OpenAI, Anthropic, Google, and more. Set budget alerts per customer, per feature, or org-wide. Use the cost simulator to find cheaper models without sacrificing quality.

Start Monitoring LLM Costs →

No credit card required


Frequently Asked Questions

What is LLM monitoring?
LLM monitoring is the continuous observation of large language model deployments in production. It tracks key operational metrics — latency, error rates, token usage, and cost per request — to ensure AI-powered features are performing reliably and within budget. Unlike debugging or evaluation tools, monitoring focuses on ongoing production health rather than development-time analysis.
What metrics should I monitor for LLMs in production?
The most important metrics are latency (p50, p95, and p99 response times), error rates (HTTP failures, timeouts, rate limits), token usage per request, cost per request, and cost per customer. Cost is the most commonly overlooked metric — a request can succeed with low latency and still cost 40x more than necessary because of the model selection.
How is LLM monitoring different from LLM observability?
Monitoring answers 'is it working?' — it tracks metrics, detects anomalies, and fires alerts. Observability answers 'why isn't it working?' — it provides prompt traces, evaluation scores, and debugging tools. You need both, but they solve different problems. Monitoring tools like MarginDash focus on cost and operational health. Observability tools like Langfuse and LangSmith focus on prompt-level debugging and evaluation.
How do I set up cost monitoring for LLM API calls?
With MarginDash, you add a few lines of SDK code (TypeScript or Python) after each API call to log the model name, token counts, and customer ID. The SDK never sees your prompts or responses. Cost is calculated server-side using a pricing database that covers 400+ models and updates daily. Setup takes about 5 minutes and you see cost data immediately.
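As a rough illustration of the shape of such a call — this is a hypothetical sketch, not the actual MarginDash SDK API; the endpoint and field names are placeholders:

```python
import json
import urllib.request

def log_usage(api_key, model, input_tokens, output_tokens, customer_id):
    """Hypothetical post-call usage log; real SDK names/endpoint will differ."""
    payload = {
        "model": model,                 # e.g. the model name from the response
        "input_tokens": input_tokens,   # from the provider's usage object
        "output_tokens": output_tokens,
        "customer_id": customer_id,     # your identifier, for cost attribution
        # Note: no prompt or completion text is included — metadata only.
    }
    return urllib.request.Request(
        "https://example.invalid/v1/usage",   # placeholder endpoint
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    # Caller would urlopen() the request; omitted to keep the sketch offline.
```

The key property matches the answer above: only the model name, token counts, and customer ID leave your infrastructure, never the prompt or response content.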
Can I set up budget alerts for LLM costs?
Yes. MarginDash lets you set budget thresholds per customer, per feature, or across your entire organization. You receive an email before spending exceeds the threshold — so you catch cost spikes before the monthly bill arrives. This is especially important for customers on flat-rate pricing, where a single heavy user can consume your entire margin.
Do I need a separate tool for LLM monitoring if I already use Datadog or New Relic?
APM tools like Datadog and New Relic can track latency and error rates for LLM API calls, but they have no concept of token-based pricing or per-model cost structures. A 200 OK response from GPT-4o and one from a model costing 1/40th as much look identical to an APM. LLM-specific monitoring tools add the cost dimension — cost per request, cost per customer, and cost optimization recommendations.

Stop guessing. Start measuring.

Create an account, install the SDK, and see your first margin data in minutes.

See My Margin Data

No credit card required