Metrics & Instrumentation for Rate Limiting
A rate limiter you cannot observe is a rate limiter you cannot tune, and this guide sits under the Observability & Operations area because the limiter is one of the few components on the hot path whose job is to reject traffic — which means its metrics look alarming by design. The whole problem of instrumenting a limiter is separating the signal you want (abuse being shed, quotas being enforced) from the signal you fear (legitimate users blocked, the store failing, the limiter failing open and enforcing nothing). You solve that by measuring the right five things, with the right metric types, under labels that stay bounded as traffic grows.
This guide defines the metric vocabulary, gives a recommended metric set with concrete label schemas, walks through instrumenting a limiter end to end, and covers the one mistake that takes a metrics pipeline down faster than any traffic spike: unbounded label cardinality. Two child guides go deeper — one on the exact Prometheus metrics for rate limiting you define in code, and one on building Grafana rate limit dashboards from those metrics.
The five things worth measuring
Every other dashboard panel and alert in this area is built from a small, stable set of signals. Measure these and you can answer almost any operational question about the limiter; skip them and you are debugging blind.
- Allowed vs blocked request counts. The single most important signal: how many requests the limiter let through and how many it rejected with 429. The ratio of these, the block rate, is what dashboards and alerts ultimately watch. Split by decision so you can compute blocked / (allowed + blocked) directly.
- Limit utilization (remaining / limit). How close clients run to their ceiling before they get rejected. A fleet sitting at 95% utilization is one traffic bump away from mass 429s; a fleet at 10% has headroom. This is a gauge or a histogram of the X-RateLimit-Remaining you already compute.
- Limiter decision latency. How long the allow/deny decision itself takes. For an in-memory check this is microseconds; for a Redis-backed check it is the round-trip, typically 0.2–1 ms intra-AZ. Latency that climbs is your earliest warning that the store is struggling.
- Store (Redis) errors and latency. Connection errors, timeouts, and command latency against the backing store. A limiter is only as available as the store it consults, so store health is limiter health. The Redis counter architecture is the dependency you are monitoring here.
- Fail-open events. Every time the limiter could not reach the store and let the request through anyway (or rejected it, if you fail closed). This is the most dangerous blind spot: when the limiter fails open, block counts drop to zero and a naïve dashboard looks healthy while no limiting is happening at all. Count fail-open decisions explicitly so you can alert on them.
Choosing the metric type
Prometheus offers three instrument types that matter here. Picking the wrong one either loses information or wastes storage.
| Metric type | Use it for | Why | Example |
|---|---|---|---|
| Counter | Things that only go up: requests allowed, requests blocked, store errors, fail-open events | Monotonic; you take rate() over it to get per-second throughput and ratios | ratelimit_requests_total{decision="blocked"} |
| Histogram | Distributions you need quantiles of: decision latency, store latency, utilization | Buckets let you compute p50/p99 with histogram_quantile() across the fleet | ratelimit_decision_duration_seconds |
| Gauge | Instantaneous values that go up and down: current remaining tokens, in-flight requests, configured limit | Last-write-wins sample; not aggregatable the way counters are | ratelimit_remaining{tier="pro"} |
The trap is reaching for a gauge to track block rate. A gauge you set to “current blocks per second” loses everything between scrapes and cannot be summed across instances. Use a counter and let rate() do the work — that is the whole reason counters exist.
There is a fourth Prometheus type, the summary, that you should almost always decline for limiter latency. A summary computes quantiles client-side and ships them pre-baked, which means you cannot aggregate them across instances — a p99 from each of twenty pods does not average into a fleet p99. A histogram ships raw bucket counts, and histogram_quantile(0.99, sum(rate(..._bucket[5m])) by (le)) reconstructs a true fleet-wide p99 across every instance. For a limiter, where the question is almost always “how slow is the decision across the fleet,” the histogram is the only type that answers it correctly. Reserve summaries for the rare case where you need an exact quantile from a single process and will never aggregate.
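To make the aggregation argument concrete, here is a minimal sketch of the merge-then-estimate idea behind histogram_quantile(). It is an illustration, not Prometheus's implementation: the Bucket type and both function names are hypothetical, every instance is assumed to share the same bucket boundaries, and the linear interpolation inside a bucket is a simplification.

```typescript
// Cumulative bucket: `count` observations were <= `le` seconds.
type Bucket = { le: number; count: number };

// Merge per-instance histograms by adding counts bucket by bucket.
// Assumes every instance uses identical bucket boundaries.
function mergeBuckets(instances: Bucket[][]): Bucket[] {
  const merged = new Map<number, number>();
  for (const buckets of instances) {
    for (const b of buckets) {
      merged.set(b.le, (merged.get(b.le) ?? 0) + b.count);
    }
  }
  return [...merged.entries()]
    .map(([le, count]) => ({ le, count }))
    .sort((a, b) => a.le - b.le);
}

// Estimate quantile q (0..1) by linear interpolation inside the bucket
// that contains the target rank. Expects cumulative buckets sorted by `le`.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const last = buckets[buckets.length - 1];
  if (last === undefined || last.count === 0) return NaN;
  const rank = q * last.count;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      if (!Number.isFinite(b.le)) return prevLe; // rank landed in the +Inf bucket
      const inBucket = b.count - prevCount;
      if (inBucket === 0) return b.le;
      return prevLe + (b.le - prevLe) * ((rank - prevCount) / inBucket);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}
```

Merging first and estimating second is the point: the fleet p99 comes from the combined counts, which is something two pre-computed per-pod p99 values cannot give you.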
A subtler choice is when to model utilization as a gauge versus a histogram. A gauge of remaining answers “what is the latest reading” and is enough for a single panel, but it cannot tell you how the fleet is distributed: a gauge averaging 50% utilization is consistent with everyone at 50% or with half the fleet at 0% and half pinned at 100% — operationally opposite situations. When you need to see that distribution — to catch the cohort of keys that is one bump away from mass 429s while the average looks calm — record utilization as a histogram and read it as a heatmap. Use the gauge for the cheap latest-value panel; reach for the histogram the moment “the average is fine but someone is hurting” becomes a real question.
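The gap between "same average" and "same situation" is easy to show with a few numbers. The sketch below uses hypothetical helper names and made-up sample data (four keys per fleet, a limit of 100 each, illustrative bucket bounds), and computes utilization as 1 - remaining/limit.

```typescript
// Utilization at decision time: 1 - remaining/limit, clamped to [0, 1].
function utilization(remaining: number, limit: number): number {
  if (limit <= 0) return 0; // hypothetical guard for an unconfigured limit
  return Math.min(1, Math.max(0, 1 - remaining / limit));
}

// Cumulative histogram: how many readings are <= each upper bound.
function bucketize(values: number[], bounds: number[]): number[] {
  return bounds.map((le) => values.filter((v) => v <= le).length);
}

function mean(values: number[]): number {
  return values.reduce((a, b) => a + b, 0) / values.length;
}

// Two fleets described by remaining tokens per key (illustrative data).
const calm = [50, 50, 50, 50].map((r) => utilization(r, 100));
const split = [100, 100, 0, 0].map((r) => utilization(r, 100));
const bounds = [0.25, 0.5, 0.75, 1];

const calmBuckets = bucketize(calm, bounds);   // everyone sits in the middle
const splitBuckets = bucketize(split, bounds); // half idle, half pinned at the ceiling
```

Both fleets have the same mean utilization, so an averaged gauge cannot tell them apart, while the bucket counts differ: in the split fleet half the keys only appear in the top bucket.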
Recommended metric set
This is the concrete set to emit. Names follow Prometheus conventions (_total suffix on counters, _seconds on duration, base unit, no units in the value). Every name is prefixed ratelimit_ so it groups cleanly and never collides with framework metrics.
| Metric name | Type | Labels | Meaning |
|---|---|---|---|
| ratelimit_requests_total | counter | route, method, tier, decision | Every limiter decision; decision ∈ allowed/blocked. The block ratio comes from here. |
| ratelimit_remaining | gauge | route, tier, key_class | Tokens/quota left at last decision. Drives utilization panels. |
| ratelimit_limit | gauge | route, tier | Configured ceiling for the window. Pair with remaining for % utilization. |
| ratelimit_decision_duration_seconds | histogram | route, decision | Wall-clock time of the allow/deny decision, including store round-trip. |
| ratelimit_store_duration_seconds | histogram | op | Latency of the backing-store command (eval, incr, get). |
| ratelimit_store_errors_total | counter | op, error | Store failures by operation and error class (timeout, conn, script). |
| ratelimit_fail_open_total | counter | route, reason | Requests admitted because the store was unreachable. Alert on any rate > 0. |
| ratelimit_utilization_ratio | histogram | route, tier | 1 - remaining/limit at decision time. Heatmap reveals cohorts pinned near the ceiling. |
| ratelimit_retry_after_seconds | histogram | route | The Retry-After value handed back on a 429. Spikes mean clients are being told to wait longer. |
| ratelimit_config_reload_total | counter | result | Limit/policy config reloads, result ∈ ok/error. Correlate a reject-ratio change with a config change. |
| ratelimit_active_keys | gauge | tier | Approximate distinct keys seen in the window (an estimate, not a per-key label). Capacity signal. |
Tier and route give you the slices operators actually want: “is the free tier getting hammered on /search?” without exploding cardinality. Note ratelimit_active_keys is a single gauge per tier holding an estimated count — it is the safe way to track “how many keys are active” without the catastrophe of a per-key label, and it is usually fed from a probabilistic counter (HyperLogLog) maintained in the store rather than an exact set.
A few of these earn a word on when to add them. The first seven rows above are the baseline every limiter should emit from day one. ratelimit_utilization_ratio becomes valuable once you have tiers and want to forecast saturation before it bites. ratelimit_config_reload_total pays for itself the first time a bad deploy lowers a quota and you need to prove the reject spike began at the reload, not at a traffic change. ratelimit_retry_after_seconds matters when you compute dynamic Retry-After values and want to confirm clients are not being told to back off for absurd durations.
Label design and cardinality control
Cardinality — the number of distinct label-value combinations — is the one thing that turns a metrics pipeline from an asset into an outage. Each unique combination is a separate time series Prometheus must store, index, and scan. A handful of labels with a few values each is cheap; one label with millions of values will exhaust memory on the Prometheus server and on every scraped process.
The cardinality of a metric is the product of its labels’ cardinalities. ratelimit_requests_total{route, method, tier, decision} with 40 routes × 5 methods × 4 tiers × 2 decisions is 1,600 series — trivial. Add one bad label and the math changes catastrophically.
The cardinal sin: never label by raw API key or client identifier. A label like api_key="acct_8f3a92…" or user_id or ip has unbounded cardinality — it grows with your customer base. With 200,000 active keys, that single label multiplies every metric it touches by 200,000. This is the most common way teams take down their own monitoring.
Safe label design follows a few rules:
- Label by key_class, not by key. Bucket clients into a small fixed set (anonymous, free, pro, enterprise, internal) and label by that. You keep the slice you care about (which tier is being throttled) without unbounded growth.
- Use route templates, not raw paths. Label route="/v1/users/:id", never route="/v1/users/8f3a92". Raw path parameters are unbounded the same way keys are. Most frameworks expose the matched route template.
- Keep decision, method, tier low-cardinality by construction. These are enumerations with a handful of values. They are safe to multiply.
- Pre-aggregate the high-cardinality questions. “Which specific key is over limit?” is a logs/events question, not a metrics question. Emit a structured log line for the offending key and keep the metric labelled by class. Dashboards answer “which key class / route”, logs answer “which exact key”.
- Bound histogram buckets. Each histogram bucket is also a series (per label combo). Ten well-chosen buckets are plenty; do not define forty.
A useful sanity check: estimate series = Σ over metrics of (∏ label cardinalities × buckets). If any single metric exceeds a few thousand series, find the unbounded label — it is almost always an identifier that slipped in.
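The sanity check is simple enough to script. The sketch below is one way to do it, with hypothetical type and function names; it refines the formula slightly by assuming each histogram label combination emits one series per configured bucket plus the +Inf bucket, _sum and _count.

```typescript
type MetricShape = {
  name: string;
  labelCardinalities: number[]; // distinct values per label
  buckets?: number;             // configured histogram buckets; omit for counters and gauges
};

// Series for one metric: product of label cardinalities times series per combination.
function seriesFor(m: MetricShape): number {
  const combos = m.labelCardinalities.reduce((a, b) => a * b, 1);
  // Assumption: histogram = configured buckets + the +Inf bucket + _sum + _count.
  const perCombo = m.buckets === undefined ? 1 : m.buckets + 3;
  return combos * perCombo;
}

function totalSeries(metrics: MetricShape[]): number {
  return metrics.reduce((sum, m) => sum + seriesFor(m), 0);
}

// The bounded schema from the text: 40 routes x 5 methods x 4 tiers x 2 decisions.
const bounded: MetricShape = { name: "ratelimit_requests_total", labelCardinalities: [40, 5, 4, 2] };
// The same metric after a raw api_key label slips in (200,000 active keys).
const withApiKey: MetricShape = { ...bounded, labelCardinalities: [40, 5, 4, 2, 200_000] };
// Decision latency: 40 routes x 2 decisions, 8 configured buckets.
const latency: MetricShape = {
  name: "ratelimit_decision_duration_seconds",
  labelCardinalities: [40, 2],
  buckets: 8,
};
```

With these inputs the bounded counter is 1,600 series and the latency histogram 880, while the api_key variant works out to 320 million, which is the difference between a rounding error and an outage.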
Made concrete, the difference between a label that costs you nothing and one that takes down Prometheus is rarely subtle once you see them side by side:
| Bad label (unbounded) | Why it explodes | Good label (bounded) | Bounded count |
|---|---|---|---|
| api_key="acct_8f3a92e1" | One series per customer; grows with signups forever | key_class="enterprise" | ~5 classes |
| route="/v1/users/8f3a92" | One series per path parameter value | route="/v1/users/:id" | ~40 templates |
| client_ip="203.0.113.7" | Effectively unbounded; one series per visitor | ip_version="v4" or omit | ~2 (or drop) |
| user_agent="curl/8.4.0 ..." | Thousands of distinct UA strings | client_kind="sdk"/browser/bot | ~4 buckets |
| error="ECONNREFUSED at :6379 ..." | Free-text message; near-infinite variants | error="conn"/timeout/script | ~4 classes |
| region="us-east-1a-rack42-host09" | Per-host explosion | region="us-east-1" | ~10 regions |
The pattern across every row is the same: the bad label encodes identity (which key, which IP, which exact host, which raw message), and the good label encodes a class the identity falls into. The fix is never “drop the metric” — it is “bucket the identity into a fixed enumeration and put the raw identity in a log instead.” When a stack trace or a raw key genuinely needs to be findable, that is a structured log line keyed by trace ID, not a metric label.
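Bucketing identity into a class is usually a few lines of code at the point where the label value is chosen. The following is a rough sketch with hypothetical function names; the matching patterns are illustrative assumptions, not an exhaustive list of what a Redis client or a real user-agent population produces.

```typescript
type ErrorClass = "timeout" | "conn" | "script" | "other";

// Map a free-text store error onto a fixed enumeration before it becomes a label.
function classifyStoreError(err: unknown): ErrorClass {
  const msg = err instanceof Error ? `${err.name} ${err.message}` : String(err);
  if (/ETIMEDOUT|timed out|timeout/i.test(msg)) return "timeout";
  if (/ECONNREFUSED|ECONNRESET|EPIPE|connection/i.test(msg)) return "conn";
  if (/NOSCRIPT|script/i.test(msg)) return "script";
  return "other"; // anything unrecognized shares one series
}

type ClientKind = "sdk" | "browser" | "bot" | "other";

// Map a raw User-Agent header onto a handful of client kinds.
// Bots are checked first because many crawler strings also contain "Mozilla".
function classifyUserAgent(ua: string | undefined): ClientKind {
  if (!ua) return "other";
  if (/bot|crawler|spider/i.test(ua)) return "bot";
  if (/Mozilla\//.test(ua)) return "browser";
  if (/^(curl|python-requests|okhttp|Go-http-client|node-fetch)/i.test(ua)) return "sdk";
  return "other";
}
```

The important property is the closed return type: whatever arrives at runtime, the label can only ever take one of four values, and the raw string stays available for the log line.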
One more failure pattern deserves naming because it hides: a label that is bounded today but unbounded over time. version="v1.4.2-rc3" looks like a small enumeration, but every deploy adds a new value and old series never disappear, so over months the cardinality creeps up unbounded. The same applies to feature_flag, experiment, or any label whose value space grows with releases. Treat slowly-unbounded labels with the same suspicion as obviously-unbounded ones; the only difference is how long the outage takes to arrive.
Instrumentation walkthrough
The mechanics are the same in any language: create the instruments once at startup, then update them on the decision path and expose them on an HTTP endpoint the scraper pulls.
// Node + prom-client: define instruments once, update on every decision.
import { Counter, Histogram, Gauge, register } from "prom-client";

const requests = new Counter({
  name: "ratelimit_requests_total",
  help: "Limiter decisions by route, tier and outcome",
  labelNames: ["route", "method", "tier", "decision"] as const,
});

const decisionDuration = new Histogram({
  name: "ratelimit_decision_duration_seconds",
  help: "Time to make an allow/deny decision incl. store round-trip",
  labelNames: ["route", "decision"] as const,
  // Buckets span sub-ms local checks through slow Redis round-trips.
  buckets: [0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1],
});

const failOpen = new Counter({
  name: "ratelimit_fail_open_total",
  help: "Requests admitted because the store was unreachable",
  labelNames: ["route", "reason"] as const,
});

// On the hot path: time the decision, record the outcome, classify by template.
export async function limited(
  route: string,
  method: string,
  tier: string,
  check: () => Promise<{ ok: boolean }>,
) {
  const end = decisionDuration.startTimer({ route }); // labels finalized at end()
  try {
    const { ok } = await check();
    const decision = ok ? "allowed" : "blocked";
    requests.inc({ route, method, tier, decision }); // route is a TEMPLATE, never a raw path
    end({ decision });
    return ok;
  } catch (err) {
    // Store unreachable: fail open and COUNT it so the gap is visible.
    failOpen.inc({ route, reason: (err as Error).name });
    end({ decision: "allowed" });
    return true;
  }
}
Three details earn their keep here: the histogram timer is started before the check and finalized with the decision label only once the outcome is known; route is always the matched template; and the catch block increments ratelimit_fail_open_total rather than silently letting traffic through — that counter is the only thing standing between you and an invisible outage.
The exposition endpoint and scrape configuration are deliberately boring:
// Expose for Prometheus to pull. Keep /metrics off the public router.
import express from "express";
import { register } from "prom-client"; // the default registry the instruments above registered with

const app = express();

app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});
The full per-language setup — including the Python prometheus_client equivalent, the exact bucket choices, and the scrape config — is in Prometheus metrics for rate limiting.
Exemplars: bridging the metric to the trace
The cardinality discipline above buys safety at a cost: your metrics deliberately forget identity, so when a histogram shows p99 decision latency spiking you can see that it is slow but not which request was slow. Exemplars close that gap without reintroducing high-cardinality labels. An exemplar is a single example data point — a trace ID and a value — attached to a histogram bucket observation. The bucket count stays a bounded aggregate; the exemplar rides alongside it as a pointer to one concrete request that landed in that bucket. Click the spike on a latency panel, follow the exemplar’s trace ID into your tracing backend, and you are looking at the exact slow decision, store round-trip and all.
// prom-client supports exemplars on histograms created with enableExemplars: true
// and served from a registry set to the OpenMetrics content type.
// Attach the active trace id so a slow bucket links to its trace.
import { trace } from "@opentelemetry/api";

const span = trace.getActiveSpan();
const traceId = span?.spanContext().traceId;

// Observe with an exemplar: the bucket count is still bounded-cardinality,
// but THIS observation carries a pointer to a real request.
// With exemplars enabled, observe() takes a single object argument.
decisionDuration.observe({
  labels: { route, decision },
  value: elapsedSeconds,
  exemplarLabels: traceId ? { trace_id: traceId } : undefined,
});
Exemplars are the correct answer to “I want to know which request” precisely because they are not labels: a label trace_id="…" would mint a series per request and detonate the pipeline, whereas an exemplar is a sampled annotation that Prometheus stores in a separate, bounded sidecar store. This is the same identity/aggregate split the whole guide turns on — keep the metric a bounded aggregate, and let the exemplar (like a log) carry the high-cardinality pointer. Enable them with --enable-feature=exemplar-storage on Prometheus and surface them in Grafana’s panel options; the limiter’s decision-latency histogram is the single highest-value place to wire them up.
Distributed and scaling considerations
In a multi-node deployment every process exposes its own /metrics and Prometheus scrapes each independently, so counters are per-instance. You sum across instances at query time with sum(rate(ratelimit_requests_total[5m])), not by trying to share a registry. This is correct and cheap, but it has two consequences. First, restarts reset counters; rate() and increase() handle counter resets natively, so this is fine as long as you query with rate functions and never with raw deltas. Second, the instance label is added by Prometheus automatically, so keep it out of your application labels: by default a clashing application label is renamed to exported_instance at scrape time, which leaves two competing notions of "instance" in your queries.
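The reset handling is worth seeing once, because it explains why raw deltas go wrong on every deploy. This is a simplified sketch of the idea only: the function names are hypothetical, and real rate()/increase() also extrapolate to the window edges, which is omitted here.

```typescript
// Increase across a series of counter samples from ONE instance.
// Any drop is treated as a restart: the counter began again from zero.
function increaseWithResets(samples: number[]): number {
  let total = 0;
  let prev: number | undefined;
  for (const s of samples) {
    if (prev !== undefined) {
      total += s >= prev ? s - prev : s; // after a reset, count up from zero
    }
    prev = s;
  }
  return total;
}

// Fleet-wide increase: compute per instance first, then sum,
// mirroring sum(rate(...)) rather than rate(sum(...)).
function fleetIncrease(perInstance: number[][]): number {
  return perInstance.reduce((sum, samples) => sum + increaseWithResets(samples), 0);
}

// Illustrative data: pod A restarts between the 2nd and 3rd scrape.
const podASamples = [100, 150, 20, 60];
const podBSamples = [0, 10, 30];
const naiveDelta = podASamples[podASamples.length - 1]! - podASamples[0]!; // negative after a restart
```

The order matters: applying the reset logic per instance before summing is what keeps one pod's restart from looking like a fleet-wide drop.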
For limiters fronted by many ephemeral workers (serverless, autoscaled pods), per-instance scraping breaks down because instances come and go faster than the scrape interval. There, push the metrics through an aggregating gateway or emit them as events and aggregate downstream — but the label discipline above does not change. Cardinality is a property of the data, not the transport.
When cardinality blows up at scale
Cardinality problems rarely announce themselves at design time; they detonate when a label that was bounded in staging meets production’s full key space. The classic limiter blowup is the one already named — a key, ip, or raw route label that looked fine against a hundred test accounts and mints a million series against the real customer base. The symptoms are diagnostic: the scraped process’s /metrics payload balloons from kilobytes to megabytes, scrape duration climbs past the interval and Prometheus starts dropping scrapes, and the Prometheus server’s heap grows until it OOM-kills and loses recent data. By the time the server falls over, the very metrics you would use to diagnose the incident are the ones that took it down.
Guard against it before it ships. Cap the number of distinct label values per metric, using your client library's cardinality limit where it has one or a small wrapper where it does not, and have overflow values collapse into an other bucket rather than minting new series without bound. On the server side, a per-job sample_limit in the Prometheus scrape config is a backstop: a scrape that returns more samples than the limit is rejected instead of ingested. Run count by (__name__)({__name__=~"ratelimit_.*"}) periodically and alert if any limiter metric crosses a few thousand series; this tripwire catches a slowly-unbounded label long before it is fatal. And review new labels in code review with one question: “what is the maximum number of distinct values this can take in production, ever?” If the answer is “grows with users/requests/deploys,” it belongs in a log.
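Where the metrics library offers no built-in cap, the overflow-to-other behaviour can be approximated in application code. The class below is a hypothetical helper written for illustration, not part of prom-client or any other library, and it keeps its set of seen values in process memory for the life of the process.

```typescript
// Caps the number of distinct values one label may take.
// Values seen before the cap is reached pass through; later new values
// collapse into a single overflow value.
class LabelValueCap {
  private readonly seen = new Set<string>();
  private readonly max: number;
  private readonly overflow: string;

  constructor(max: number, overflow: string = "other") {
    this.max = max;
    this.overflow = overflow;
  }

  bound(value: string): string {
    if (this.seen.has(value)) return value;
    if (this.seen.size < this.max) {
      this.seen.add(value);
      return value;
    }
    return this.overflow;
  }

  // Number of distinct values admitted so far (never exceeds max).
  get size(): number {
    return this.seen.size;
  }
}
```

A call site would pass every candidate label value through bound() before handing it to the counter, for example `routeCap.bound(route)` with a hypothetical `routeCap` created once at startup. The cap does not fix the underlying label, but it turns an unbounded explosion into one extra series and a visible signal that something needs bucketing.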
Recording rules and federation
At fleet scale the expensive part shifts from storage to query. A dashboard that computes sum(rate(ratelimit_requests_total{decision="blocked"}[5m])) by (tier) / sum(rate(ratelimit_requests_total[5m])) by (tier) over thousands of series on every refresh is slow and re-evaluates the same heavy aggregation for every viewer. Recording rules precompute these once per evaluation interval and store the result as a new, low-cardinality series:
# prometheus rules: precompute the block ratio per tier once,
# so dashboards and alerts read a cheap series instead of re-aggregating.
groups:
  - name: ratelimit_aggregates
    interval: 30s
    rules:
      - record: tier:ratelimit_block_ratio:rate5m
        expr: |
          sum(rate(ratelimit_requests_total{decision="blocked"}[5m])) by (tier)
          /
          sum(rate(ratelimit_requests_total[5m])) by (tier)
      - record: route:ratelimit_decision_p99_seconds:5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(ratelimit_decision_duration_seconds_bucket[5m])) by (le, route))
Recording rules also stabilize alerting: a burn-rate alert that reads a precomputed ratio series is cheaper and more deterministic than one re-deriving it under load. The naming convention level:metric:operation (here tier:ratelimit_block_ratio:rate5m) signals at a glance what was aggregated away.
Federation enters when you run Prometheus per region or per cluster and need a global view. Each local Prometheus keeps the full-fidelity, high-volume series; a global Prometheus scrapes only the recording-rule outputs from each local one via /federate. This keeps the cross-region rollup small — you are federating a handful of pre-aggregated tier/route ratios, not millions of raw series. The discipline that makes this work is, again, the recording rule: federate aggregates, never raw counters, or the global instance inherits every local cardinality problem at once.
Failure modes and what each metric catches
| Failure mode | The metric that catches it | Symptom in the data |
|---|---|---|
| Misconfigured limit (too tight) | ratelimit_requests_total{decision="blocked"} | Block ratio jumps for legitimate tiers, utilization pinned at 100% |
| Store outage | ratelimit_store_errors_total, ratelimit_fail_open_total | Error counter climbs, fail-open counter > 0, block ratio drops to ~0 |
| Limiter latency creep | ratelimit_decision_duration_seconds p99 | p99 climbs from sub-ms into tens of ms |
| Silent fail-open | ratelimit_fail_open_total | Blocks fall to zero while traffic is steady; looks “healthy” without this counter |
| Abuse / scraping | ratelimit_requests_total{decision="blocked"} by key_class | Block ratio spikes concentrated in anonymous/free classes |
The pattern worth internalizing: a drop in block rate is as suspicious as a spike. A spike usually means abuse is being shed correctly; a drop to zero often means the limiter stopped working. Only the fail-open and store-error counters let you tell those apart, which is exactly why they are non-negotiable.
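The spike-versus-drop reasoning can be written down as a small decision function. This is a sketch for illustration only: the type, the verdict names, and the thresholds (a doubling over baseline and a 5% floor) are assumptions, not recommended alert values.

```typescript
type LimiterWindow = {
  blockRatio: number;         // blocked / (allowed + blocked) in the current window
  baselineBlockRatio: number; // the same ratio over a longer, normal period
  failOpenPerSec: number;     // rate of ratelimit_fail_open_total
  storeErrorsPerSec: number;  // rate of ratelimit_store_errors_total
};

type Verdict = "failing-open" | "store-degraded" | "block-spike" | "unexplained-drop" | "steady";

// Check the counters that disambiguate a quiet dashboard BEFORE looking at the ratio.
function assess(w: LimiterWindow): Verdict {
  if (w.failOpenPerSec > 0) return "failing-open";
  if (w.storeErrorsPerSec > 0) return "store-degraded";
  // Hypothetical thresholds: more than double the baseline and above 5%.
  if (w.blockRatio > w.baselineBlockRatio * 2 && w.blockRatio > 0.05) return "block-spike";
  if (w.baselineBlockRatio > 0 && w.blockRatio === 0) return "unexplained-drop";
  return "steady";
}
```

Note that a block ratio of zero yields two different verdicts depending on the fail-open counter, which is the whole argument for emitting it.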
There is also a class of failure in the instrumentation itself that no limiter metric will catch, because the metric is the thing that broke. Watch for these meta-failures:
- Scrape gaps. If Prometheus cannot scrape a process (the /metrics endpoint hangs, the pod is unreachable, the scrape times out under cardinality pressure), no samples are recorded for that interval, and rate() silently produces gaps. A reject-ratio panel with a gap looks calmer than a panel showing a problem. Alert on up == 0 for the limiter’s job and on scrape duration approaching the interval, so a blind spot pages instead of hiding.
- Counter reset misread. Querying a counter with a raw subtraction instead of rate()/increase() produces a huge negative spike on every deploy, which then triggers false alerts. Every limiter query over a counter must use a rate function; this is not optional polish, it is correctness.
- Label drift between code and dashboards. Rename a label value (blocked→denied) in code without updating PromQL, and every panel and alert keyed on the old value silently goes to zero, looking, again, healthier than reality. Pin label values as enumerations in code and treat changing one as a breaking change to the dashboards downstream.
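Pinning label values as enumerations can be as light as one constant and a derived type. The names below are hypothetical; the sketch only assumes the two decision values and the metric name used throughout this guide.

```typescript
// The single source of truth for the `decision` label values.
const DECISIONS = ["allowed", "blocked"] as const;
type Decision = (typeof DECISIONS)[number];

// Runtime guard for values that arrive as plain strings (config, tests, tooling).
function isDecision(value: string): value is Decision {
  return (DECISIONS as readonly string[]).includes(value);
}

// Build the PromQL selector from the same enumeration, so generated
// dashboards and alerts cannot reference a value the code no longer emits.
function decisionSelector(decision: Decision): string {
  return `ratelimit_requests_total{decision="${decision}"}`;
}
```

With the type in place, renaming blocked to denied becomes a compile error at every call site that still passes the old value, rather than a panel that quietly reads zero.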
The unifying lesson across both tables: in limiter observability the dangerous failures present as calm, not as alarm. A spike is loud and self-announcing; a gap, a silent fail-open, or a label that drifted to zero all make the dashboard look better. Build the fail-open counter, the up alert, and the rate-function discipline specifically to make these quiet failures loud.
Child guides
- Prometheus Metrics for Rate Limiting: the exact prom-client/prometheus_client instruments, histogram buckets, /metrics exposition, and scrape config, as a runnable how-to.
- Grafana Rate Limit Dashboards: the panels (429 rate, top limited routes, utilization heatmap, latency p99, store-error rate) and the PromQL behind each.
Related
- Observability & Operations — the parent area covering response headers, metrics, and alerting.
- Alerting & SLOs — turning these metrics into alerts and an error budget for the limiter.
- Redis Counter Architecture — the backing store whose errors and latency you are measuring.
- Prometheus Metrics for Rate Limiting — instrument a limiter step by step.