Prometheus Metrics for Rate Limiting

This guide is the concrete build: take a working rate limiter and add the Prometheus instruments that let you see what it is doing in production. It sits under the Metrics & Instrumentation guide, which explains what to measure and why cardinality control matters; here you write the actual prom-client (Node) and prometheus_client (Python) code, choose histogram buckets, expose /metrics, and configure the scrape. The end state is a limiter whose allow/block ratio, decision latency, store health, and fail-open events are all queryable in PromQL.

What you are wiring up

A typical limiter handles, say, 2,000 rps across 6 pods, blocking maybe 3–5% of requests under normal load and spiking to 30%+ when a scraper hits the anonymous tier. You want to see that block ratio per route and tier, watch decision latency stay under 2 ms p99, and get an unmistakable signal the moment the limiter starts failing open. Four instruments cover it.

Instrument	Prometheus type	Answers the question
`ratelimit_requests_total`	Counter	What fraction of requests are we blocking, by route and tier?
`ratelimit_decision_duration_seconds`	Histogram	How slow is the allow/deny decision at p99?
`ratelimit_store_errors_total`	Counter	Is the backing store (Redis) erroring?
`ratelimit_fail_open_total`	Counter	Are we admitting traffic because the store is down?

Operator checklist

Add the Prometheus client library (`prom-client` for Node, `prometheus_client` for Python).
Define the four instruments once at module load, never per-request.
Label only by bounded values: `route` template, `method`, `tier`, `decision`. Never label by raw API key.
Set explicit histogram buckets sized for your decision latency (sub-ms to tens of ms).
Increment `ratelimit_fail_open_total` in the store-error path so fail-open is visible.
Expose `/metrics` on an internal port or route, not the public API surface.
Add a scrape job in `prometheus.yml` targeting every instance.
Verify with `curl /metrics` and a PromQL query before declaring it done.
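
The "bounded values" item is the one that tends to slip in practice, so here is a minimal sketch of one way to enforce it before a value ever reaches a label. The allow-lists, the helper names `bounded` and `label_values`, and the `other` fallback are all illustrative assumptions, not part of either client library.

```python
# Hypothetical allow-lists; substitute the tiers and methods your API really has.
ALLOWED_TIERS = frozenset({"anonymous", "free", "pro", "enterprise"})
ALLOWED_METHODS = frozenset({"GET", "POST", "PUT", "PATCH", "DELETE"})


def bounded(value: str, allowed: frozenset, fallback: str = "other") -> str:
    """Return value if it is on the allow-list, else one shared fallback."""
    return value if value in allowed else fallback


def label_values(route_template: str, method: str, tier: str) -> dict:
    """Build a label dict whose cardinality is capped by the allow-lists."""
    return {
        "route": route_template,  # caller must pass the matched template
        "method": bounded(method.upper(), ALLOWED_METHODS),
        "tier": bounded(tier, ALLOWED_TIERS),
    }
```

Anything unexpected, including an API key passed by mistake as the tier, collapses into a single `other` series instead of creating a new one per value.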

Step 1 — Define instruments (Node, prom-client)

Create the registry and instruments once. Defining a metric twice throws, so keep this in a single module.

// metrics.ts — define once, import everywhere.
import { Counter, Histogram, register, collectDefaultMetrics } from "prom-client";

collectDefaultMetrics({ register }); // process + GC metrics, optional but useful

export const rlRequests = new Counter({
  name: "ratelimit_requests_total",
  help: "Rate limiter decisions by route, method, tier and outcome",
  labelNames: ["route", "method", "tier", "decision"] as const,
});

export const rlDecision = new Histogram({
  name: "ratelimit_decision_duration_seconds",
  help: "Wall-clock time of the allow/deny decision incl. store round-trip",
  labelNames: ["route", "decision"] as const,
  // Sub-ms local checks (0.5ms) through slow Redis round-trips (100ms).
  buckets: [0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1],
});

export const rlStoreErrors = new Counter({
  name: "ratelimit_store_errors_total",
  help: "Backing-store failures by operation and error class",
  labelNames: ["op", "error"] as const,
});

export const rlFailOpen = new Counter({
  name: "ratelimit_fail_open_total",
  help: "Requests admitted because the store was unreachable",
  labelNames: ["route", "reason"] as const,
});

Recommended histogram buckets

Buckets are the one tuning decision unique to histograms. Prometheus stores a cumulative count per bucket, and histogram_quantile() interpolates within the bucket the quantile falls in — so a quantile is only as precise as the bucket boundaries near it. Place boundaries where you actually care.
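
To make the interpolation concrete, the following is a simplified sketch of the linear interpolation that histogram_quantile() performs over cumulative bucket counts. It is not the Prometheus implementation; the function name and the bucket counts are made up for illustration.

```python
import math


def quantile_from_buckets(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, cum) in enumerate(buckets) if cum >= rank)
    upper, cum = buckets[idx]
    if math.isinf(upper):
        # Quantile lies beyond the last finite boundary: report that boundary.
        return buckets[-2][0]
    lower = buckets[idx - 1][0] if idx > 0 else 0.0
    prev = buckets[idx - 1][1] if idx > 0 else 0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev) / (cum - prev)


# Hypothetical cumulative counts for the decision-latency buckets.
example = [
    (0.0005, 100), (0.001, 600), (0.0025, 900), (0.005, 980), (0.01, 995),
    (0.025, 999), (0.05, 1000), (0.1, 1000), (math.inf, 1000),
]
p99 = quantile_from_buckets(0.99, example)  # about 0.0083 s
```

With these counts the p99 lands in the 5 ms to 10 ms bucket, so the reported 8.3 ms is an estimate: the true value could be anywhere in that range. A boundary at 7.5 ms would halve the uncertainty.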

Metric	Buckets (seconds)	Rationale
`ratelimit_decision_duration_seconds`	`0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1`	Tight resolution from 0.5 ms (in-memory) through ~5 ms (healthy Redis) to 100 ms (degraded)
`ratelimit_store_duration_seconds`	`0.0005, 0.001, 0.002, 0.005, 0.01, 0.05, 0.25`	Redis round-trip lives near 1 ms; coarse tail catches timeouts

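Roughly eight to ten buckets is the sweet spot. Each bucket is a separate time series per label combination, so resist adding more “just in case”: that is cardinality spend with no payoff. The second row applies only if you also time the store call on its own; ratelimit_store_duration_seconds is not one of the four instruments defined in this guide.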
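
The cost is easy to put a number on. A rough sketch of the arithmetic, assuming 20 route templates and 2 decision values (both figures are illustrative) and counting the series a classic histogram exposes per label combination: one per explicit bucket, the +Inf bucket, plus _sum and _count. Some client libraries expose additional series, so treat the result as a lower bound.

```python
import math
from typing import Iterable


def histogram_series(n_buckets: int, label_cardinalities: Iterable[int]) -> int:
    """Series exposed by one histogram on one instance."""
    combos = math.prod(label_cardinalities)
    per_combo = n_buckets + 1 + 2  # explicit buckets, +Inf, _sum and _count
    return combos * per_combo


ROUTES, DECISIONS = 20, 2  # assumed label cardinalities

eight = histogram_series(8, [ROUTES, DECISIONS])    # 440 series per instance
twenty = histogram_series(20, [ROUTES, DECISIONS])  # 920 series per instance
```

Across the 6 pods from the earlier scenario that is 2,640 series for eight buckets versus 5,520 for twenty, for a single metric.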

Step 2 — Instrument the decision path

Wrap the limiter check so it times the decision, records the outcome, and — critically — counts fail-open in the error branch.

// limiter.ts
import { rlRequests, rlDecision, rlStoreErrors, rlFailOpen } from "./metrics";

export async function decide(
  route: string, method: string, tier: string,
  check: () => Promise<{ ok: boolean }>,
): Promise<boolean> {
  const stop = rlDecision.startTimer({ route });   // decision label set at stop()
  try {
    const { ok } = await check();
    const decision = ok ? "allowed" : "blocked";
    rlRequests.inc({ route, method, tier, decision });
    stop({ decision });
    return ok;
  } catch (err) {
    const e = err as Error;
    rlStoreErrors.inc({ op: "decide", error: e.name || "unknown" });
    rlFailOpen.inc({ route, reason: e.name || "unknown" }); // make the gap visible
    stop({ decision: "allowed" });
    return true; // fail open
  }
}

Step 3 — Define instruments in Python (prometheus_client)

The same four instruments in prometheus_client, suitable for FastAPI, Django, or Flask. Buckets pass through the buckets= kwarg.

# metrics.py — define once at import time.
from prometheus_client import Counter, Histogram

rl_requests = Counter(
    "ratelimit_requests_total",
    "Rate limiter decisions by route, method, tier and outcome",
    ["route", "method", "tier", "decision"],
)
rl_decision = Histogram(
    "ratelimit_decision_duration_seconds",
    "Time of the allow/deny decision incl. store round-trip",
    ["route", "decision"],
    buckets=(0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1),
)
rl_store_errors = Counter(
    "ratelimit_store_errors_total",
    "Backing-store failures by operation and error class",
    ["op", "error"],
)
rl_fail_open = Counter(
    "ratelimit_fail_open_total",
    "Requests admitted because the store was unreachable",
    ["route", "reason"],
)

# limiter.py
import time
from metrics import rl_requests, rl_decision, rl_store_errors, rl_fail_open

async def decide(route: str, method: str, tier: str, check) -> bool:
    start = time.perf_counter()
    try:
        ok = await check()                       # returns bool
        decision = "allowed" if ok else "blocked"
        rl_requests.labels(route, method, tier, decision).inc()
        rl_decision.labels(route, decision).observe(time.perf_counter() - start)
        return ok
    except Exception as exc:                       # store unreachable
        name = type(exc).__name__
        rl_store_errors.labels("decide", name).inc()
        rl_fail_open.labels(route, name).inc()     # count the fail-open
        rl_decision.labels(route, "allowed").observe(time.perf_counter() - start)
        return True                                # fail open

Step 4 — Expose `/metrics`

Prometheus pulls; you serve. Keep the endpoint off the public router so customers cannot scrape your internals.

// Node: serve the registry. Bind to an internal port in production.
import express from "express";
import { register } from "./metrics";
const ops = express();
ops.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});
ops.listen(9090); // internal-only ops port; 9090 is also Prometheus's own default, so change it if they share a host

# Python: mount the ASGI app (FastAPI shown) on an internal route.
from fastapi import FastAPI
from prometheus_client import make_asgi_app

app = FastAPI()                      # or import your existing application object
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)   # restrict via network policy / auth

Step 5 — Configure the scrape

Add a job to prometheus.yml that targets every limiter instance. Service discovery (Kubernetes, Consul) replaces the static list in real deployments, but the shape is identical.

scrape_configs:
  - job_name: rate-limiter
    scrape_interval: 15s          # 15s is a sane default; 10s for tighter alerting
    metrics_path: /metrics
    static_configs:
      - targets:
          - "limiter-1.internal:9090"
          - "limiter-2.internal:9090"
          - "limiter-3.internal:9090"
        labels:
          service: api-gateway

A 15-second scrape interval means your alerting resolution is ~15 s; if you run multi-window burn-rate alerts (see Alerting on 429 Error Rates), 10 s gives the short window more samples to work with.
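
The sample-count argument can be checked with simple arithmetic. This sketch assumes a 1-minute short window and a 5-minute long window, which are illustrative choices; the real count can differ by one depending on how scrapes align with the window.

```python
def samples_in_window(window_s: int, scrape_interval_s: int) -> int:
    """Approximate number of scraped samples that fall inside a range window."""
    return window_s // scrape_interval_s


short_15s = samples_in_window(60, 15)   # 4 samples
short_10s = samples_in_window(60, 10)   # 6 samples
long_15s = samples_in_window(300, 15)   # 20 samples
```

rate() needs at least two samples inside the range to produce a value, so with a 15 s interval a 1-minute window leaves little slack for a missed scrape or two.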

Verification & testing

First confirm the endpoint emits the series with the expected labels:

# Should list ratelimit_* metrics with HELP/TYPE headers and label sets.
curl -s localhost:9090/metrics | grep -E '^ratelimit_'
# Example expected line:
# ratelimit_requests_total{decision="blocked",method="GET",route="/v1/search",tier="free"} 142

Then confirm Prometheus is scraping and the math works. Drive some load, then run the block-ratio query in the Prometheus expression browser:

# Fleet-wide block ratio over 5 minutes — should be small under normal load.
sum(rate(ratelimit_requests_total{decision="blocked"}[5m]))
  /
sum(rate(ratelimit_requests_total[5m]))

# Decision latency p99 across the fleet — should sit in single-digit milliseconds.
histogram_quantile(0.99,
  sum by (le) (rate(ratelimit_decision_duration_seconds_bucket[5m])))

If the block-ratio query returns a value and the p99 is plausible, the pipeline is sound. If rate(ratelimit_fail_open_total[5m]) is ever above zero, the limiter is admitting unmetered traffic — investigate the store before trusting any other number.
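
The endpoint check above can also be automated, for example in a deploy smoke test. The following is a minimal sketch that scans the response body for the `# TYPE` lines of the four instruments. It assumes the classic Prometheus text format (in OpenMetrics output the TYPE line for a counter omits the `_total` suffix), and the function names are illustrative.

```python
EXPECTED = {
    "ratelimit_requests_total": "counter",
    "ratelimit_decision_duration_seconds": "histogram",
    "ratelimit_store_errors_total": "counter",
    "ratelimit_fail_open_total": "counter",
}


def declared_types(exposition: str) -> dict:
    """Map metric name to declared type from '# TYPE <name> <type>' lines."""
    types = {}
    for line in exposition.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            types[parts[2]] = parts[3]
    return types


def missing_metrics(exposition: str) -> list:
    """Names from EXPECTED that are absent or declared with the wrong type."""
    found = declared_types(exposition)
    return sorted(name for name, kind in EXPECTED.items() if found.get(name) != kind)
```

Feed it the body returned by the curl command; an empty list means all four instruments are declared with the expected types.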

Gotchas & edge cases

Define instruments once. Registering the same metric name twice on one registry (for example inside a request handler) throws in prom-client and raises ValueError in prometheus_client. Module-level definition only.
Label names are fixed at definition; label values are supplied at observe/inc. Starting a histogram timer with one label (route) and supplying another (decision) when it stops is correct, but every label must resolve to a bounded value.
route must be the matched template. Use req.route?.path in Express or request.scope["route"].path in FastAPI, never the raw URL, or cardinality explodes.
Counter resets are normal. Always query counters through rate()/increase(), which handle restarts; never subtract raw counter values.
Default metrics are optional but cheap. collectDefaultMetrics adds process/GC series that help correlate latency creep with GC pauses.
Protect /metrics. It leaks route names and traffic shape. Bind it to an internal interface or require auth.
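
To see why raw subtraction breaks on a counter reset, here is a simplified sketch of the reset handling behind rate() and increase(). It leaves out the extrapolation to the window edges that Prometheus also applies, and the sample values are invented.

```python
def increase(samples: list) -> float:
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a restart the counter begins again at zero, so the whole
        # current value is new growth.
        total += cur - prev if cur >= prev else cur
    return total


# The process restarts between the third and fourth scrape.
scraped = [100, 150, 200, 20, 70]
reset_aware = increase(scraped)   # 170.0
naive = scraped[-1] - scraped[0]  # -30
```

The naive subtraction reports a negative count; the reset-aware version recovers the 170 requests that were actually counted across the scrapes.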

Frequently Asked Questions

Should I use a counter or a gauge for blocked requests?

A counter. Blocked requests only ever increase, and you derive the per-second block rate with rate() and the block ratio by dividing two counters. A gauge would lose information between scrapes and could not be summed across instances.
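
A short sketch of why the ratio should come from summed counters rather than from per-instance ratios. The per-pod deltas are invented, and the function names are illustrative.

```python
def fleet_block_ratio(pods: list) -> float:
    """pods holds (blocked_delta, total_delta) per instance for one window."""
    blocked = sum(b for b, _ in pods)
    total = sum(t for _, t in pods)
    return blocked / total if total else 0.0


def mean_of_pod_ratios(pods: list) -> float:
    """Average of per-instance ratios; ignores how much traffic each pod saw."""
    ratios = [b / t for b, t in pods if t]
    return sum(ratios) / len(ratios) if ratios else 0.0


# One busy pod blocking 1%, one quiet pod blocking 30%.
deltas = [(90, 9000), (300, 1000)]
correct = fleet_block_ratio(deltas)      # 390 / 10000
misleading = mean_of_pod_ratios(deltas)  # (0.01 + 0.30) / 2
```

Summing first gives 3.9%, the fraction of all requests that were blocked. Averaging the per-pod ratios gives 15.5%, which overstates the quiet pod. This is the same reason the PromQL applies sum() to each side before dividing.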

How many histogram buckets should I define?

Eight to ten, placed where your latency actually lives — tight resolution around the sub-millisecond to few-millisecond range for a Redis-backed decision, with a couple of coarse tail buckets to catch timeouts. Each bucket is a separate series per label combination, so more buckets is real cardinality cost.

Why must I never label metrics by API key?

API keys have unbounded cardinality: every distinct key creates a new time series, and the count grows with your customer base. With hundreds of thousands of keys this exhausts memory on both the scraped process and the Prometheus server. Label by the fixed tier (free/pro/enterprise), the same tier label used on ratelimit_requests_total, and answer per-key questions from logs instead.

What scrape interval should I use?

15 seconds is a sound default. Drop to 10 seconds if you run multi-window burn-rate alerts that need more samples in the short window. Going below 10 seconds rarely helps and increases load on both the targets and Prometheus.

Metrics & Instrumentation — the parent guide on what to measure and cardinality control.
Grafana Rate Limit Dashboards — build panels from these exact metrics.
Alerting on 429 Error Rates — write Prometheus alerting rules against these series.
Redis Counter Architecture — the store whose latency and errors you are recording.