Alerting on 429 Error Rates

This guide is the rule file: the actual Prometheus alerting YAML and PromQL that turn rate-limiter metrics into pages and tickets. It sits under the Alerting & SLOs guide, which explains why a 429 spike can be healthy and how the error budget works; here you write the four rules that matter — 429-ratio spike by tier, store-error rate, fail-open detection, and multi-window multi-burn-rate — and wire their severity and routing. Every rule queries the ratelimit_* series from Prometheus metrics for rate limiting.

The traffic you are alerting on

Take a limiter at 2,000 rps with a 99.95% / 5 ms decision SLO over 30 days — a 0.05% budget, about 21.6 minutes of bad decisions per month. Normal block rate sits at 3–5%, almost all in anonymous/free. The rules below must page when paying users get blocked or the limiter fails open, and stay silent when a scraper is being shed correctly.

Alert	Fires when	Severity
`RateLimitFailOpen`	Fail-open rate > 0 sustained, with store errors	Page
`RateLimitStoreErrorBurn`	Store-error ratio above 1%, sustained for 5 m	Page
`RateLimit429SpikePaid`	Block ratio high on `pro`/`enterprise`	Page
`RateLimit429SpikeFree`	Block ratio high but only on `anonymous`/`free`	Ticket / suppressed
`RateLimitBudgetBurnFast` / `Slow`	Error budget burning at 14.4× / 3×	Page / Ticket
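As a rough check on the budget figures quoted for this example limiter, the arithmetic can be worked through as below. It assumes a constant 2,000 rps for the full 30 days, which is a simplification; real traffic varies, so the decision-count figure is indicative only.

```latex
\begin{align*}
\text{budget fraction} &= 1 - 0.9995 = 0.0005 \\
\text{budget in time} &= 0.0005 \times 30 \times 24 \times 60\ \text{min} = 21.6\ \text{min} \\
\text{decisions per 30 d} &= 2000\ \text{s}^{-1} \times 2{,}592{,}000\ \text{s} = 5.184 \times 10^{9} \\
\text{budget in decisions} &= 0.0005 \times 5.184 \times 10^{9} = 2.592 \times 10^{6}
\end{align*}
```

Because the SLO is counted per decision, the decision figure is the more faithful view of the budget; the 21.6 minutes is the equivalent if every decision in that span were bad.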

Operator checklist

Confirm `ratelimit_requests_total`, `ratelimit_store_errors_total`, and `ratelimit_fail_open_total` are being scraped.
Add recording rules for the block ratio and the SLO error rate (cheaper, reusable in alerts).
Write the fail-open page rule requiring a corroborating store-error signal.
Write the 429-spike rule split by `key_class` so free-tier shedding does not page.
Add the multi-window multi-burn-rate budget rules at page and ticket severities.
Set `for:` durations so transient blips do not fire.
Route by severity label to pager vs Slack; include `route`, `tier`, and burn rate in annotations.
Verify by forcing a store outage in staging and watching the right alert — and only the right alert — fire.

Step 1 — Recording rules

Recording rules pre-compute the ratios so alert expressions stay readable and cheap. Evaluate them at the scrape interval.

groups:
  - name: ratelimit_recording
    interval: 15s
    rules:
      # Fleet block ratio, and the same split by key class.
      - record: ratelimit:block_ratio:5m
        expr: |
          sum(rate(ratelimit_requests_total{decision="blocked"}[5m]))
          / sum(rate(ratelimit_requests_total[5m]))
      - record: ratelimit:block_ratio_by_tier:5m
        expr: |
          sum by (tier) (rate(ratelimit_requests_total{decision="blocked"}[5m]))
          / sum by (tier) (rate(ratelimit_requests_total[5m]))
      # SLO bad-event ratio: fail-open + store errors over total. One window per burn rule.
      - record: ratelimit:slo_error_ratio:5m
        expr: |
          (sum(rate(ratelimit_fail_open_total[5m])) + sum(rate(ratelimit_store_errors_total[5m])))
          / sum(rate(ratelimit_requests_total[5m]))
      - record: ratelimit:slo_error_ratio:1h
        expr: |
          (sum(rate(ratelimit_fail_open_total[1h])) + sum(rate(ratelimit_store_errors_total[1h])))
          / sum(rate(ratelimit_requests_total[1h]))
      - record: ratelimit:slo_error_ratio:6h
        expr: |
          (sum(rate(ratelimit_fail_open_total[6h])) + sum(rate(ratelimit_store_errors_total[6h])))
          / sum(rate(ratelimit_requests_total[6h]))

Step 2 — Fail-open detection (the rule that must never be missing)

Fail-open is invisible except through its counter. Page on a sustained non-zero fail-open rate, corroborated by store errors so a single odd increment does not wake anyone.

  - name: ratelimit_failopen
    rules:
      - alert: RateLimitFailOpen
        expr: |
          sum(rate(ratelimit_fail_open_total[5m])) > 0
          and
          sum(rate(ratelimit_store_errors_total[5m])) > 0
        for: 2m
        labels: { severity: page, component: rate-limiter }
        annotations:
          summary: "Rate limiter is failing open — no limiting in effect"
          description: "fail-open rate {{ $value | humanize }}/s with store errors present. The backend is unprotected; investigate the store immediately."

Step 3 — 429 spike, split by who is blocked

The whole point of the Alerting & SLOs decision table is that a free-tier spike is fine and a paid-tier spike is not. Encode that split so you never page on healthy shedding.

  - name: ratelimit_429
    rules:
      # Paid tiers blocked -> page. Legitimate paying users are being rejected.
      - alert: RateLimit429SpikePaid
        expr: |
          ratelimit:block_ratio_by_tier:5m{tier=~"pro|enterprise"} > 0.05
        for: 5m
        labels: { severity: page, component: rate-limiter }
        annotations:
          summary: "Elevated 429s on paid tier {{ $labels.tier }}"
          description: "Block ratio {{ $value | humanizePercentage }} on {{ $labels.tier }}. Likely a too-tight limit or a bad config change."

      # Free/anonymous spike -> ticket only. Usually abuse being shed correctly.
      - alert: RateLimit429SpikeFree
        expr: |
          ratelimit:block_ratio_by_tier:5m{tier=~"anonymous|free"} > 0.5
        for: 15m
        labels: { severity: ticket, component: rate-limiter }
        annotations:
          summary: "Sustained heavy shedding on {{ $labels.tier }}"
          description: "Block ratio {{ $value | humanizePercentage }} on {{ $labels.tier }} — likely abuse. Confirm it is not a misclassified legitimate client."

Step 4 — Multi-window multi-burn-rate budget alerts

This is the rule set that protects the error budget without false positives. Each severity combines a long window (is the burn real and sustained?) with a short window (is it still happening now?). Both must exceed the burn-rate threshold, expressed as threshold = burn_rate × (1 − SLO) = burn_rate × 0.0005.
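For the 99.95% SLO used in this guide, the formula above gives the following concrete numbers. This is only a worked instance of the arithmetic; the 720 h figure is the 30-day SLO window expressed in hours.

```latex
\begin{align*}
\text{threshold}_{\text{fast}} &= 14.4 \times (1 - 0.9995) = 14.4 \times 0.0005 = 0.0072 \quad (0.72\%) \\
\text{threshold}_{\text{slow}} &= 3 \times 0.0005 = 0.0015 \quad (0.15\%) \\
t_{\text{exhaust, fast}} &= \frac{30\ \text{d}}{14.4} \approx 2.08\ \text{d} \\
t_{\text{exhaust, slow}} &= \frac{30\ \text{d}}{3} = 10\ \text{d} \\
\text{budget spent in 1 h at } 14.4\times &= \frac{14.4 \times 1\ \text{h}}{720\ \text{h}} = 2\% \\
\text{budget spent in 6 h at } 3\times &= \frac{3 \times 6\ \text{h}}{720\ \text{h}} = 2.5\%
\end{align*}
```

At 2,000 rps these thresholds correspond to roughly 14.4 and 3 bad decisions per second, which is a useful sanity check when reading the raw counters during an incident.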

  - name: ratelimit_burnrate
    rules:
      # Fast burn: 14.4x. At this rate the 30-day budget is gone in ~2 days. Page.
      - alert: RateLimitBudgetBurnFast
        expr: |
          ratelimit:slo_error_ratio:1h > (14.4 * 0.0005)
          and
          ratelimit:slo_error_ratio:5m > (14.4 * 0.0005)
        for: 2m
        labels: { severity: page, component: rate-limiter }
        annotations:
          summary: "Rate limiter error budget burning fast (14.4x)"
          description: "Limiter SLO error ratio {{ $value | humanizePercentage }} over 1h (5m ratio also above threshold). Budget exhausts in ~2 days at this rate."

      # Slow burn: 3x. Budget gone in ~10 days. Ticket, not a page.
      - alert: RateLimitBudgetBurnSlow
        expr: |
          ratelimit:slo_error_ratio:6h > (3 * 0.0005)
          and
          ratelimit:slo_error_ratio:1h > (3 * 0.0005)
        for: 15m
        labels: { severity: ticket, component: rate-limiter }
        annotations:
          summary: "Rate limiter error budget burning slowly (3x)"
          description: "Limiter SLO error ratio {{ $value | humanizePercentage }} over 6h (1h ratio also above threshold). Investigate during business hours."

The short window in each rule is what makes the alert resolve promptly. Once you fix the store, the 5 m ratio falls below threshold within minutes and the fast-burn page clears, even though the 1 h window is still elevated. The slow-burn ticket clears once its 1 h short window recovers, which can take up to an hour.

Step 5 — Store-error rate

A standalone store-error rule catches degradation before it becomes fail-open, giving you a head start.

  - name: ratelimit_store
    rules:
      - alert: RateLimitStoreErrorBurn
        expr: |
          sum(rate(ratelimit_store_errors_total[5m]))
          / sum(rate(ratelimit_requests_total[5m])) > 0.01
        for: 5m
        labels: { severity: page, component: rate-limiter }
        annotations:
          summary: "Rate limiter store error rate above 1%"
          description: "Store-error ratio {{ $value | humanizePercentage }}. The limiter is degrading and may fail open."

Step 6 — Routing and severity

Route on the severity label in Alertmanager: page to the pager, ticket to Slack/issue tracker. Group by component so a store incident does not fan out into five simultaneous pages.

route:
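  # Alertmanager requires a default receiver on the root route; using the Slack receiver here is an assumption.
  receiver: slack-rate-limiting
  group_by: ["component"]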
  routes:
    - matchers: [severity="page"]
      receiver: pagerduty
      group_wait: 30s
    - matchers: [severity="ticket"]
      receiver: slack-rate-limiting
      group_wait: 5m

Verification & testing

Validate the rule syntax and logic, then prove it in staging.

# 1. Lint the rules.
promtool check rules ratelimit-alerts.yml

# 2. Unit-test the burn-rate logic with synthetic series.
promtool test rules ratelimit-alerts.test.yml

# 3. In staging, force a store outage (block Redis) and confirm THIS goes non-zero:
sum(rate(ratelimit_fail_open_total[5m]))
# RateLimitFailOpen should fire within ~2m, and RateLimit429SpikeFree should NOT.

The test that matters most: trigger a free-tier flood and confirm no page fires, then trigger a store outage and confirm RateLimitFailOpen fires. An alerting setup that pages on healthy shedding is worse than none, because it trains responders to ignore the limiter.
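The two scenarios above can also be approximated offline with promtool before touching staging. The following is a minimal sketch of what `ratelimit-alerts.test.yml` might contain, not a definitive test suite. It assumes the rule groups from Steps 1 to 5 are assembled under a single `groups:` key in `ratelimit-alerts.yml`, and that admitted requests carry `decision="allowed"` (a hypothetical label value; only `decision="blocked"` appears in the rules). The synthetic rates are invented for illustration.

```yaml
# ratelimit-alerts.test.yml (sketch). Run with: promtool test rules ratelimit-alerts.test.yml
rule_files:
  - ratelimit-alerts.yml

evaluation_interval: 1m

# Evaluate the recording rules before the alert groups that read them.
group_eval_order:
  - ratelimit_recording
  - ratelimit_failopen
  - ratelimit_429
  - ratelimit_burnrate
  - ratelimit_store

tests:
  # Scenario 1: free-tier flood with a healthy store. No page should fire.
  - name: free_tier_flood_does_not_page
    interval: 1m
    input_series:
      # free tier: 600 blocked vs 400 allowed per minute -> 60% block ratio
      - series: 'ratelimit_requests_total{tier="free", decision="blocked"}'
        values: '0+600x30'
      - series: 'ratelimit_requests_total{tier="free", decision="allowed"}'  # assumed label value
        values: '0+400x30'
      # pro tier: 10 blocked vs 990 allowed per minute -> 1% block ratio, under the 5% threshold
      - series: 'ratelimit_requests_total{tier="pro", decision="blocked"}'
        values: '0+10x30'
      - series: 'ratelimit_requests_total{tier="pro", decision="allowed"}'   # assumed label value
        values: '0+990x30'
    alert_rule_test:
      # Leaving exp_alerts out asserts that nothing is firing under this alertname.
      - eval_time: 30m
        alertname: RateLimit429SpikePaid
      - eval_time: 30m
        alertname: RateLimitFailOpen

  # Scenario 2: store outage. Fail-open and store errors both climb; the pages should fire.
  - name: store_outage_pages
    interval: 1m
    input_series:
      - series: 'ratelimit_requests_total{tier="pro", decision="allowed"}'   # assumed label value
        values: '0+120000x15'   # 2,000 rps
      - series: 'ratelimit_fail_open_total'
        values: '0+1200x15'     # 20/s
      - series: 'ratelimit_store_errors_total'
        values: '0+1200x15'     # 20/s
    promql_expr_test:
      # Query the synthetic ALERTS series so the test does not depend on annotation text.
      - expr: 'ALERTS{alertname="RateLimitFailOpen", alertstate="firing"}'
        eval_time: 10m
        exp_samples:
          - labels: 'ALERTS{alertname="RateLimitFailOpen", alertstate="firing", component="rate-limiter", severity="page"}'
            value: 1
      - expr: 'ALERTS{alertname="RateLimitBudgetBurnFast", alertstate="firing"}'
        eval_time: 10m
        exp_samples:
          - labels: 'ALERTS{alertname="RateLimitBudgetBurnFast", alertstate="firing", component="rate-limiter", severity="page"}'
            value: 1
```

Asserting on `ALERTS` rather than `exp_alerts` is a deliberate choice in this sketch: `exp_alerts` compares annotations as well as labels, so any rewording of a summary would break the test. If you want annotation coverage too, add `exp_annotations` with the exact rendered strings. A unit test like this complements the staging exercise but does not replace it, since it cannot tell you whether the limiter actually increments the counters.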

Gotchas & edge cases

Never page on raw 429 rate. Always split by key_class. A fleet-wide 429 threshold pages on every scraper and gets muted within a week.
The fail-open rule needs the fail-open counter. If the limiter does not increment ratelimit_fail_open_total on the store-error path, this entire guide is blind. Verify that instrumentation first.
Burn-rate thresholds scale with your SLO. The 0.0005 factor is 1 − 0.9995. Change the SLO, change every threshold, or the burn rates silently mean something else.
for: must be shorter than the short window’s signal. A for: 10m on a 5 m short window can delay a real page past the point of usefulness. Keep for: at 2 m for page-severity burn alerts.
Recording rules avoid re-computation drift. If two alerts compute the block ratio inline with slightly different expressions, they will disagree. Compute once in a recording rule.
Mind counter resets at deploy. rate() handles resets, but a rolling deploy can briefly inflate fail-open if the limiter starts before the store connection is ready. A 2 m for: absorbs this.

Frequently Asked Questions

Why not just alert when the 429 rate crosses a threshold?

Because a high 429 rate is often the limiter working correctly — shedding abuse on the free tier. A flat threshold pages on every scraper and trains responders to ignore the alert. Split by key_class so paid-tier blocks page and free-tier shedding stays quiet, and reserve pages for fail-open and store errors.

What burn rates and windows should I use?

Start from the Google SRE Workbook's multi-window pattern: 14.4× over 1h/5m for a fast-burn page, which is the workbook's own default, and 3× over 6h/1h for a slow-burn ticket, which is this guide's choice (the workbook itself suggests 6× over 6h/30m and 1× over 3d/6h). The threshold for each is the burn rate times (1 − SLO). Both the long and short window must exceed the threshold for the alert to fire, which suppresses blips and lets the alert resolve quickly.

How do I alert on fail-open when its symptom is the absence of 429s?

You cannot alert on an absence, so you alert on the positive ratelimit_fail_open_total counter that the limiter increments whenever it admits a request because the store was unreachable. Require a corroborating non-zero store-error rate so a single increment does not page, and set a short for: so it fires fast.

Should a 429 count against the limiter's SLO?

No — not when the client genuinely exceeded its quota. Returning 429 is the limiter's job. The SLO bad events are fail-open, fail-closed-while-under-quota, and over-latency decisions. That is why the burn-rate rules query fail-open and store errors, not the 429 rate.

Alerting & SLOs — the parent guide on healthy-vs-broken 429s and the error-budget model.
Metrics & Instrumentation — the metrics every rule here queries.
Prometheus Metrics for Rate Limiting — emit the series these alerts depend on.
Grafana Rate Limit Dashboards — the panels you open when one of these alerts fires.