Alerting on 429 Error Rates

This guide is the rule file: the actual Prometheus alerting YAML and PromQL that turn rate-limiter metrics into pages and tickets. It sits under the Alerting & SLOs guide, which explains why a 429 spike can be healthy and how the error budget works. Here you write the four rule families that matter (429-ratio spike by tier, store-error rate, fail-open detection, and multi-window multi-burn-rate budget alerts) and wire up their severity and routing. Every rule queries the ratelimit_* series described in the Prometheus metrics for rate limiting guide.

The traffic you are alerting on

Take a limiter at 2,000 rps with a 99.95% / 5 ms decision SLO over 30 days — a 0.05% budget, about 21.6 minutes of bad decisions per month. Normal block rate sits at 3–5%, almost all in anonymous/free. The rules below must page when paying users get blocked or the limiter fails open, and stay silent when a scraper is being shed correctly.
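The budget figure follows directly from the SLO and the window. A short worked version, assuming traffic stays at a constant 2,000 rps for the whole 30 days (an idealization made only for the arithmetic):

```latex
\begin{aligned}
\text{budget fraction} &= 1 - 0.9995 = 0.0005 \\
\text{budget as time} &= 0.0005 \times 30 \times 24 \times 60\ \text{min} = 21.6\ \text{min} \\
\text{budget as decisions} &= 0.0005 \times 2000\ \tfrac{\text{req}}{\text{s}} \times 2{,}592{,}000\ \text{s} = 2{,}592{,}000\ \text{bad decisions}
\end{aligned}
```

The time figure is the equivalent of total failure. In practice the budget is spent as a fraction of decisions, which is what the ratio-based rules in this guide measure.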

| Alert | Fires when | Severity |
|---|---|---|
| RateLimitFailOpen | Fail-open rate > 0 sustained, with store errors | Page |
| RateLimitStoreErrorBurn | Store-error ratio above 1% for 5 m | Page |
| RateLimit429SpikePaid | Block ratio high on pro/enterprise | Page |
| RateLimit429SpikeFree | Block ratio high but only on anonymous/free | Ticket / suppressed |
| RateLimitBudgetBurnFast | Error budget burning at 14.4× (1h and 5m windows) | Page |
| RateLimitBudgetBurnSlow | Error budget burning at 3× (6h and 1h windows) | Ticket |

Figure: Multi-window burn-rate alert gating. A page fires only when both the long-window (1h) and short-window (5m) burn rates exceed the threshold; the long window confirms the burn is sustained and the short window confirms it is still happening now, which also lets the page resolve fast.

Operator checklist

  • Confirm ratelimit_requests_total, ratelimit_store_errors_total, and ratelimit_fail_open_total are exported and scraped
  • Write the 429-spike rule split by key class (the tier label)
  • Set for: on every alert so a brief blip cannot fire it
  • Route by severity label to pager vs Slack; include route, tier

Step 1 — Recording rules

Recording rules pre-compute the ratios so alert expressions stay readable and cheap. Evaluate them at the scrape interval.

groups:
  - name: ratelimit_recording
    interval: 15s
    rules:
      # Fleet block ratio, and the same split by key class.
      - record: ratelimit:block_ratio:5m
        expr: |
          sum(rate(ratelimit_requests_total{decision="blocked"}[5m]))
          / sum(rate(ratelimit_requests_total[5m]))
      - record: ratelimit:block_ratio_by_tier:5m
        expr: |
          sum by (tier) (rate(ratelimit_requests_total{decision="blocked"}[5m]))
          / sum by (tier) (rate(ratelimit_requests_total[5m]))
      # SLO bad-event ratio: fail-open + store errors over total. One recording per burn-rate window (5m, 1h, 6h).
      - record: ratelimit:slo_error_ratio:5m
        expr: |
          (sum(rate(ratelimit_fail_open_total[5m])) + sum(rate(ratelimit_store_errors_total[5m])))
          / sum(rate(ratelimit_requests_total[5m]))
      - record: ratelimit:slo_error_ratio:1h
        expr: |
          (sum(rate(ratelimit_fail_open_total[1h])) + sum(rate(ratelimit_store_errors_total[1h])))
          / sum(rate(ratelimit_requests_total[1h]))
      - record: ratelimit:slo_error_ratio:6h
        expr: |
          (sum(rate(ratelimit_fail_open_total[6h])) + sum(rate(ratelimit_store_errors_total[6h])))
          / sum(rate(ratelimit_requests_total[6h]))

Step 2 — Fail-open detection (the rule that must never be missing)

Fail-open is invisible except through its counter. Page on a sustained non-zero fail-open rate, corroborated by store errors so a single odd increment does not wake anyone.

  - name: ratelimit_failopen
    rules:
      - alert: RateLimitFailOpen
        expr: |
          sum(rate(ratelimit_fail_open_total[5m])) > 0
          and
          sum(rate(ratelimit_store_errors_total[5m])) > 0
        for: 2m
        labels: { severity: page, component: rate-limiter }
        annotations:
          summary: "Rate limiter is failing open — no limiting in effect"
          description: "fail-open rate {{ $value | humanize }}/s with store errors present. The backend is unprotected; investigate the store immediately."

Step 3 — 429 spike, split by who is blocked

The whole point of the Alerting & SLOs decision table is that a free-tier spike is fine and a paid-tier spike is not. Encode that split so you never page on healthy shedding.

  - name: ratelimit_429
    rules:
      # Paid tiers blocked -> page. Legitimate paying users are being rejected.
      - alert: RateLimit429SpikePaid
        expr: |
          ratelimit:block_ratio_by_tier:5m{tier=~"pro|enterprise"} > 0.05
        for: 5m
        labels: { severity: page, component: rate-limiter }
        annotations:
          summary: "Elevated 429s on paid tier {{ $labels.tier }}"
          description: "Block ratio {{ $value | humanizePercentage }} on {{ $labels.tier }}. Likely a too-tight limit or a bad config change."

      # Free/anonymous spike -> ticket only. Usually abuse being shed correctly.
      - alert: RateLimit429SpikeFree
        expr: |
          ratelimit:block_ratio_by_tier:5m{tier=~"anonymous|free"} > 0.5
        for: 15m
        labels: { severity: ticket, component: rate-limiter }
        annotations:
          summary: "Sustained heavy shedding on {{ $labels.tier }}"
          description: "Block ratio {{ $value | humanizePercentage }} on {{ $labels.tier }} — likely abuse. Confirm it is not a misclassified legitimate client."

Step 4 — Multi-window multi-burn-rate budget alerts

This is the rule set that protects the error budget without false positives. Each severity combines a long window (is the burn real and sustained?) with a short window (is it still happening now?). Both must exceed the burn-rate threshold, expressed as threshold = burn_rate × (1 − SLO) = burn_rate × 0.0005.
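Plugging the two burn rates used in this guide into that formula gives the literal thresholds, the time to exhaust the 30-day budget, and the share of budget already spent if the burn holds for the whole long window. This is plain arithmetic on the numbers above, not additional tuning advice:

```latex
\begin{aligned}
\text{fast threshold} &= 14.4 \times 0.0005 = 0.0072 \quad (0.72\%\ \text{of decisions}) \\
\text{slow threshold} &= 3 \times 0.0005 = 0.0015 \quad (0.15\%\ \text{of decisions}) \\
\text{time to exhaustion} &= \frac{30\ \text{d}}{\text{burn rate}}: \quad \frac{30}{14.4} \approx 2.08\ \text{d}, \qquad \frac{30}{3} = 10\ \text{d} \\
\text{budget spent in long window} &= \frac{\text{burn rate} \times \text{window}}{720\ \text{h}}: \quad \frac{14.4 \times 1}{720} = 2\%, \qquad \frac{3 \times 6}{720} = 2.5\%
\end{aligned}
```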

  - name: ratelimit_burnrate
    rules:
      # Fast burn: 14.4x. At this rate the 30-day budget is gone in ~2 days. Page.
      - alert: RateLimitBudgetBurnFast
        expr: |
          ratelimit:slo_error_ratio:1h > (14.4 * 0.0005)
          and
          ratelimit:slo_error_ratio:5m > (14.4 * 0.0005)
        for: 2m
        labels: { severity: page, component: rate-limiter }
        annotations:
          summary: "Rate limiter error budget burning fast (14.4x)"
          description: "1h SLO error ratio is {{ $value | humanizePercentage }}, and the 5m ratio is also above threshold. Budget exhausts in ~2 days at this rate."

      # Slow burn: 3x. Budget gone in ~10 days. Ticket, not a page.
      - alert: RateLimitBudgetBurnSlow
        expr: |
          ratelimit:slo_error_ratio:6h > (3 * 0.0005)
          and
          ratelimit:slo_error_ratio:1h > (3 * 0.0005)
        for: 15m
        labels: { severity: ticket, component: rate-limiter }
        annotations:
          summary: "Rate limiter error budget burning slowly (3x)"
          description: "6h SLO error ratio is {{ $value | humanizePercentage }}, and the 1h ratio is also above threshold. Investigate during business hours."

The short window in each rule is what lets the alert resolve promptly: once you fix the store, the short-window ratio drops below threshold (within minutes for the 5 m window of the fast rule, within about an hour for the 1 h window of the slow rule) and the alert clears, even though the long window is still elevated.

Step 5 — Store-error rate

A standalone store-error rule catches degradation before it becomes fail-open, giving you a head start.

  - name: ratelimit_store
    rules:
      - alert: RateLimitStoreErrorBurn
        expr: |
          sum(rate(ratelimit_store_errors_total[5m]))
          / sum(rate(ratelimit_requests_total[5m])) > 0.01
        for: 5m
        labels: { severity: page, component: rate-limiter }
        annotations:
          summary: "Rate limiter store error rate above 1%"
          description: "Store-error ratio {{ $value | humanizePercentage }}. The limiter is degrading and may fail open."

Step 6 — Routing and severity

Route on the severity label in Alertmanager: page to the pager, ticket to Slack/issue tracker. Group by component so a store incident does not fan out into five simultaneous pages.

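route:
  # The root route must name a default receiver; using the Slack one here is an assumption.
  receiver: slack-rate-limiting
  group_by: ["component"]
  routes: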
    - matchers: [severity="page"]
      receiver: pagerduty
      group_wait: 30s
    - matchers: [severity="ticket"]
      receiver: slack-rate-limiting
      group_wait: 5m
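
The route refers to two receivers by name. A minimal sketch of matching receiver definitions follows; the integration key, webhook URL, and channel name are placeholders and assumptions, not values from this guide.

```yaml
receivers:
  - name: pagerduty
    pagerduty_configs:
      # Placeholder: Events API v2 integration key for the on-call service.
      - routing_key: "<pagerduty-integration-key>"
  - name: slack-rate-limiting
    slack_configs:
      # Placeholders: the incoming-webhook URL and channel are assumptions.
      - api_url: "https://hooks.slack.com/services/<webhook-path>"
        channel: "#rate-limiting-alerts"
```

Running amtool check-config against the assembled file (for example amtool check-config alertmanager.yml, where the file name is an assumption) catches a route that names a receiver that is not defined.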

Verification & testing

Validate the rule syntax and logic, then prove it in staging.

# 1. Lint the rules.
promtool check rules ratelimit-alerts.yml

# 2. Unit-test the burn-rate logic with synthetic series.
promtool test rules ratelimit-alerts.test.yml

# 3. In staging, force a store outage (block Redis) and confirm THIS goes non-zero:
sum(rate(ratelimit_fail_open_total[5m]))
# RateLimitFailOpen should fire within ~2m, and RateLimit429SpikeFree should NOT.

The test that matters most: trigger a free-tier flood and confirm no page fires, then trigger a store outage and confirm RateLimitFailOpen fires. An alerting setup that pages on healthy shedding is worse than none, because it trains responders to ignore the limiter.
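
Both halves of that test can be approximated offline with promtool before touching staging. The file below is a sketch of ratelimit-alerts.test.yml under stated assumptions: the decision="allowed" value, the instance label, and all traffic numbers are invented for the test, and the rule groups from this guide are assumed to live together in ratelimit-alerts.yml. To avoid reproducing rendered annotations in exp_alerts, the "must not fire" checks use alert_rule_test with no expected alerts, and the "must fire" checks query the ALERTS series instead.

```yaml
# ratelimit-alerts.test.yml: sketch only; series labels and rates are assumptions.
rule_files:
  - ratelimit-alerts.yml

evaluation_interval: 15s

tests:
  # Case 1: free-tier flood. 60% of free traffic is blocked, pro sits at 1%.
  - interval: 15s
    input_series:
      - series: 'ratelimit_requests_total{tier="free", decision="blocked"}'
        values: '0+9000x80'    # 600 blocked/s for 20m
      - series: 'ratelimit_requests_total{tier="free", decision="allowed"}'
        values: '0+6000x80'    # 400 allowed/s
      - series: 'ratelimit_requests_total{tier="pro", decision="blocked"}'
        values: '0+15x80'      # 1 blocked/s
      - series: 'ratelimit_requests_total{tier="pro", decision="allowed"}'
        values: '0+1485x80'    # 99 allowed/s
    alert_rule_test:
      # Omitting exp_alerts asserts that the alert is NOT firing.
      - eval_time: 20m
        alertname: RateLimit429SpikePaid
      - eval_time: 20m
        alertname: RateLimitFailOpen
    promql_expr_test:
      # The ticket-severity alert should be firing by now (for: 15m).
      - expr: 'count by (alertname, alertstate) (ALERTS{alertname="RateLimit429SpikeFree"})'
        eval_time: 20m
        exp_samples:
          - labels: '{alertname="RateLimit429SpikeFree", alertstate="firing"}'
            value: 1

  # Case 2: store outage. Fail-open and store-error counters both climb.
  - interval: 15s
    input_series:
      - series: 'ratelimit_fail_open_total{instance="limiter-0"}'
        values: '0+150x40'     # 10 fail-open admits/s for 10m
      - series: 'ratelimit_store_errors_total{instance="limiter-0"}'
        values: '0+150x40'     # 10 store errors/s for 10m
    promql_expr_test:
      - expr: 'count by (alertname, alertstate) (ALERTS{alertname="RateLimitFailOpen"})'
        eval_time: 10m
        exp_samples:
          - labels: '{alertname="RateLimitFailOpen", alertstate="firing"}'
            value: 1
```

A synthetic test like this only checks the rule logic. It does not replace the staging outage drill, which is what proves the limiter actually increments the counters.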

Gotchas & edge cases

  • Never page on raw 429 rate. Always split by key class (the tier label in these rules). A fleet-wide 429 threshold pages on every scraper and gets muted within a week.
  • The fail-open rule needs the fail-open counter. If the limiter does not increment ratelimit_fail_open_total on the store-error path, this entire guide is blind. Verify that instrumentation first.
  • Burn-rate thresholds scale with your SLO. The 0.0005 factor is 1 − 0.9995. Change the SLO, change every threshold, or the burn rates silently mean something else.
  • Keep for: short relative to the short window. A for: 10m on a 5 m short window can delay a real page past the point of usefulness. Keep for: at 2 m for page-severity burn alerts.
  • Recording rules avoid re-computation drift. If two alerts compute the block ratio inline with slightly different expressions, they will disagree. Compute once in a recording rule.
  • Mind counter resets at deploy. rate() handles resets, but a rolling deploy can briefly inflate fail-open if the limiter starts before the store connection is ready. A 2 m for: absorbs this.
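
The instrumentation gotcha above can itself be watched. A hedged sketch: the rule below (the group and alert names are hypothetical) raises a ticket when no ratelimit_fail_open_total series exists at all. It assumes the limiter exports the counter at zero from startup; a client library that only exposes a labelled counter after its first increment would make this rule fire on a healthy limiter.

```yaml
  - name: ratelimit_meta                        # hypothetical group name
    rules:
      - alert: RateLimitFailOpenMetricAbsent    # hypothetical alert name
        # absent() returns 1 only when no series with this name exists.
        expr: absent(ratelimit_fail_open_total)
        for: 10m
        labels: { severity: ticket, component: rate-limiter }
        annotations:
          summary: "ratelimit_fail_open_total is missing; fail-open alerting is blind"
```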

Frequently Asked Questions

Why not just alert when the 429 rate crosses a threshold?

Because a high 429 rate is often the limiter working correctly: shedding abuse on the free tier. A flat threshold pages on every scraper and trains responders to ignore the alert. Split by key class (the tier label) so paid-tier blocks page and free-tier shedding stays quiet, and keep the remaining pages for fail-open, store errors, and fast budget burn.

What burn rates and windows should I use?

The fast-burn pair is the Google SRE Workbook default: 14.4× over 1h/5m, paging. The slow-burn pair used here, 3× over 6h/1h as a ticket, is this guide's simplification; the Workbook's own slower tiers are 6× over 6h/30m (page) and 1× over 3d/6h (ticket). The threshold for each is the burn rate times (1 − SLO). Both the long and short window must exceed the threshold for the alert to fire, which suppresses blips and lets the alert resolve quickly.

How do I alert on fail-open when its symptom is the absence of 429s?

You cannot alert on an absence, so you alert on the positive ratelimit_fail_open_total counter that the limiter increments whenever it admits a request because the store was unreachable. Require a corroborating non-zero store-error rate so a single increment does not page, and set a short for: so it fires fast.

Should a 429 count against the limiter's SLO?

No — not when the client genuinely exceeded its quota. Returning 429 is the limiter's job. The SLO bad events are fail-open, fail-closed-while-under-quota, and over-latency decisions. That is why the burn-rate rules query fail-open and store errors, not the 429 rate.