Tiered Access & Quota Enforcement
Selling an API in plans — free, pro, enterprise — means every request must be measured against the limit that belongs to that caller’s tier, not a single global number, and this is the throttling problem that the Backend Middleware & Distributed Tracking pillar makes concrete. A free key might get 10 requests/second and 50,000 calls/month; a pro key 100 rps and 5,000,000/month; an enterprise key a negotiated 1,000 rps with a soft monthly cap. The enforcement layer has to resolve the caller to a tier on every request, apply two independent limits (a short-window rate and a long-window quota), and answer with the right status code and headers so clients and billing systems both behave correctly. Get the mapping wrong and you either throttle a paying customer at the free-tier rate or hand free users the enterprise reservoir.
This area sits one layer above the raw limiter algorithms. A token bucket or sliding log decides whether a single bucket has capacity; tiered enforcement decides which bucket a request belongs to, how big that bucket is, and what to say when it is empty. The hard parts are all in that indirection: key→tier resolution under cache invalidation, two-layer limits with different windows and different failure semantics, and hot-reloading plan config when a customer upgrades mid-month.
The two-layer model: rate vs quota
Tiered systems almost always enforce two limits at once, because they protect different things:
- Rate limit (short window): protects your infrastructure from bursts. Measured in requests per second or per minute, reset continuously. A token bucket or fixed-window counter is the right tool. Exceeding it returns `429 Too Many Requests` with `Retry-After`; the client should back off and retry.
- Quota (long window): protects your revenue model. Measured in requests per month (or day), reset on a billing boundary. Exceeding it is not a transient condition the client can retry past; it means the plan's allotment is spent. The correct response is `402 Payment Required` (or a `403` with a machine-readable `quota_exceeded` reason); the client must upgrade or wait for the reset, not retry.
A request must pass both checks. The order matters: check the cheap, high-frequency rate limit first (it rejects floods before they touch the quota counter), then the quota. Conflating the two — treating a spent monthly quota as a 429 — trains clients to hammer your API with retries against a wall they can never get past.
| Property | Rate limit | Monthly quota |
|---|---|---|
| Window | 1 s – 1 min, rolling/fixed | Calendar or billing month |
| Protects | Backend capacity, fairness | Revenue, plan boundaries |
| Algorithm | Token bucket / fixed window | Atomic counter with billing-boundary reset |
| Over-limit status | 429 Too Many Requests | 402 Payment Required / 403 quota_exceeded |
| Client action | Back off, retry after Retry-After | Stop; upgrade plan or wait for reset |
| Reset signal | Continuous / window epoch | X-Quota-Reset (billing boundary) |
| Counter precision | Approximate is fine | Exact if it drives billing |
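The ordering and status mapping described above can be sketched without any storage at all. The following is a minimal in-memory model, not the Redis implementation: the names `CallerState`, `Limits`, and `check` are hypothetical, and a single process with one state object per caller is assumed.

```typescript
// Hypothetical in-memory model of the two-layer check (illustration only).
type Decision = { status: 200 | 429 | 402; reason?: "rate_limited" | "quota_exceeded" };

interface CallerState {
  tokens: number;        // tokens currently in the rate bucket
  lastRefillMs: number;  // timestamp of the last refill
  quotaUsed: number;     // requests counted in the current billing period
}

interface Limits {
  rate: number;          // tokens added per second
  burst: number;         // bucket capacity
  quota: number | null;  // null = uncapped
}

function check(state: CallerState, limits: Limits, nowMs: number): Decision {
  // 1) Rate axis first: refill by elapsed time, capped at burst capacity.
  const elapsedSec = Math.max(0, nowMs - state.lastRefillMs) / 1000;
  state.tokens = Math.min(limits.burst, state.tokens + elapsedSec * limits.rate);
  state.lastRefillMs = nowMs;
  if (state.tokens < 1) return { status: 429, reason: "rate_limited" };

  // 2) Quota axis: only reached when the rate check passes.
  if (limits.quota !== null && state.quotaUsed >= limits.quota) {
    return { status: 402, reason: "quota_exceeded" };
  }

  // Both passed: consume one token and one unit of quota.
  state.tokens -= 1;
  state.quotaUsed += 1;
  return { status: 200 };
}
```

A flood is rejected with 429 before it ever reaches the quota branch, and a spent quota produces 402 even when the bucket has tokens to spare.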
Mechanism: key → tier → limit
The state the enforcement layer manipulates per request:
- `identity`: the API key, account id, or token presented by the caller (the cardinality driver).
- `tier`: the resolved plan, one of `free | pro | enterprise | custom:<id>`. Derived from `identity` via an account lookup, cached aggressively.
- `policy`: the tier's parameters (`rate`, `burst`, `quota`, `window`). Loaded from config, hot-reloadable.
- Two counters: a rate-limit bucket keyed by `(identity, "rate")` and a quota counter keyed by `(account, "quota", billing_period)`.
The per-request decision is O(1): one cached tier lookup, one atomic rate check, one atomic quota check. The expensive operation is the uncached tier resolution (a DB or auth-service call), which is why the resolved identity → tier → policy mapping is cached with a short TTL and invalidated on plan change.
Configuration reference: tier parameters
Plan limits belong in declarative config, not scattered through code, so they can be reviewed, versioned, and hot-reloaded. A typical per-tier policy:
| Parameter | Type | Example (free / pro / ent) | Effect |
|---|---|---|---|
| `rate` | int (req/s) | 10 / 100 / 1000 | Sustained per-second refill rate |
| `burst` | int (tokens) | 20 / 300 / 2000 | Bucket capacity; momentary burst above rate |
| `burst_multiplier` | float | 2.0 / 3.0 / 2.0 | Convenience: burst = rate × multiplier |
| `quota` | int (req/month) | 50000 / 5000000 / null | Monthly allotment; null = uncapped (soft) |
| `quota_window` | enum | `calendar_month` | When the quota counter resets |
| `on_quota_exceeded` | enum | `block` / `bill_overage` | Hard 402 vs. meter overage and keep serving |
| `concurrency` | int | 5 / 50 / 500 | Max in-flight requests (optional third axis) |
| `priority` | int | 0 / 5 / 10 | Shed lowest-priority tiers first under pressure |
```yaml
# tiers.yaml: single source of truth, hot-reloaded on change
tiers:
  free:
    rate: 10                    # tokens/sec
    burst_multiplier: 2         # capacity = 20
    quota: 50000                # requests/month
    quota_window: calendar_month
    on_quota_exceeded: block    # -> 402
  pro:
    rate: 100
    burst_multiplier: 3         # capacity = 300
    quota: 5000000
    quota_window: calendar_month
    on_quota_exceeded: block
  enterprise:
    rate: 1000
    burst_multiplier: 2         # capacity = 2000
    quota: null                 # uncapped; soft alerting only
    quota_window: calendar_month
    on_quota_exceeded: bill_overage
```
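Because the file gives `burst_multiplier` while the limiter needs a concrete bucket capacity, some normalization step has to sit between the parsed config and the policy table. One possible shape is sketched below; `RawTier`, `Policy`, and `toPolicy` are hypothetical names, the input is assumed to be an already-parsed YAML entry, and the "explicit `burst` wins over the multiplier" rule and the default multiplier of 1 are assumptions rather than something the config format mandates.

```typescript
// Hypothetical shape of one parsed entry from tiers.yaml.
interface RawTier {
  rate: number;
  burst?: number;
  burst_multiplier?: number;
  quota: number | null;
  on_quota_exceeded: "block" | "bill_overage";
}

interface Policy {
  rate: number;
  burst: number;
  quota: number | null;
  onQuotaExceeded: "block" | "bill_overage";
}

function toPolicy(name: string, raw: RawTier): Policy {
  if (!Number.isFinite(raw.rate) || raw.rate <= 0) {
    throw new Error(`${name}: rate must be a positive number`);
  }
  // Assumption: an explicit burst wins; otherwise derive it as rate x multiplier.
  const burst = raw.burst ?? Math.floor(raw.rate * (raw.burst_multiplier ?? 1));
  if (!Number.isFinite(burst) || burst < 1) {
    throw new Error(`${name}: burst must be at least 1`);
  }
  if (raw.quota !== null && (!Number.isInteger(raw.quota) || raw.quota < 0)) {
    throw new Error(`${name}: quota must be a non-negative integer or null`);
  }
  return { rate: raw.rate, burst, quota: raw.quota, onQuotaExceeded: raw.on_quota_exceeded };
}
```

Rejecting a bad entry at load time keeps a typo in the file from turning into a zero-capacity bucket or a silently uncapped tier at request time.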
Implementation walkthrough (Redis)
Both limits are enforced atomically in Redis so the decision survives across stateless nodes — the same authoritative-store reasoning behind the Redis counter architecture. The walkthrough below resolves the tier, then runs a single Lua script that checks the rate bucket and the quota counter together, returning a decision plus the values needed for headers.
```typescript
// Tiered enforcement: resolve tier (cached) -> atomic rate + quota check in Redis.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL!);

type Policy = { rate: number; burst: number; quota: number | null };

const POLICIES: Record<string, Policy> = {
  free:       { rate: 10,   burst: 20,   quota: 50_000 },
  pro:        { rate: 100,  burst: 300,  quota: 5_000_000 },
  enterprise: { rate: 1000, burst: 2000, quota: null },
};

// Assumed to be implemented elsewhere: the DB / auth-service lookup for a key.
declare function lookupAccount(apiKey: string): Promise<{ account: string; tier: string }>;

// Short-TTL cache of key -> {account, tier}. Invalidated on plan change (see hot-reload).
const tierCache = new Map<string, { account: string; tier: string; exp: number }>();

async function resolveTier(apiKey: string) {
  const hit = tierCache.get(apiKey);
  if (hit && hit.exp > Date.now()) return hit;
  const row = await lookupAccount(apiKey); // DB / auth-service call
  const entry = { account: row.account, tier: row.tier, exp: Date.now() + 30_000 };
  tierCache.set(apiKey, entry);
  return entry;
}

// Atomic: refill+consume rate bucket, then INCR quota with billing-month TTL.
// Returns { decision, rate_remaining, quota_remaining, reset }.
const ENFORCE_LUA = `
local rkey, qkey = KEYS[1], KEYS[2]
local cap   = tonumber(ARGV[1])  -- burst capacity
local rate  = tonumber(ARGV[2])  -- tokens/sec
local quota = tonumber(ARGV[3])  -- -1 means uncapped
local nowms = tonumber(ARGV[4])
local qttl  = tonumber(ARGV[5])  -- seconds until billing reset

-- 1) rate bucket (token bucket)
local b = redis.call('HMGET', rkey, 'tokens', 'ts')
local tokens = tonumber(b[1]) or cap
local ts = tonumber(b[2]) or nowms
tokens = math.min(cap, tokens + (nowms - ts) / 1000 * rate)
if tokens < 1 then
  redis.call('HSET', rkey, 'tokens', tokens, 'ts', nowms)
  redis.call('PEXPIRE', rkey, math.ceil(cap / rate * 1000) + 1000)
  return { 'RATE', math.floor(tokens), -1, 0 }  -- 429
end

-- 2) quota counter (only consumed once rate passes)
local used = 0
if quota >= 0 then
  used = redis.call('INCR', qkey)
  if used == 1 then redis.call('EXPIRE', qkey, qttl) end
  if used > quota then return { 'QUOTA', math.floor(tokens), 0, qttl } end  -- 402
end

-- Both checks passed: spend one token and persist the bucket.
tokens = tokens - 1
redis.call('HSET', rkey, 'tokens', tokens, 'ts', nowms)
redis.call('PEXPIRE', rkey, math.ceil(cap / rate * 1000) + 1000)
local q_remaining = quota >= 0 and (quota - used) or -1
return { 'OK', math.floor(tokens), q_remaining, qttl }`;

// Seconds until the first instant of the next UTC calendar month.
function secondsToBillingReset(): number {
  const now = new Date();
  const next = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth() + 1, 1));
  return Math.ceil((next.getTime() - now.getTime()) / 1000);
}

export async function enforce(apiKey: string) {
  const { account, tier } = await resolveTier(apiKey);
  const p = POLICIES[tier] ?? POLICIES.free; // unknown tier -> safest limit
  const period = new Date().toISOString().slice(0, 7); // "2026-06"
  const [decision, rRem, qRem, reset] = (await redis.eval(
    ENFORCE_LUA, 2,
    `rl:rate:${apiKey}`, `rl:quota:${account}:${period}`,
    p.burst, p.rate, p.quota ?? -1, Date.now(), secondsToBillingReset(),
  )) as [string, number, number, number];
  return { decision, tier, rateRemaining: rRem, quotaRemaining: qRem, reset };
}
```
The HTTP layer maps the decision to a status and headers:
```typescript
// Inside the request handler, after the API key has been extracted.
const { decision, tier, rateRemaining, quotaRemaining, reset } = await enforce(apiKey);
// Same fallback as enforce(): an unknown tier must not crash the header lookup.
const policy = POLICIES[tier] ?? POLICIES.free;
res.setHeader("RateLimit-Limit", policy.rate);
res.setHeader("RateLimit-Remaining", Math.max(0, rateRemaining));
if (quotaRemaining >= 0) {
  res.setHeader("X-Quota-Remaining", quotaRemaining);
  res.setHeader("X-Quota-Reset", new Date(Date.now() + reset * 1000).toUTCString());
}
if (decision === "RATE") { res.setHeader("Retry-After", 1); return res.status(429).json({ error: "rate_limited" }); }
if (decision === "QUOTA") { return res.status(402).json({ error: "quota_exceeded", reset }); }
// 200: proceed
```
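The handler sends a fixed `Retry-After` of 1 second, which is accurate for the example tiers but understates the wait for any policy that refills slower than one token per second. A more precise value can be derived from the bucket state: the time until the bucket holds one whole token is `(1 - tokens) / rate`. The helper below is a sketch with a hypothetical name, and it assumes the fractional token count is available to the caller, which would require the script to return the unrounded value rather than `math.floor(tokens)`.

```typescript
// Seconds a client should wait until the bucket holds at least one token.
// `tokens` is the fractional bucket level at rejection time; `rate` is tokens/sec.
function retryAfterSeconds(tokens: number, rate: number): number {
  if (!(rate > 0)) throw new RangeError("rate must be positive");
  const deficit = Math.max(0, 1 - tokens);
  // Retry-After is expressed in whole seconds; never advertise less than 1.
  return Math.max(1, Math.ceil(deficit / rate));
}
```

With a hypothetical 0.5 tokens/second policy and an empty bucket this yields 2 seconds, while every tier in the example config still rounds up to 1.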
Response contract
| Scenario | Status | Key headers | Body error |
|---|---|---|---|
| Allowed | 200 | RateLimit-Limit, RateLimit-Remaining, X-Quota-Remaining | none |
| Rate exceeded | 429 | RateLimit-Remaining: 0, Retry-After | rate_limited |
| Quota exhausted (hard) | 402 | X-Quota-Remaining: 0, X-Quota-Reset | quota_exceeded |
| Quota exhausted (overage) | 200 | X-Quota-Remaining: 0, X-Quota-Overage: <n> | none |
| Unknown / revoked key | 401/403 | none | invalid_key |
Emit the standard RateLimit-* (and legacy X-RateLimit-* if clients still read them) for the rate axis, and a separate X-Quota-* family for the long-window quota, so a client can tell which limit it hit. The draft RateLimit header semantics are covered under the response-headers area.
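The two "quota exhausted" rows differ only in the tier's `on_quota_exceeded` setting. A sketch of that branch is shown below as a pure function; `quotaOutcome` and its parameter names are hypothetical, and `used` is assumed to be the counter value after the current request has been counted.

```typescript
type QuotaMode = "block" | "bill_overage";

interface QuotaOutcome {
  status: 200 | 402;
  headers: Record<string, string>;
}

// `used` is the quota counter after counting this request; `quota` is the allotment.
function quotaOutcome(used: number, quota: number, mode: QuotaMode, resetAt: Date): QuotaOutcome {
  const headers: Record<string, string> = {
    "X-Quota-Remaining": String(Math.max(0, quota - used)),
    "X-Quota-Reset": resetAt.toUTCString(),
  };
  if (used <= quota) return { status: 200, headers };
  if (mode === "block") return { status: 402, headers };
  // Overage mode: keep serving and report how far past the allotment the account is.
  return { status: 200, headers: { ...headers, "X-Quota-Overage": String(used - quota) } };
}
```

Keeping this mapping in one function makes it straightforward to test the contract table row by row without a running Redis.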
Distributed & scaling considerations
- Cache the tier, not the decision. The `key→tier` mapping changes rarely (only on plan change); cache it for 15–60 s with active invalidation. Never cache the limit decision; that is what Redis is for.
- Keep the quota counter on a billing-boundary TTL. Keying by `account:YYYY-MM` and setting `EXPIRE` to seconds-until-month-end makes the reset automatic and avoids a cron sweep.
- Co-locate the two keys. Use a single Lua call so both checks are one round-trip, not two. On Redis Cluster a multi-key script only runs if all its keys map to the same hash slot, so give `rl:rate:*` and `rl:quota:*` a shared hash tag.
- Tier skew creates hot keys. A few enterprise accounts can dominate Redis traffic; shard high-rate accounts or front them with a local pre-filter.
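One way to co-locate the two keys is a shared hash tag: when a key contains a non-empty `{...}` section, Redis Cluster hashes only the text inside the first such pair. The sketch below builds both counter keys around the account id and includes a small helper that extracts the hashed portion, so the co-location can be checked in a unit test. The key layout and the names `counterKeys` and `hashTag` are illustrative assumptions and differ from the `rl:rate:${apiKey}` layout used in the walkthrough; account ids are assumed not to contain braces.

```typescript
// Returns the part of the key that Redis Cluster feeds into its slot hash.
function hashTag(key: string): string {
  const open = key.indexOf("{");
  if (open === -1) return key;
  const close = key.indexOf("}", open + 1);
  // A missing "}" or an empty tag ("{}") means the whole key is hashed.
  if (close === -1 || close === open + 1) return key;
  return key.slice(open + 1, close);
}

// Hypothetical key layout: both counters share the account id as hash tag.
function counterKeys(account: string, apiKey: string, now: Date) {
  const period = now.toISOString().slice(0, 7); // UTC "YYYY-MM"
  return {
    rateKey: `rl:{${account}}:rate:${apiKey}`,
    quotaKey: `rl:{${account}}:quota:${period}`,
  };
}
```

Because the billing period is part of the quota key, the key name changes at the month boundary and the old counter simply expires. The trade-off of tagging by account is that all of one account's keys land on a single shard, which sharpens the hot-key concern for very large accounts.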
Failure modes & mitigations
- Stale tier after an upgrade. A customer upgrades but the cached `key→tier` entry still says `free`, so they stay throttled for the TTL. Mitigation: publish an invalidation event (pub/sub) on plan change to evict the key immediately; keep TTL short as a backstop.
- Quota double-count on retries. A client retrying a 5xx can `INCR` the quota twice for one logical call. Mitigation: idempotency keys, covered in billing-critical sliding-log usage.
- Redis outage. Fail-open keeps customers served but lets quota slip; fail-closed protects revenue but returns 402/429 during your incident. Decide per axis: usually fail-open on rate, fail-closed (or queue-and-reconcile) on billing-critical quota.
- Unknown tier defaults to enterprise. A config typo or missing tier silently grants the largest reservoir. Always default an unrecognized tier to the smallest policy.
- Clock skew on the rate bucket. Pass Redis server time (`redis.call('TIME')`) rather than per-node clocks; see distributed algorithm sync.
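The per-axis outage policy can be made explicit in code rather than left to whatever the error path happens to do. The sketch below is one way to express it; `settle`, `ON_STORE_ERROR`, and the `degraded` flag are hypothetical names, and the caller is assumed to hand in either the store's verdict or the error it caught.

```typescript
type Axis = "rate" | "quota";
type Verdict = "allow" | "deny";

// What each axis does when the store cannot answer:
// fail-open on rate, fail-closed on billing-critical quota.
const ON_STORE_ERROR: Record<Axis, Verdict> = { rate: "allow", quota: "deny" };

interface Settled {
  verdict: Verdict;
  degraded: boolean; // true when the verdict came from the fallback policy
}

function settle(axis: Axis, outcome: Verdict | Error): Settled {
  if (outcome instanceof Error) {
    return { verdict: ON_STORE_ERROR[axis], degraded: true };
  }
  return { verdict: outcome, degraded: false };
}
```

Surfacing `degraded` lets the HTTP layer log or meter fallback decisions separately, which helps with later reconciliation of quota that slipped through during an incident.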
Hot-reloading tier changes
Two distinct events must propagate fast: a plan change for one account (invalidate that key’s cached tier) and a policy change for a whole tier (reload tiers.yaml). Watch the config source (a file, a config service, or a Redis key) and swap the in-memory policy table atomically on change; broadcast per-account invalidations over pub/sub so every node drops the stale key→tier entry within milliseconds.
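```typescript
// A connection in subscriber mode cannot run regular commands, so use a dedicated one.
const sub = redis.duplicate();
sub.subscribe("tier:invalidate");
sub.on("message", (_ch, apiKey) => tierCache.delete(apiKey)); // evict on upgrade
// On tiers.yaml change: parse, validate, then atomically replace POLICIES.
```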
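The "validate, then atomically replace" step for a whole-tier policy change might look like the sketch below. It assumes the new table has already been parsed into plain objects, and the names `reload`, `policyFor`, and `isValid` are hypothetical. The swap is a single reference assignment, so a request sees either the old table or the new one and never a partially updated mix; an invalid table is rejected and the previous one stays in service.

```typescript
interface Policy { rate: number; burst: number; quota: number | null }
type PolicyTable = Readonly<Record<string, Policy>>;

// Table currently in service; replaced wholesale, never mutated in place.
let current: PolicyTable = { free: { rate: 10, burst: 20, quota: 50_000 } };

function isValid(table: Record<string, Policy>): boolean {
  // Require a "free" tier so there is always a smallest policy to fall back to.
  return "free" in table && Object.values(table).every(
    (p) => p.rate > 0 && p.burst >= 1 && (p.quota === null || p.quota >= 0),
  );
}

// Returns false and keeps the previous table if the new one fails validation.
function reload(next: Record<string, Policy>): boolean {
  if (!isValid(next)) return false;
  current = { ...next };
  return true;
}

function policyFor(tier: string): Policy {
  const p = current[tier] ?? current["free"];
  if (!p) throw new Error("policy table has no free tier");
  return p;
}
```

Reading through `policyFor` on every request, rather than capturing a policy object at startup, is what lets a reload take effect without a restart.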
In this section
- Per-Tier Quota Enforcement With Redis: the step-by-step build. Resolve key→tier, atomic Lua token bucket keyed by `(account, tier)`, emit headers, verify under load.
- API Key Scoping & Rate Limits: scoping limits by key, scope/permission, and route; hierarchical keys and key-class buckets to control cardinality.
- Billing-Critical Sliding-Log Usage: using an exact sliding log for metered usage, with idempotency, audit trails, and reconciliation against the billing system.
Related
- Backend Middleware & Distributed Tracking — the parent topic on distributed throttling state.
- Redis Counter Architecture — how the authoritative counters are built and scaled.
- Sliding Log Counters — the exact algorithm behind billing-grade usage metering.
- Fixed Window vs Sliding Window — window choice for the rate axis.