Tiered Access & Quota Enforcement

Selling an API in plans (free, pro, enterprise) means every request must be measured against the limit that belongs to the caller’s tier, not against a single global number. This is the throttling problem that the Backend Middleware & Distributed Tracking pillar makes concrete. A free key might get 10 requests/second and 50,000 calls/month; a pro key 100 requests/second and 5,000,000/month; an enterprise key a negotiated 1,000 requests/second with a soft monthly cap. The enforcement layer has to resolve the caller to a tier on every request, apply two independent limits (a short-window rate and a long-window quota), and answer with the right status code and headers so that clients and billing systems both behave correctly. Get the mapping wrong and you either throttle a paying customer at the free-tier rate or hand free users the enterprise reservoir.

This area sits one layer above the raw limiter algorithms. A token bucket or sliding log decides whether a single bucket has capacity; tiered enforcement decides which bucket a request belongs to, how big that bucket is, and what to say when it is empty. The hard parts are all in that indirection: key→tier resolution under cache invalidation, two-layer limits with different windows and different failure semantics, and hot-reloading plan config when a customer upgrades mid-month.

The two-layer model: rate vs quota

Tiered systems almost always enforce two limits at once, because they protect different things:

Rate limit (short window): protects your infrastructure from bursts. Measured in requests per second or per minute, reset continuously. A token bucket or fixed-window counter is the right tool. Exceeding it returns 429 Too Many Requests with Retry-After — the client should back off and retry.
Quota (long window): protects your revenue model. Measured in requests per month (or day), reset on a billing boundary. Exceeding it is not a transient condition the client can retry past; it means the plan’s allotment is spent. The correct response is 402 Payment Required (or a 403 with a machine-readable quota_exceeded reason) — the client must upgrade or wait for the reset, not retry.

A request must pass both checks. The order matters: check the cheap, high-frequency rate limit first (it rejects floods before they touch the quota counter), then the quota. Conflating the two — treating a spent monthly quota as a 429 — trains clients to hammer your API with retries against a wall they can never get past.

| Property | Rate limit | Monthly quota |
|---|---|---|
| Window | 1 s to 1 min, rolling/fixed | Calendar or billing month |
| Protects | Backend capacity, fairness | Revenue, plan boundaries |
| Algorithm | Token bucket / fixed window | Atomic counter with billing-boundary reset |
| Over-limit status | `429 Too Many Requests` | `402 Payment Required` / `403 quota_exceeded` |
| Client action | Back off, retry after `Retry-After` | Stop; upgrade plan or wait for reset |
| Reset signal | Continuous / window epoch | `X-Quota-Reset` (billing boundary) |
| Counter precision | Approximate is fine | Exact if it drives billing |
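The check order and the status mapping above can be sketched in memory, without Redis or concurrency. This is an illustration only: `checkTwoLayer`, `Limits`, `Counters`, and `Outcome` are hypothetical names, and a plain per-window counter stands in for whichever rate algorithm is actually used.

```typescript
// Illustrative only: in-memory two-layer check. All names are hypothetical.
type Limits = { ratePerWindow: number; quota: number };
type Counters = { windowCount: number; quotaUsed: number };
type Outcome =
  | { status: 200 }
  | { status: 429; retryAfterSec: number }
  | { status: 402; reason: "quota_exceeded" };

function checkTwoLayer(c: Counters, l: Limits, retryAfterSec = 1): Outcome {
  // 1) Rate first: a flood is rejected before it touches the quota counter.
  if (c.windowCount >= l.ratePerWindow) {
    return { status: 429, retryAfterSec };
  }
  // 2) Quota second: a spent allotment is not retryable, so it is not a 429.
  if (c.quotaUsed >= l.quota) {
    return { status: 402, reason: "quota_exceeded" };
  }
  // Both checks passed: consume one unit on each axis.
  c.windowCount += 1;
  c.quotaUsed += 1;
  return { status: 200 };
}

const limits: Limits = { ratePerWindow: 2, quota: 3 };
const counters: Counters = { windowCount: 0, quotaUsed: 0 };
const first = checkTwoLayer(counters, limits);   // allowed
const second = checkTwoLayer(counters, limits);  // allowed
const third = checkTwoLayer(counters, limits);   // rate window full: 429
counters.windowCount = 0;                        // simulate the next rate window
const fourth = checkTwoLayer(counters, limits);  // allowed, quota now fully used
counters.windowCount = 0;
const fifth = checkTwoLayer(counters, limits);   // quota spent: 402
```

Note that the rejected third call leaves `quotaUsed` untouched, which is the point of checking the rate limit first.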

Mechanism: key → tier → limit

The state the enforcement layer manipulates per request:

identity — the API key, account id, or token presented by the caller (the cardinality driver).
tier — the resolved plan: free | pro | enterprise | custom:<id>. Derived from identity via an account lookup, cached aggressively.
policy — the tier’s parameters: rate, burst, quota, window. Loaded from config, hot-reloadable.
Two counters — a rate-limit bucket keyed by (identity, "rate") and a quota counter keyed by (account, "quota", billing_period).

The per-request decision is O(1): one cached tier lookup, one atomic rate check, one atomic quota check. The expensive operation is the uncached tier resolution (a DB or auth-service call), which is why the resolved identity → tier → policy mapping is cached with a short TTL and invalidated on plan change.
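As a small sketch of how the resolved state turns into the two counter keys, the helpers below build a per-identity rate key and a per-account, per-period quota key. The function names are hypothetical, the key format mirrors the `rl:rate:` / `rl:quota:` prefixes used in the walkthrough, and calendar-month billing in UTC is assumed.

```typescript
// Hypothetical helpers: build the two counter keys from resolved state.
type Resolved = { identity: string; account: string; tier: string };

// Billing period as "YYYY-MM" in UTC; assumes calendar-month billing.
function billingPeriod(now: Date): string {
  return now.toISOString().slice(0, 7);
}

// The rate bucket is per identity (API key); the quota is per account and
// period, so several keys on one account draw down the same allotment.
function counterKeys(r: Resolved, now: Date): { rateKey: string; quotaKey: string } {
  return {
    rateKey: `rl:rate:${r.identity}`,
    quotaKey: `rl:quota:${r.account}:${billingPeriod(now)}`,
  };
}

const midJune = counterKeys(
  { identity: "key_A", account: "acct_1", tier: "pro" },
  new Date("2026-06-15T12:00:00Z"),
);
const endJune = counterKeys(
  { identity: "key_B", account: "acct_1", tier: "pro" },
  new Date("2026-06-30T23:59:59Z"),
);
const startJuly = counterKeys(
  { identity: "key_A", account: "acct_1", tier: "pro" },
  new Date("2026-07-01T00:00:00Z"),
);
```

Two keys on the same account share one quota key within a month, and the quota key changes at the month boundary while the rate key does not.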

Configuration reference: tier parameters

Plan limits belong in declarative config, not scattered through code, so they can be reviewed, versioned, and hot-reloaded. A typical per-tier policy:

| Parameter | Type | Example (free / pro / enterprise) | Effect |
|---|---|---|---|
| `rate` | int (req/s) | 10 / 100 / 1000 | Sustained per-second refill rate |
| `burst` | int (tokens) | 20 / 300 / 2000 | Bucket capacity; momentary burst above `rate` |
| `burst_multiplier` | float | 2.0 / 3.0 / 2.0 | Convenience: `burst = rate × multiplier` |
| `quota` | int (req/month) | 50000 / 5000000 / null | Monthly allotment; `null` = uncapped (soft) |
| `quota_window` | enum | `calendar_month` (all tiers) | When the quota counter resets |
| `on_quota_exceeded` | enum | `block` / `block` / `bill_overage` | Hard 402 vs. meter overage and keep serving |
| `concurrency` | int | 5 / 50 / 500 | Max in-flight requests (optional third axis) |
| `priority` | int | 0 / 5 / 10 | Shed lowest-priority tiers first under pressure |

# tiers.yaml — single source of truth, hot-reloaded on change
tiers:
  free:
    rate: 10            # tokens/sec
    burst_multiplier: 2 # capacity = 20
    quota: 50000        # requests/month
    quota_window: calendar_month
    on_quota_exceeded: block      # -> 402
  pro:
    rate: 100
    burst_multiplier: 3 # capacity = 300
    quota: 5000000
    quota_window: calendar_month
    on_quota_exceeded: block
  enterprise:
    rate: 1000
    burst_multiplier: 2 # capacity = 2000
    quota: null         # uncapped; soft alerting only
    quota_window: calendar_month
    on_quota_exceeded: bill_overage
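One way to turn a parsed tier entry into the concrete numbers the limiter needs is sketched below. It assumes the YAML has already been parsed into plain objects; `RawTier`, `TierPolicy`, and `toPolicy` are hypothetical names, and the precedence rule (an explicit `burst` wins over `burst_multiplier`, with a default multiplier of 1) is an assumption rather than something the config format defines.

```typescript
// Hypothetical loader step: normalize one raw tier entry into a policy.
type OverageMode = "block" | "bill_overage";
type RawTier = {
  rate: number;
  burst?: number;
  burst_multiplier?: number;
  quota: number | null;
  on_quota_exceeded: OverageMode;
};
type TierPolicy = { rate: number; burst: number; quota: number | null; onQuotaExceeded: OverageMode };

function toPolicy(tierName: string, raw: RawTier): TierPolicy {
  if (!Number.isInteger(raw.rate) || raw.rate <= 0) {
    throw new Error(`tier ${tierName}: rate must be a positive integer`);
  }
  if (raw.quota !== null && (!Number.isInteger(raw.quota) || raw.quota < 0)) {
    throw new Error(`tier ${tierName}: quota must be null or a non-negative integer`);
  }
  // Assumption: an explicit burst wins; otherwise burst = rate × multiplier.
  const burst = raw.burst ?? Math.floor(raw.rate * (raw.burst_multiplier ?? 1));
  if (burst < 1) {
    throw new Error(`tier ${tierName}: burst must be at least 1`);
  }
  return { rate: raw.rate, burst, quota: raw.quota, onQuotaExceeded: raw.on_quota_exceeded };
}

const freePolicy = toPolicy("free", { rate: 10, burst_multiplier: 2, quota: 50000, on_quota_exceeded: "block" });
const proPolicy = toPolicy("pro", { rate: 100, burst_multiplier: 3, quota: 5000000, on_quota_exceeded: "block" });
```

Rejecting a bad entry at load time keeps a typo in the config from reaching the request path.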

Implementation walkthrough (Redis)

Both limits are enforced atomically in Redis so the decision survives across stateless nodes — the same authoritative-store reasoning behind the Redis counter architecture. The walkthrough below resolves the tier, then runs a single Lua script that checks the rate bucket and the quota counter together, returning a decision plus the values needed for headers.

// Tiered enforcement: resolve tier (cached) -> atomic rate + quota check in Redis.
import Redis from "ioredis";
const redis = new Redis(process.env.REDIS_URL!);

type Policy = { rate: number; burst: number; quota: number | null };
const POLICIES: Record<string, Policy> = {
  free:       { rate: 10,   burst: 20,   quota: 50_000 },
  pro:        { rate: 100,  burst: 300,  quota: 5_000_000 },
  enterprise: { rate: 1000, burst: 2000, quota: null },
};

// Short-TTL cache of key -> {account, tier}. Invalidated on plan change (see hot-reload).
const tierCache = new Map<string, { account: string; tier: string; exp: number }>();

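// lookupAccount is assumed to be provided elsewhere (DB or auth-service client).
declare function lookupAccount(apiKey: string): Promise<{ account: string; tier: string }>;

async function resolveTier(apiKey: string) {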
  const hit = tierCache.get(apiKey);
  if (hit && hit.exp > Date.now()) return hit;
  const row = await lookupAccount(apiKey);          // DB / auth-service call
  const entry = { account: row.account, tier: row.tier, exp: Date.now() + 30_000 };
  tierCache.set(apiKey, entry);
  return entry;
}

// Atomic: refill+consume rate bucket, then INCR quota with billing-month TTL.
// Returns { decision, rate_remaining, quota_remaining, reset }.
const ENFORCE_LUA = `
local rkey, qkey = KEYS[1], KEYS[2]
local cap   = tonumber(ARGV[1])   -- burst capacity
local rate  = tonumber(ARGV[2])   -- tokens/sec
local quota = tonumber(ARGV[3])   -- -1 means uncapped
local nowms = tonumber(ARGV[4])
local qttl  = tonumber(ARGV[5])   -- seconds until billing reset
-- 1) rate bucket (token bucket)
local b = redis.call('HMGET', rkey, 'tokens', 'ts')
local tokens = tonumber(b[1]) or cap
local ts     = tonumber(b[2]) or nowms
tokens = math.min(cap, tokens + (nowms - ts) / 1000 * rate)
if tokens < 1 then
  redis.call('HSET', rkey, 'tokens', tokens, 'ts', nowms)
  redis.call('PEXPIRE', rkey, math.ceil(cap / rate * 1000) + 1000)
  return { 'RATE', math.floor(tokens), -1, 0 }       -- 429
end
-- 2) quota counter (only touched once rate passes; read first so that
--    rejected requests do not inflate the count)
local used = 0
if quota >= 0 then
  used = tonumber(redis.call('GET', qkey) or '0')
  if used >= quota then return { 'QUOTA', math.floor(tokens), 0, qttl } end  -- 402
  used = redis.call('INCR', qkey)
  if used == 1 then redis.call('EXPIRE', qkey, qttl) end
end
tokens = tokens - 1
redis.call('HSET', rkey, 'tokens', tokens, 'ts', nowms)
redis.call('PEXPIRE', rkey, math.ceil(cap / rate * 1000) + 1000)
local q_remaining = quota >= 0 and (quota - used) or -1
return { 'OK', math.floor(tokens), q_remaining, qttl }`;

function secondsToBillingReset(): number {
  const now = new Date();
  const next = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth() + 1, 1));
  return Math.ceil((next.getTime() - now.getTime()) / 1000);
}

export async function enforce(apiKey: string) {
  const { account, tier } = await resolveTier(apiKey);
  const p = POLICIES[tier] ?? POLICIES.free;          // unknown tier -> safest limit
  const period = new Date().toISOString().slice(0, 7); // "2026-06"
  const [decision, rRem, qRem, reset] = (await redis.eval(
    ENFORCE_LUA, 2,
    `rl:rate:${apiKey}`, `rl:quota:${account}:${period}`,
    p.burst, p.rate, p.quota ?? -1, Date.now(), secondsToBillingReset(),
  )) as [string, number, number, number];
  return { decision, tier, rateRemaining: rRem, quotaRemaining: qRem, reset };
}
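The refill arithmetic in step 1 of the Lua script is easier to reason about, and to unit test, as a pure function. The sketch below is an in-memory model of that step only; `takeToken` and `Bucket` are hypothetical names, and the model is neither atomic nor shared across nodes, so it does not replace the script.

```typescript
// In-memory model of the script's rate step: refill, then consume one token.
type Bucket = { tokens: number; ts: number };

function takeToken(
  b: Bucket | undefined,
  cap: number,
  rate: number,
  nowMs: number,
): { allowed: boolean; bucket: Bucket } {
  const prev = b ?? { tokens: cap, ts: nowMs }; // a new bucket starts full
  const refilled = Math.min(cap, prev.tokens + ((nowMs - prev.ts) / 1000) * rate);
  if (refilled < 1) {
    return { allowed: false, bucket: { tokens: refilled, ts: nowMs } };
  }
  return { allowed: true, bucket: { tokens: refilled - 1, ts: nowMs } };
}

// Free tier: rate 10/s, capacity 20. Send 25 requests at t = 0, then wait.
let bucket: Bucket | undefined;
let allowedAtT0 = 0;
for (let i = 0; i < 25; i++) {
  const r = takeToken(bucket, 20, 10, 0);
  bucket = r.bucket;
  if (r.allowed) allowedAtT0++;
}
const after50ms = takeToken(bucket, 20, 10, 50);    // only half a token back: rejected
bucket = after50ms.bucket;
const after200ms = takeToken(bucket, 20, 10, 200);  // two tokens accumulated: allowed
```

At 10 tokens per second a drained bucket needs 100 ms to regain one token, which is why the 50 ms retry fails and the later one succeeds.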

The HTTP layer maps the decision to a status and headers:

const { decision, tier, rateRemaining, quotaRemaining, reset } = await enforce(apiKey);
res.setHeader("RateLimit-Limit", (POLICIES[tier] ?? POLICIES.free).rate); // same fallback as enforce()
res.setHeader("RateLimit-Remaining", Math.max(0, rateRemaining));
if (quotaRemaining >= 0) {
  res.setHeader("X-Quota-Remaining", quotaRemaining);
  res.setHeader("X-Quota-Reset", new Date(Date.now() + reset * 1000).toUTCString());
}
if (decision === "RATE")  { res.setHeader("Retry-After", 1); return res.status(429).json({ error: "rate_limited" }); }
if (decision === "QUOTA") { return res.status(402).json({ error: "quota_exceeded", reset }); }
// 200: proceed

Response contract

| Scenario | Status | Key headers | Body `error` |
|---|---|---|---|
| Allowed | `200` | `RateLimit-Limit`, `RateLimit-Remaining`, `X-Quota-Remaining` | none |
| Rate exceeded | `429` | `RateLimit-Remaining: 0`, `Retry-After` | `rate_limited` |
| Quota exhausted (hard) | `402` | `X-Quota-Remaining: 0`, `X-Quota-Reset` | `quota_exceeded` |
| Quota exhausted (overage) | `200` | `X-Quota-Remaining: 0`, `X-Quota-Overage: <n>` | none |
| Unknown / revoked key | `401`/`403` | none | `invalid_key` |

Emit the standard RateLimit-* (and legacy X-RateLimit-* if clients still read them) for the rate axis, and a separate X-Quota-* family for the long-window quota, so a client can tell which limit it hit. The draft RateLimit header semantics are covered under the response-headers area.
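From the client's side, the contract above reduces to a three-way decision: proceed, retry after a delay, or stop until the quota resets. The helper below is a hypothetical sketch of that decision, not part of any SDK. It assumes lower-cased header names, a `Retry-After` expressed in seconds, and an `X-Quota-Reset` in HTTP-date form.

```typescript
// Hypothetical client-side helper: what to do with a response from the API.
type ResponseHeaders = Record<string, string | undefined>;
type Action =
  | { kind: "proceed" }
  | { kind: "retry"; afterMs: number }
  | { kind: "stop"; resetAt: Date | undefined };

function nextAction(status: number, headers: ResponseHeaders, errorCode?: string): Action {
  if (status === 429) {
    // Assumes delta-seconds; falls back to 1 s if the header is absent or malformed.
    const sec = Number(headers["retry-after"]);
    const delaySec = Number.isFinite(sec) && sec >= 0 ? sec : 1;
    return { kind: "retry", afterMs: delaySec * 1000 };
  }
  // A 403 counts as quota exhaustion only with the machine-readable reason.
  const quotaSpent = status === 402 || (status === 403 && errorCode === "quota_exceeded");
  if (quotaSpent) {
    const raw = headers["x-quota-reset"];
    const parsed = raw ? new Date(raw) : undefined;
    const resetAt = parsed && !Number.isNaN(parsed.getTime()) ? parsed : undefined;
    return { kind: "stop", resetAt }; // retrying cannot succeed before the reset
  }
  return { kind: "proceed" };
}

const onRate = nextAction(429, { "retry-after": "2" });
const onQuota = nextAction(402, { "x-quota-reset": "Wed, 01 Jul 2026 00:00:00 GMT" });
const onInvalidKey = nextAction(403, {}, "invalid_key");
```

Keeping the quota case out of the retry path is what stops a client from hammering a limit it cannot get past.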

Distributed & scaling considerations

Cache the tier, not the decision. The key→tier mapping changes rarely (only on plan change); cache it for 15–60 s with active invalidation. Never cache the limit decision — that is what Redis is for.
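Scope the quota counter to the billing period. Keying by account:YYYY-MM makes the reset automatic, because the first request of a new month lands on a fresh key; setting EXPIRE to seconds-until-month-end only garbage-collects the old counter, so no cron sweep is needed.
Co-locate the two keys. A single Lua call makes both checks one round-trip instead of two, but on Redis Cluster a script may only touch keys in the same hash slot, so give rl:rate:* and rl:quota:* a shared hash tag (for example the account id in braces).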
Tier skew creates hot keys. A few enterprise accounts can dominate Redis traffic; shard high-rate accounts or front them with a local pre-filter.
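Whether two keys co-locate on Redis Cluster can be checked offline, since the slot is CRC16 of the key (or of its `{hash tag}`, if present) modulo 16384. The sketch below reimplements that calculation for illustration; it assumes ASCII-only key names, and the tagged key layout in `rateKey` / `quotaKey` is a hypothetical variation on the walkthrough's key names, not the layout the walkthrough uses.

```typescript
// CRC16, XMODEM variant (polynomial 0x1021, initial value 0), over ASCII bytes.
function crc16(s: string): number {
  let crc = 0;
  for (let i = 0; i < s.length; i++) {
    crc ^= (s.charCodeAt(i) & 0xff) << 8;
    for (let bit = 0; bit < 8; bit++) {
      crc = (crc & 0x8000) !== 0 ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
    }
  }
  return crc;
}

// If the key contains a non-empty {tag}, only the tag is hashed.
function keySlot(key: string): number {
  const open = key.indexOf("{");
  if (open !== -1) {
    const close = key.indexOf("}", open + 1);
    if (close > open + 1) return crc16(key.slice(open + 1, close)) % 16384;
  }
  return crc16(key) % 16384;
}

// Hypothetical layout: tag both counters with the account id so that one
// script can touch both keys. The rate bucket is still per API key.
const rateKey = (account: string, apiKey: string) => `rl:{${account}}:rate:${apiKey}`;
const quotaKey = (account: string, period: string) => `rl:{${account}}:quota:${period}`;

const sameSlot = keySlot(rateKey("acct_1", "key_A")) === keySlot(quotaKey("acct_1", "2026-06"));
```

The trade-off is that all traffic for one account lands on one shard, which sharpens the hot-key concern for very large accounts.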

Failure modes & mitigations

Stale tier after an upgrade. A customer upgrades but the cached key→tier entry still says free, so they stay throttled for the TTL. Mitigation: publish an invalidation event (pub/sub) on plan change to evict the key immediately; keep TTL short as a backstop.
Quota double-count on retries. A client retrying a 5xx can INCR the quota twice for one logical call. Mitigation: idempotency keys, covered in billing-critical sliding-log usage.
Redis outage. Fail-open keeps customers served but lets quota slip; fail-closed protects revenue but returns 402/429 during your incident. Decide per axis — usually fail-open on rate, fail-closed (or queue-and-reconcile) on billing-critical quota.
Unknown tier defaults to enterprise. A config typo or missing tier silently grants the largest reservoir. Always default an unrecognized tier to the smallest policy.
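Clock skew on the rate bucket. Read Redis server time inside the script (redis.call('TIME')) rather than passing each node’s clock as an argument, which the walkthrough above does for brevity. On Redis versions before 5, a script must call redis.replicate_commands() before it can write after reading TIME. See distributed algorithm sync.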
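The per-axis outage policy can be made explicit in code so that it is a reviewed decision rather than whatever the error handler happens to do. The sketch below is hypothetical: `FAIL_MODE`, `StoreResult`, and `resolveAxis` are illustrative names, and the open/closed choices simply follow the usual split described above.

```typescript
// Hypothetical fallback policy for when the counter store is unreachable.
type Axis = "rate" | "quota";
type FailMode = "open" | "closed";
type StoreResult = { ok: true; allowed: boolean } | { ok: false };

// Usual split: keep serving on the rate axis, protect the billing axis.
const FAIL_MODE: Record<Axis, FailMode> = { rate: "open", quota: "closed" };

function resolveAxis(axis: Axis, result: StoreResult): { allowed: boolean; degraded: boolean } {
  if (result.ok) {
    return { allowed: result.allowed, degraded: false };
  }
  // Store error: apply the per-axis policy and flag the decision so that
  // usage can be reconciled once the store is back.
  return { allowed: FAIL_MODE[axis] === "open", degraded: true };
}

const rateDuringOutage = resolveAxis("rate", { ok: false });
const quotaDuringOutage = resolveAxis("quota", { ok: false });
const normalDeny = resolveAxis("rate", { ok: true, allowed: false });
```

The `degraded` flag is the hook for queue-and-reconcile: requests served or refused without a real counter check are the ones to revisit later.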

Hot-reloading tier changes

Two distinct events must propagate fast: a plan change for one account (invalidate that key’s cached tier) and a policy change for a whole tier (reload tiers.yaml). Watch the config source (a file, a config service, or a Redis key) and swap the in-memory policy table atomically on change; broadcast per-account invalidations over pub/sub so every node drops the stale key→tier entry within milliseconds.

const sub = redis.duplicate();   // a subscribed ioredis connection cannot run other commands
sub.subscribe("tier:invalidate");
sub.on("message", (_ch, apiKey) => tierCache.delete(apiKey)); // evict on upgrade
// On tiers.yaml change: parse, validate, then atomically replace POLICIES.
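The "parse, validate, then atomically replace" step might look like the sketch below. It assumes the new config has already been parsed into plain objects; `reloadPolicies`, `getPolicy`, and the validation rules are hypothetical. The idea is that readers only ever see a complete old table or a complete new one, and that a bad edit leaves the old table in place.

```typescript
// Hypothetical reload step: validate the whole table, then swap one reference.
type Policy = { rate: number; burst: number; quota: number | null };
type PolicyTable = Readonly<Record<string, Policy>>;

let current: PolicyTable = Object.freeze({
  free: { rate: 10, burst: 20, quota: 50000 },
});

function isValid(p: Policy): boolean {
  return Number.isInteger(p.rate) && p.rate > 0 &&
    Number.isInteger(p.burst) && p.burst >= 1 &&
    (p.quota === null || (Number.isInteger(p.quota) && p.quota >= 0));
}

// Returns true if the new table was installed. On any invalid entry the old
// table stays in place, so a bad edit cannot take limits offline.
function reloadPolicies(next: Record<string, Policy>): boolean {
  if (!("free" in next)) return false; // the fallback tier must always exist
  const allValid = Object.keys(next).every((tier) => isValid(next[tier]));
  if (!allValid) return false;
  current = Object.freeze({ ...next }); // single reference swap
  return true;
}

// Unknown tiers fall back to the smallest policy, never the largest.
const getPolicy = (tier: string): Policy => current[tier] ?? current.free;

const rejected = reloadPolicies({
  free: { rate: 10, burst: 20, quota: 50000 },
  pro: { rate: 0, burst: 300, quota: 5000000 }, // invalid: rate must be > 0
});
const proAfterRejectedReload = getPolicy("pro").rate; // still the free fallback
const accepted = reloadPolicies({
  free: { rate: 10, burst: 20, quota: 50000 },
  pro: { rate: 100, burst: 300, quota: 5000000 },
});
```

Because requests read `current` once per decision, a reload never exposes a half-updated table.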

In this section

Per-Tier Quota Enforcement With Redis — the step-by-step build: resolve key→tier, atomic Lua token bucket keyed by (account, tier), emit headers, verify under load.
API Key Scoping & Rate Limits — scoping limits by key, scope/permission, and route; hierarchical keys and key-class buckets to control cardinality.
Billing-Critical Sliding-Log Usage — using an exact sliding log for metered usage, with idempotency, audit trails, and reconciliation against the billing system.

Backend Middleware & Distributed Tracking — the parent topic on distributed throttling state.
Redis Counter Architecture — how the authoritative counters are built and scaled.
Sliding Log Counters — the exact algorithm behind billing-grade usage metering.
Fixed Window vs Sliding Window — window choice for the rate axis.