Tiered Access & Quota Enforcement

Selling an API in plans (free, pro, enterprise) means every request must be measured against the limit that belongs to the caller’s tier, not against a single global number. This is the throttling problem that the Backend Middleware & Distributed Tracking pillar makes concrete. A free key might get 10 requests/second and 50,000 calls/month; a pro key 100 rps and 5,000,000/month; an enterprise key a negotiated 1,000 rps with a soft monthly cap. The enforcement layer has to resolve the caller to a tier on every request, apply two independent limits (a short-window rate and a long-window quota), and answer with the right status code and headers so that clients and billing systems both behave correctly. Get the mapping wrong and you either throttle a paying customer at the free-tier rate or hand free users the enterprise reservoir.

This area sits one layer above the raw limiter algorithms. A token bucket or sliding log decides whether a single bucket has capacity; tiered enforcement decides which bucket a request belongs to, how big that bucket is, and what to say when it is empty. The hard parts are all in that indirection: key→tier resolution under cache invalidation, two-layer limits with different windows and different failure semantics, and hot-reloading plan config when a customer upgrades mid-month.

The two-layer model: rate vs quota

Tiered systems almost always enforce two limits at once, because they protect different things:

  • Rate limit (short window): protects your infrastructure from bursts. Measured in requests per second or per minute, reset continuously. A token bucket or fixed-window counter is the right tool. Exceeding it returns 429 Too Many Requests with Retry-After — the client should back off and retry.
  • Quota (long window): protects your revenue model. Measured in requests per month (or day), reset on a billing boundary. Exceeding it is not a transient condition the client can retry past; it means the plan’s allotment is spent. The correct response is 402 Payment Required (or a 403 with a machine-readable quota_exceeded reason) — the client must upgrade or wait for the reset, not retry.

A request must pass both checks. The order matters: check the cheap, high-frequency rate limit first (it rejects floods before they touch the quota counter), then the quota. Conflating the two — treating a spent monthly quota as a 429 — trains clients to hammer your API with retries against a wall they can never get past.

| Property | Rate limit | Monthly quota |
| --- | --- | --- |
| Window | 1 s – 1 min, rolling/fixed | Calendar or billing month |
| Protects | Backend capacity, fairness | Revenue, plan boundaries |
| Algorithm | Token bucket / fixed window | Atomic counter with billing-boundary reset |
| Over-limit status | 429 Too Many Requests | 402 Payment Required / 403 quota_exceeded |
| Client action | Back off, retry after Retry-After | Stop; upgrade plan or wait for reset |
| Reset signal | Continuous / window epoch | X-Quota-Reset (billing boundary) |
| Counter precision | Approximate is fine | Exact if it drives billing |
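
The check order can be sketched as a small pure function. This is an in-memory illustration only, not the Redis-backed path; `Counters`, `check`, and the field names are hypothetical and exist just for this example.

```typescript
// In-memory illustration of the two-layer check. All names here are hypothetical.
type Decision = { status: 200 | 429 | 402; reason?: "rate_limited" | "quota_exceeded" };

interface Counters {
  tokens: number;    // tokens left in the short-window rate bucket
  quotaUsed: number; // requests counted in the current billing period
}

function check(c: Counters, quota: number | null): Decision {
  // 1) Rate first: a flood is rejected before it can touch the quota counter.
  if (c.tokens < 1) return { status: 429, reason: "rate_limited" };
  // 2) Quota second: a spent allotment is not retryable, so it is not a 429.
  if (quota !== null && c.quotaUsed >= quota) {
    return { status: 402, reason: "quota_exceeded" };
  }
  // Both checks passed: consume one unit from each layer.
  c.tokens -= 1;
  c.quotaUsed += 1;
  return { status: 200 };
}
```

Because a rejected request returns before either counter is touched, retries against an exhausted quota do not inflate the usage figure.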

Mechanism: key → tier → limit

The state the enforcement layer manipulates per request:

  • identity — the API key, account id, or token presented by the caller (the cardinality driver).
  • tier — the resolved plan: free | pro | enterprise | custom:<id>. Derived from identity via an account lookup, cached aggressively.
  • policy — the tier’s parameters: rate, burst, quota, window. Loaded from config, hot-reloadable.
  • Two counters — a rate-limit bucket keyed by (identity, "rate") and a quota counter keyed by (account, "quota", billing_period).

The per-request decision is O(1): one cached tier lookup, one atomic rate check, one atomic quota check. The expensive operation is the uncached tier resolution (a DB or auth-service call), which is why the resolved identity → tier → policy mapping is cached with a short TTL and invalidated on plan change.
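
As a rough sketch of how the two counter keys could be derived, the helpers below build the rate key from the identity and the quota key from the account plus billing period. The function names are hypothetical; the key prefixes follow the `rl:rate:` and `rl:quota:` naming used later in this section, and a UTC calendar month is assumed as the billing period.

```typescript
// Hypothetical helpers: derive the billing period and the two counter keys.
function billingPeriod(now: Date): string {
  // UTC calendar month, e.g. "2026-06"; toISOString() always reports UTC.
  return now.toISOString().slice(0, 7);
}

// Rate bucket: one per presented identity (API key).
function rateKey(identity: string): string {
  return `rl:rate:${identity}`;
}

// Quota counter: one per account per billing period, shared by all of its keys.
function quotaKey(account: string, now: Date): string {
  return `rl:quota:${account}:${billingPeriod(now)}`;
}
```

Keying the quota by account rather than by API key means that issuing several keys to one customer does not multiply their monthly allotment.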

Figure: Request flow from API key through tier lookup to a per-tier limiter and decision. A request resolves its API key to a tier (cached lookup), loads the tier policy (rate + quota), checks the per-second rate bucket and then the per-month quota counter, and returns 200 OK, 429, or 402.

Configuration reference: tier parameters

Plan limits belong in declarative config, not scattered through code, so they can be reviewed, versioned, and hot-reloaded. A typical per-tier policy:

| Parameter | Type | Example (free / pro / ent) | Effect |
| --- | --- | --- | --- |
| rate | int (req/s) | 10 / 100 / 1000 | Sustained per-second refill rate |
| burst | int (tokens) | 20 / 300 / 2000 | Bucket capacity; momentary burst above rate |
| burst_multiplier | float | 2.0 / 3.0 / 2.0 | Convenience: burst = rate × multiplier |
| quota | int (req/month) | 50000 / 5000000 / null | Monthly allotment; null = uncapped (soft) |
| quota_window | enum | calendar_month | When the quota counter resets |
| on_quota_exceeded | enum | block / bill_overage | Hard 402 vs. meter overage and keep serving |
| concurrency | int | 5 / 50 / 500 | Max in-flight requests (optional third axis) |
| priority | int | 0 / 5 / 10 | Shed lowest-priority tiers first under pressure |

# tiers.yaml — single source of truth, hot-reloaded on change
tiers:
  free:
    rate: 10            # tokens/sec
    burst_multiplier: 2 # capacity = 20
    quota: 50000        # requests/month
    quota_window: calendar_month
    on_quota_exceeded: block      # -> 402
  pro:
    rate: 100
    burst_multiplier: 3 # capacity = 300
    quota: 5000000
    quota_window: calendar_month
    on_quota_exceeded: block
  enterprise:
    rate: 1000
    burst_multiplier: 2 # capacity = 2000
    quota: null         # uncapped; soft alerting only
    quota_window: calendar_month
    on_quota_exceeded: bill_overage
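
One way to turn a parsed config entry into a runtime policy is sketched below. It assumes the YAML has already been parsed into plain objects (the parser is out of scope), and both `RawTier` and `toPolicy` are hypothetical names. Letting an explicit `burst` take precedence over `burst_multiplier` is an assumption of this sketch, not something the config reference specifies.

```typescript
// Sketch: turn one parsed tiers.yaml entry into a runtime policy.
// RawTier mirrors the YAML fields; toPolicy is a hypothetical helper.
interface RawTier {
  rate: number;
  burst?: number;
  burst_multiplier?: number;
  quota: number | null;
}

interface Policy { rate: number; burst: number; quota: number | null }

function toPolicy(name: string, raw: RawTier): Policy {
  if (!Number.isInteger(raw.rate) || raw.rate <= 0) {
    throw new Error(`${name}: rate must be a positive integer`);
  }
  if (raw.quota !== null && (!Number.isInteger(raw.quota) || raw.quota < 0)) {
    throw new Error(`${name}: quota must be null or a non-negative integer`);
  }
  // Assumption: an explicit burst wins; otherwise derive it as rate × multiplier.
  const burst = raw.burst ?? Math.floor(raw.rate * (raw.burst_multiplier ?? 1));
  if (burst < 1) throw new Error(`${name}: burst must be at least 1`);
  return { rate: raw.rate, burst, quota: raw.quota };
}
```

Rejecting a bad entry at load time keeps a typo in the config from reaching the request path.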

Implementation walkthrough (Redis)

Both limits are enforced atomically in Redis so that every stateless node reaches the same decision, which is the same authoritative-store reasoning behind the Redis counter architecture. The walkthrough below resolves the tier, then runs a single Lua script that checks the rate bucket and the quota counter together, returning a decision plus the values needed for headers.

// Tiered enforcement: resolve tier (cached) -> atomic rate + quota check in Redis.
import Redis from "ioredis";
const redis = new Redis(process.env.REDIS_URL!);

type Policy = { rate: number; burst: number; quota: number | null };
const POLICIES: Record<string, Policy> = {
  free:       { rate: 10,   burst: 20,   quota: 50_000 },
  pro:        { rate: 100,  burst: 300,  quota: 5_000_000 },
  enterprise: { rate: 1000, burst: 2000, quota: null },
};

// Short-TTL cache of key -> {account, tier}. Invalidated on plan change (see hot-reload).
const tierCache = new Map<string, { account: string; tier: string; exp: number }>();

async function resolveTier(apiKey: string) {
  const hit = tierCache.get(apiKey);
  if (hit && hit.exp > Date.now()) return hit;
  const row = await lookupAccount(apiKey);          // DB / auth-service call
  const entry = { account: row.account, tier: row.tier, exp: Date.now() + 30_000 };
  tierCache.set(apiKey, entry);
  return entry;
}

// Atomic: refill+consume rate bucket, then INCR quota with billing-month TTL.
// Returns { decision, rate_remaining, quota_remaining, reset }.
const ENFORCE_LUA = `
local rkey, qkey = KEYS[1], KEYS[2]
local cap   = tonumber(ARGV[1])   -- burst capacity
local rate  = tonumber(ARGV[2])   -- tokens/sec
local quota = tonumber(ARGV[3])   -- -1 means uncapped
local nowms = tonumber(ARGV[4])
local qttl  = tonumber(ARGV[5])   -- seconds until billing reset
-- 1) rate bucket (token bucket)
local b = redis.call('HMGET', rkey, 'tokens', 'ts')
local tokens = tonumber(b[1]) or cap
local ts     = tonumber(b[2]) or nowms
-- Clamp elapsed time at 0 so a node with a lagging clock cannot drain the bucket.
tokens = math.min(cap, tokens + math.max(0, nowms - ts) / 1000 * rate)
if tokens < 1 then
  redis.call('HSET', rkey, 'tokens', tokens, 'ts', nowms)
  redis.call('PEXPIRE', rkey, math.ceil(cap / rate * 1000) + 1000)
  return { 'RATE', math.floor(tokens), -1, 0 }       -- 429
end
-- 2) quota counter (only consumed once rate passes)
-- Read before incrementing so rejected requests never inflate the usage count.
local used = 0
if quota >= 0 then
  used = tonumber(redis.call('GET', qkey)) or 0
  if used >= quota then return { 'QUOTA', math.floor(tokens), 0, qttl } end  -- 402
  used = redis.call('INCR', qkey)
  if used == 1 then redis.call('EXPIRE', qkey, qttl) end
end
tokens = tokens - 1
redis.call('HSET', rkey, 'tokens', tokens, 'ts', nowms)
redis.call('PEXPIRE', rkey, math.ceil(cap / rate * 1000) + 1000)
local q_remaining = quota >= 0 and (quota - used) or -1
return { 'OK', math.floor(tokens), q_remaining, qttl }`;

function secondsToBillingReset(): number {
  const now = new Date();
  const next = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth() + 1, 1));
  return Math.ceil((next.getTime() - now.getTime()) / 1000);
}

export async function enforce(apiKey: string) {
  const { account, tier } = await resolveTier(apiKey);
  const p = POLICIES[tier] ?? POLICIES.free;          // unknown tier -> safest limit
  const period = new Date().toISOString().slice(0, 7); // "2026-06"
  const [decision, rRem, qRem, reset] = (await redis.eval(
    ENFORCE_LUA, 2,
    `rl:rate:${apiKey}`, `rl:quota:${account}:${period}`,
    p.burst, p.rate, p.quota ?? -1, Date.now(), secondsToBillingReset(),
  )) as [string, number, number, number];
  return { decision, tier, rateRemaining: rRem, quotaRemaining: qRem, reset };
}
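
The refill arithmetic inside the script can be mirrored in a pure function, which is convenient for unit-testing tier parameters without a Redis instance. `refill` is a hypothetical helper; it also clamps negative elapsed time, a guard against clock skew that is an assumption of this sketch.

```typescript
// Pure mirror of the token-bucket refill step, intended for unit tests only.
function refill(
  tokens: number, // tokens stored at the last update
  lastMs: number, // timestamp of the last update (ms)
  nowMs: number,  // current time (ms)
  rate: number,   // tokens added per second
  cap: number,    // bucket capacity (burst)
): number {
  const elapsedSec = Math.max(0, nowMs - lastMs) / 1000; // ignore clocks that run backwards
  return Math.min(cap, tokens + elapsedSec * rate);
}
```

With the free tier's parameters (rate 10, burst 20), a bucket drained to 0 holds 5 tokens after 500 ms and is full again after 2 s.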

The HTTP layer maps the decision to a status and headers:

const { decision, tier, rateRemaining, quotaRemaining, reset } = await enforce(apiKey);
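// Same fallback as enforce(): an unknown tier must not crash the header lookup.
res.setHeader("RateLimit-Limit", (POLICIES[tier] ?? POLICIES.free).rate);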
res.setHeader("RateLimit-Remaining", Math.max(0, rateRemaining));
if (quotaRemaining >= 0) {
  res.setHeader("X-Quota-Remaining", quotaRemaining);
  res.setHeader("X-Quota-Reset", new Date(Date.now() + reset * 1000).toUTCString());
}
if (decision === "RATE")  { res.setHeader("Retry-After", 1); return res.status(429).json({ error: "rate_limited" }); }
if (decision === "QUOTA") { return res.status(402).json({ error: "quota_exceeded", reset }); }
// 200: proceed

Response contract

| Scenario | Status | Key headers | Body error |
| --- | --- | --- | --- |
| Allowed | 200 | RateLimit-Limit, RateLimit-Remaining, X-Quota-Remaining | - |
| Rate exceeded | 429 | RateLimit-Remaining: 0, Retry-After | rate_limited |
| Quota exhausted (hard) | 402 | X-Quota-Remaining: 0, X-Quota-Reset | quota_exceeded |
| Quota exhausted (overage) | 200 | X-Quota-Remaining: 0, X-Quota-Overage: <n> | - |
| Unknown / revoked key | 401/403 | - | invalid_key |
Emit the IETF-draft RateLimit-* headers (plus the legacy X-RateLimit-* variants if clients still read them) for the rate axis, and a separate X-Quota-* family for the long-window quota, so a client can tell which limit it hit. The draft RateLimit header semantics are covered under the response-headers area.
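
The contract can be expressed as one pure mapping from an enforcement result to status, headers, and error code, which keeps it testable apart from any HTTP framework. This is a sketch: the `EnforceResult` shape, the `toReply` name, and the separate "OVERAGE" outcome are assumptions made for this example.

```typescript
// Sketch: map an enforcement result to the status, headers, and error code of the contract.
// The EnforceResult shape and the "OVERAGE" outcome are assumptions of this example.
type Outcome = "OK" | "RATE" | "QUOTA" | "OVERAGE";

interface EnforceResult {
  outcome: Outcome;
  rateLimit: number;      // tier's requests/second
  rateRemaining: number;  // tokens left in the bucket
  quotaRemaining: number; // -1 means uncapped
  overage: number;        // requests served beyond the quota (bill_overage tiers)
  retryAfterSec: number;  // suggested back-off for a 429
  resetAtMs: number;      // billing-boundary reset, epoch milliseconds
}

interface Reply { status: number; headers: Record<string, string>; error?: string }

function toReply(r: EnforceResult): Reply {
  const headers: Record<string, string> = {
    "RateLimit-Limit": String(r.rateLimit),
    "RateLimit-Remaining": String(Math.max(0, r.rateRemaining)),
  };
  // Quota headers are only meaningful for capped tiers.
  if (r.quotaRemaining >= 0) {
    headers["X-Quota-Remaining"] = String(r.quotaRemaining);
    headers["X-Quota-Reset"] = new Date(r.resetAtMs).toUTCString();
  }
  switch (r.outcome) {
    case "RATE":
      headers["Retry-After"] = String(r.retryAfterSec);
      return { status: 429, headers, error: "rate_limited" };
    case "QUOTA":
      return { status: 402, headers, error: "quota_exceeded" };
    case "OVERAGE":
      headers["X-Quota-Overage"] = String(r.overage);
      return { status: 200, headers };
    default:
      return { status: 200, headers };
  }
}
```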

Distributed & scaling considerations

  • Cache the tier, not the decision. The key→tier mapping changes rarely (only on plan change); cache it for 15–60 s with active invalidation. Never cache the limit decision — that is what Redis is for.
  • Keep the quota counter on a billing-boundary TTL. Keying by account:YYYY-MM and setting EXPIRE to seconds-until-month-end makes the reset automatic and avoids a cron sweep.
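  • Co-locate the two keys. Run both checks in a single Lua call so they cost one round-trip, not two. On Redis Cluster a multi-key script only works when every key maps to the same hash slot, so give rl:rate:* and rl:quota:* a shared hash tag (for example {account}).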
  • Tier skew creates hot keys. A few enterprise accounts can dominate Redis traffic; shard high-rate accounts or front them with a local pre-filter.
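
Co-location on Redis Cluster is usually done with hash tags: when a key contains a non-empty `{...}` section, only that section is hashed to pick the slot. The sketch below uses a hypothetical key layout that differs from the `rl:rate:` / `rl:quota:` names used elsewhere in this section; `clusterKeys` and `hashTag` are illustrative helpers, and account ids are assumed not to contain braces.

```typescript
// Sketch: give both counters the same hash tag so one script may touch them on Redis Cluster.
// The key layout is hypothetical; account ids are assumed to contain no '{' or '}'.
function clusterKeys(account: string, apiKey: string, period: string) {
  const tag = `{${account}}`; // only the text inside the braces is hashed
  return {
    rate: `rl:${tag}:rate:${apiKey}`,
    quota: `rl:${tag}:quota:${period}`,
  };
}

// Mirrors the hash-tag rule: the first {...} with non-empty content decides the slot;
// otherwise the whole key is hashed.
function hashTag(key: string): string {
  const open = key.indexOf("{");
  const close = open === -1 ? -1 : key.indexOf("}", open + 1);
  return close > open + 1 ? key.slice(open + 1, close) : key;
}
```

Tagging by account places all of that account's counters in one slot, so this trades script compatibility against the hot-key concern in the last bullet above.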

Failure modes & mitigations

  • Stale tier after an upgrade. A customer upgrades but the cached key→tier entry still says free, so they stay throttled for the TTL. Mitigation: publish an invalidation event (pub/sub) on plan change to evict the key immediately; keep TTL short as a backstop.
  • Quota double-count on retries. A client retrying a 5xx can INCR the quota twice for one logical call. Mitigation: idempotency keys, covered in billing-critical sliding-log usage.
  • Redis outage. Fail-open keeps customers served but lets quota slip; fail-closed protects revenue but returns 402/429 during your incident. Decide per axis — usually fail-open on rate, fail-closed (or queue-and-reconcile) on billing-critical quota.
  • Unknown tier defaults to enterprise. A config typo or missing tier silently grants the largest reservoir. Always default an unrecognized tier to the smallest policy.
  • Clock skew on the rate bucket. Read Redis server time inside the script (redis.call('TIME')) rather than trusting per-node clocks, which is what passing Date.now() from each node amounts to; see distributed algorithm sync.
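
The per-axis outage decision can be made explicit in code instead of being left to whichever exception handler fires first. The sketch below is synchronous for brevity (a real store call would be async), and `guarded` and `ON_STORE_ERROR` are hypothetical names. The defaults follow the suggestion above: fail open on rate, fail closed on quota.

```typescript
// Sketch: choose fail-open or fail-closed per axis when the counter store is unreachable.
type Axis = "rate" | "quota";
type Mode = "open" | "closed";

// Assumed defaults: open on rate, closed on billing-critical quota.
const ON_STORE_ERROR: Record<Axis, Mode> = { rate: "open", quota: "closed" };

// Runs the check; if the store throws, falls back to the axis's configured mode.
function guarded(axis: Axis, check: () => boolean): boolean {
  try {
    return check(); // normal path: the store answers
  } catch {
    return ON_STORE_ERROR[axis] === "open"; // store down: allow only if this axis fails open
  }
}
```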

Hot-reloading tier changes

Two distinct events must propagate fast: a plan change for one account (invalidate that key’s cached tier) and a policy change for a whole tier (reload tiers.yaml). Watch the config source (a file, a config service, or a Redis key) and swap the in-memory policy table atomically on change; broadcast per-account invalidations over pub/sub so every node drops the stale key→tier entry within milliseconds.

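// A connection in subscriber mode cannot run regular commands, so use a dedicated one.
const sub = redis.duplicate();
sub.subscribe("tier:invalidate").catch(console.error);
sub.on("message", (_ch, apiKey) => tierCache.delete(apiKey)); // evict on upgrade
// On tiers.yaml change: parse, validate, then atomically replace POLICIES.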
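
The "validate, then swap" step for a whole-tier policy change might look like the sketch below. It assumes the new table has already been parsed, and the names `current`, `isValid`, `reload`, and `policyFor` are hypothetical. The essential points are that a bad config never replaces a good one and that readers only ever see a complete table.

```typescript
// Sketch: validate a freshly parsed policy table, then swap it in with one assignment.
type Policy = { rate: number; burst: number; quota: number | null };
type PolicyTable = Record<string, Policy>;

let current: PolicyTable = { free: { rate: 10, burst: 20, quota: 50_000 } };

function isValid(t: PolicyTable): boolean {
  // Require the fallback tier and sane numbers in every entry.
  return "free" in t && Object.values(t).every(
    (p) => p.rate > 0 && p.burst >= 1 && (p.quota === null || p.quota >= 0),
  );
}

function reload(candidate: PolicyTable): boolean {
  if (!isValid(candidate)) return false; // keep serving with the last good table
  current = candidate;                   // single reference swap, never a partial update
  return true;
}

function policyFor(tier: string): Policy {
  return current[tier] ?? current.free;  // unknown tier -> smallest policy
}
```

Replacing the reference rather than mutating entries in place means a request that read `current` just before the swap still sees a consistent, if slightly older, policy.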
