Implementing Sliding Window in Memory

Q: When should I use a raw timestamp buffer versus the weighted approximation?

Use the raw circular buffer for exact counts when max_requests per client is small, since memory is max_requests times 8 bytes. Switch to the weighted two-counter approximation when limits are large or you have millions of keys; it is strict O(1) per client at the cost of a small approximation error near the boundary.

Q: How do I stop inactive clients from leaking memory?

Run a periodic sweep that deletes any client whose buffer count has drained to zero, and enforce a hard memory limit with LRU eviction as a backstop. Without eviction, the per-client map grows unbounded as one-off IPs accumulate.

1. Architectural Context: Why In-Memory Sliding Windows?

Naive fixed-window counters introduce boundary spikes where traffic doubles at window rollover, violating strict API SLAs. This guide shows how to build a single-node sliding window in process memory, the concrete implementation behind the precision advantage described in the parent Fixed Window vs Sliding Window guide. For low-latency API gateways, the operational gap between fixed counters and precise sliding logic is measured in sub-2ms P99 latency and strict heap allocation constraints.

When evaluating where this primitive fits, the in-memory sliding window emerges as the optimal compromise between distributed accuracy and local execution speed. By anchoring state to process memory, you bypass network round-trips and serialization overhead inherent in Redis-backed counters. However, this architecture demands rigorous memory management: every millisecond of GC pause or unbounded heap growth directly translates to dropped requests or cascading timeouts.

To guarantee deterministic performance, implementations must rely on monotonic clock sources rather than wall-clock time. Monotonic timers (process.hrtime.bigint() in Node.js; in Go, time.Now() embeds a monotonic reading since Go 1.9, so a t2.Sub(t1) duration stays correct across NTP steps) prevent negative deltas during clock adjustments and ensure window boundaries advance predictably under all system load conditions.

2. Core Data Structure: Bounded Circular Buffers & Sorted Timestamps

Tracking request timestamps efficiently requires a memory layout that avoids dynamic allocations during the hot path. Two primary approaches exist: dynamic sorted sets (e.g., TreeSet, BTree) and array-backed circular buffers. While sorted sets offer O(log N) lookups, they incur pointer chasing, node allocations, and unpredictable GC pressure. For high-throughput rate limiting, a pre-allocated circular buffer is strictly superior.

The diagram traces how headIndex and tailIndex move over a fixed-capacity ring: new timestamps are written at the head, expired ones are skipped by advancing the tail, and the live count is the gap between the two pointers.

Memory Layout & Pointer Arithmetic

The buffer stores raw timestamp integers (int64 for nanosecond precision). Two pointers govern traversal:

headIndex: Points to the next available slot for insertion.
tailIndex: Points to the oldest valid timestamp within the active window.

Insertion is strictly O(1). When a new request arrives, the timestamp is written to data[headIndex], and headIndex advances modulo capacity. Boundary calculation avoids full array scans by lazily advancing tailIndex past expired entries. The active request count is derived as (headIndex - tailIndex + capacity) % capacity.

Garbage Collection Avoidance

Dynamic arrays trigger stop-the-world GC cycles when resized. By pre-allocating maxRequests * sizeof(int64) bytes during initialization and reusing the same memory block across the process lifecycle, you eliminate allocation churn. Timestamp precision should be capped at milliseconds for rate limiting; nanosecond granularity rarely impacts business logic but doubles storage footprint and comparison overhead.

3. Implementation Pattern: Thread-Safe Request Tracking

Concurrency control is the critical failure surface for in-memory counters. In multi-threaded or worker-threaded runtimes, unprotected writes to shared circular buffers cause double-counting or lost increments. The following implementation uses atomic pointer advancement and per-client mutex isolation to guarantee linearizability without global lock contention.

package ratelimit

import (
	"sync"
	"time"
)

// CircularBuffer holds pre-allocated timestamp slots
type CircularBuffer struct {
	data []int64
	capacity int
	headIndex int
	tailIndex int
	count int
	mu sync.Mutex
}

// InMemorySlidingWindowCounter implements thread-safe sliding window logic
type InMemorySlidingWindowCounter struct {
	maxRequests int
	windowMs int64
	cleanupIntervalMs int64
	clients sync.Map
	lastCleanupTs int64
	lockMutex sync.Mutex
}

// NewInMemorySlidingWindowCounter initializes the counter with bounded memory
func NewInMemorySlidingWindowCounter(maxRequests int, windowMs int64, cleanupIntervalMs int64) *InMemorySlidingWindowCounter {
	return &InMemorySlidingWindowCounter{
		maxRequests:       maxRequests,
		windowMs:          windowMs,
		cleanupIntervalMs: cleanupIntervalMs,
		lastCleanupTs:     time.Now().UnixMilli(),
	}
}

// RecordRequest evaluates and tracks a client request. Returns true if allowed.
func (c *InMemorySlidingWindowCounter) RecordRequest(clientID string) bool {
	buf, _ := c.clients.LoadOrStore(clientID, &CircularBuffer{
		data:     make([]int64, c.maxRequests),
		capacity: c.maxRequests,
	})
	cb := buf.(*CircularBuffer)

	cb.mu.Lock()
	defer cb.mu.Unlock()

	now := time.Now().UnixMilli()
	cutoff := now - c.windowMs

	// Advance tail past expired entries (amortized O(1))
	for cb.count > 0 && cb.data[cb.tailIndex] <= cutoff {
		cb.tailIndex = (cb.tailIndex + 1) % cb.capacity
		cb.count--
	}

	if cb.count >= c.maxRequests {
		return false
	}

	// Insert timestamp and advance head
	cb.data[cb.headIndex] = now
	cb.headIndex = (cb.headIndex + 1) % cb.capacity
	cb.count++
	return true
}

// GetCurrentUsage returns active requests within the sliding window
func (c *InMemorySlidingWindowCounter) GetCurrentUsage(clientID string) int {
	buf, ok := c.clients.Load(clientID)
	if !ok {
		return 0
	}
	cb := buf.(*CircularBuffer)
	cb.mu.Lock()
	defer cb.mu.Unlock()

	now := time.Now().UnixMilli()
	cutoff := now - c.windowMs
	active := 0

	for i := 0; i < cb.count; i++ {
		idx := (cb.tailIndex + i) % cb.capacity
		if cb.data[idx] > cutoff {
			active++
		}
	}
	return active
}

// EvictStaleEntries runs periodic cleanup to reclaim memory from inactive clients
func (c *InMemorySlidingWindowCounter) EvictStaleEntries() {
	c.lockMutex.Lock()
	now := time.Now().UnixMilli()
	if now-c.lastCleanupTs < c.cleanupIntervalMs {
		c.lockMutex.Unlock()
		return
	}
	c.lastCleanupTs = now
	c.lockMutex.Unlock()

	c.clients.Range(func(key, value interface{}) bool {
		cb := value.(*CircularBuffer)
		cb.mu.Lock()
		if cb.count == 0 {
			c.clients.Delete(key)
		}
		cb.mu.Unlock()
		return true
	})
}

// Reset clears all state for a specific client
func (c *InMemorySlidingWindowCounter) Reset(clientID string) {
	c.clients.Delete(clientID)
}

Concurrency Strategy

Lock-Free CAS: For single-writer scenarios, sync/atomic can replace sync.Mutex on headIndex and tailIndex. However, timestamp comparison requires sequential consistency, making mutex-protected critical sections safer for general API gateways.
Client Isolation: Using sync.Map with per-client buffers prevents hot-key contention. Each client’s rate limit state is evaluated independently, enabling dynamic weight calculation based on tier or subscription level.

4. Configuration Blueprint & Tuning Parameters

Configuration directly dictates memory footprint and CPU utilization. The schema below maps operational constraints to runtime behavior.

rate_limiting:
 sliding_window:
 window_size_ms: 60000 # 1-minute evaluation horizon
 max_requests: 100 # Buffer capacity per client
 cleanup_interval_ms: 10000 # Background eviction frequency
 memory_limit_mb: 256 # Hard cap for process heap
 eviction_policy: "lru" # Fallback when memory_limit_mb is reached

Mapping to System Resources

Buffer Capacity: max_requests * 8 bytes (int64) per client. At 10,000 concurrent clients with max_requests: 100, memory consumption is ~7.6 MB. Exceeding memory_limit_mb triggers forced LRU eviction.
CPU Cycles: cleanup_interval_ms controls background goroutine frequency. Too low (<1s) causes thread contention; too high (>30s) allows stale client maps to bloat.
Trade-Off Analysis: As detailed in Fixed Window vs Sliding Window, sliding windows consume 3–5x more memory and CPU than fixed counters. When max_requests exceeds 10,000 or window_size_ms drops below 100ms, computational overhead justifies migrating to token-bucket approximations or distributed probabilistic counters.

5. Failure-Mode Analysis & Edge Case Mitigation

In-memory sliding windows are highly performant but introduce specific failure surfaces. Systematic mitigation is required for production API gateways.

Failure Mode	Reproduction / Impact	Mitigation Strategy
Race Conditions	Concurrent writes to `headIndex` without atomic guards cause double-counting.	Use `sync.Mutex` per client buffer or `sync/atomic.CompareAndSwapInt64` for pointer advancement. Never mutate buffer state without holding the lock.
Memory Leaks	Stale client IDs persist indefinitely if `EvictStaleEntries` fails or `tailIndex` never advances.	Implement periodic LRU sweep with hard `memory_limit_mb` enforcement. Force-delete clients exceeding TTL regardless of buffer state.
Clock Skew	NTP syncs or system hibernation cause `now - window` to jump backward, invalidating timestamps.	Anchor all calculations to monotonic clocks. Reject wall-clock deltas >500ms as anomalous and trigger circuit breaker fallback.
Burst Overload	Sudden traffic spikes fill buffers faster than eviction runs, causing O(N) scan latency.	Implement token-bucket fallback for >2x baseline RPS. Queue excess requests with exponential backoff before hard rejection.
GC Pressure	High-frequency timestamp allocations trigger stop-the-world pauses >10ms.	Pre-allocate fixed-size buffers at startup. Use object pooling for map entries. Zero dynamic allocations in the `RecordRequest` hot path.

Fallback Architecture

When memory limits are breached or GC pauses exceed 5ms, route traffic to a circuit breaker state. Return 429 Too Many Requests with Retry-After headers derived from window_size_ms. Log overflow events to metrics pipelines for capacity planning.

6. Performance Validation & Production Readiness

Deploying in-memory sliding windows requires rigorous load testing and observability integration. The following methodology ensures platform teams can validate latency and memory constraints before traffic exposure.

Benchmarking Framework Setup

Toolchain: k6 or wrk2 for sustained RPS injection. Target 50k–100k concurrent virtual users.
Metrics Collection: Instrument P99 latency, heap allocation rate (alloc_rate), and GC pause times (gc_pause_ns).
Success Criteria:

P99 latency < 2ms under 90% of max_requests load
Heap growth < 5% over 15-minute steady state
GC pause < 5ms (99th percentile)

Heap Profiling Commands

# Capture allocation hotspots
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap

# Trace GC pauses and scheduler latency
go tool trace -http=:8080 trace.out

# Monitor live memory pressure
watch -n 1 'ps -o pid,rss,vsz -p $(pgrep gateway)'

Production Rollout Checklist

Canary Deployment: Route 1–5% of traffic to the new rate limiter. Compare 429 **Canary Deployment**: Route 1–5% of traffic to the new rate limiter. Compare `429` rejection rates against baseline.
Alert Thresholds: Configure alerts for heap_usage > 80%, gc_pause_p99 > 5ms, and eviction_queue_depth > 1000 **Alert Thresholds**: Configure alerts for `heap_usage > 80%`, `gc_pause_p99 > 5ms`, and `eviction_queue_depth > 1000`.
Graceful Degradation **Graceful Degradation**: Ensure fallback to fixed-window counters or upstream proxy limits triggers automatically on OOM or mutex deadlock.
State Persistence **State Persistence**: For rolling deployments, implement warm-start cache population or accept stateless cold-start for the first window cycle.
Documentation: Publish client-facing headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset **Documentation**: Publish client-facing headers (`X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`) aligned with the sliding window calculation.

When validated against these criteria, in-memory sliding windows deliver deterministic, sub-millisecond rate limiting suitable for high-throughput microservice architectures.

Frequently Asked Questions

When should I use a raw timestamp buffer versus the weighted approximation?

Use the raw circular buffer when you need exact counts and max_requests per client is small (say under a few hundred), since memory is max_requests × 8 bytes. Switch to the weighted two-counter approximation when limits are large or you have millions of keys — it is strict O(1) per client at the cost of a small approximation error near the boundary.

Why monotonic clocks instead of wall-clock time?

An NTP step or hibernation can move wall-clock time backward, producing negative deltas that corrupt the window boundary and let traffic through. Monotonic sources (process.hrtime.bigint() in Node, the monotonic reading embedded in Go's time.Now()) only ever move forward, so now - window stays correct.

Does an in-memory window hold a global limit across nodes?

No. Each process keeps its own buffer, so N nodes enforce up to N times the limit in aggregate — the same per-node overshoot that affects any local counter. If the advertised limit is global, back it with a shared store; an in-memory window is correct only for single-node deployments or coarse per-node caps.

How do I stop inactive clients from leaking memory?

Run a periodic sweep (the EvictStaleEntries routine) that deletes any client whose buffer count has drained to zero, and enforce a hard memory_limit_mb with LRU eviction as a backstop. Without eviction, the per-client map grows unbounded as one-off IPs accumulate.

Fixed Window vs Sliding Window — the parent guide and the boundary-spike motivation.
Fixed Window Counter Drift Explained — the exact artifact this implementation removes.
Sliding Log Counters — the distributed, exact-accounting counterpart.
Redis vs In-Memory Token Bucket — when local state is enough versus a shared store.