You Don't Need Distributed Rate Limiting. Yet.
Build simple, visible, enforceable limits before you build complex ones
Every team building an API gets rate limiting wrong in the same direction. They over-engineer it.
They read about token bucket algorithms. They deploy Redis with atomic Lua scripts. They design a multi-tier limiting system with burst windows, daily quotas, and per-endpoint overrides. They spend three weeks building it.
And then they launch with 50 users and the limiting system is never stress-tested in production because the load never gets close to any of the limits.
Meanwhile, the simple problems go unsolved: limits aren't communicated to developers, quota exhaustion isn't visible to users, and when someone legitimately hits a limit, the error message says "internal server error."
Rate limiting should start simple. Visible. Enforceable. Then, and only then, does it become distributed — when the actual traffic patterns of an actual production system demand it.
This post is about building the right thing at each stage.
I – Three Goals, One Priority Order
Rate limiting serves three goals. Know them in order.
Abuse prevention. A single actor must not be able to exhaust shared resources. A burst of 10,000 requests in one second from one API key should not affect other users.
Fairness. In a multi-tenant system, resource consumption must be bounded per tenant. One tenant's heavy workload must not degrade service for others.
Cost control. Your infrastructure has costs. Embedding API calls, vector queries, database reads — all of these have a per-unit cost. A tenant consuming 1,000x their expected usage must be blocked before the cost is externalized to you.
These goals are listed in priority order for early-stage systems. Abuse prevention is a day-one concern. Fairness becomes important when you have multiple active tenants. Cost control becomes urgent when you have high-cost operations and variable consumption patterns.
II – Start with Database-Backed Counters
The simplest rate limiter that works: a usage table in your database.
```sql
CREATE TABLE usage_daily (
    project_id TEXT NOT NULL,
    date DATE NOT NULL DEFAULT CURRENT_DATE,
    operation TEXT NOT NULL,
    count INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (project_id, date, operation)
);
```
On every API call:
```sql
INSERT INTO usage_daily (project_id, date, operation, count)
VALUES ($projectId, CURRENT_DATE, $operation, 1)
ON CONFLICT (project_id, date, operation)
DO UPDATE SET count = usage_daily.count + 1
RETURNING count;
```
The `RETURNING count` value tells you whether the limit has been exceeded:

```
if count > project_quota.daily_limit:
    return 429
```
As written, the upsert is atomic: concurrent requests serialize on the row, so each sees an accurate post-increment count. The approximation creeps in when you split the read from the increment or cache counts in memory — then two requests can both see a count of 999 and both be allowed through when the limit is 1,000. For a daily limit, an over-allowance of a few requests is almost always acceptable. The limit is a fairness control, not a hard contract.
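Wiring the upsert to a 429 takes only a few lines. The helper below is a hypothetical sketch, shown against SQLite so it runs without a server (`INSERT ... ON CONFLICT ... RETURNING` needs SQLite 3.35+; the Postgres version is syntactically almost identical):

```python
import sqlite3

def check_daily_limit(conn, project_id, operation, daily_limit):
    """Atomically increment today's counter and check it against the limit.

    The upsert serializes concurrent requests on the row, so the
    returned count is exact. Returns (allowed, count).
    """
    row = conn.execute(
        """
        INSERT INTO usage_daily (project_id, date, operation, count)
        VALUES (?, DATE('now'), ?, 1)
        ON CONFLICT (project_id, date, operation)
        DO UPDATE SET count = usage_daily.count + 1
        RETURNING count
        """,
        (project_id, operation),
    ).fetchone()
    conn.commit()
    count = row[0]
    return count <= daily_limit, count
```

Note that the counter still increments for blocked requests; that is harmless for a fairness limit, since the count resets with the date.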
This counter approach works well for:
- Daily limits (reset at midnight UTC)
- Moderate traffic (< 100 req/s per project)
- Single-region deployments
The database counter's advantages: it's auditable (you can query the usage table), it persists across restarts, and it doesn't require additional infrastructure.
III – In-Memory + Persistent Hybrid
Database counters have latency. If every API call requires a synchronous INSERT ... ON CONFLICT, that's an extra database round-trip on every request.
The hybrid approach: maintain an in-memory counter that periodically flushes to the database.
```
in-memory: { "proj_abc:2024-03-01:recall": 847 }
database:  { "proj_abc:2024-03-01:recall": 800 }  // last flush
```

The in-memory counter is incremented on every request (fast, no I/O). Every 30 seconds, the in-memory count is flushed to the database with `UPDATE usage_daily SET count = $inmemory WHERE ...`. The limit check reads from the in-memory counter.
The tradeoff: if the service restarts, you lose the in-memory count since the last flush. The database shows 800 requests but the actual count was 847. After restart, the counter resets to the database value. The tenant gets an effective discount of 47 requests.
For daily limits, this is acceptable. You're not in the business of charging by the millisecond. For burst limits that need to be precise, you need a different approach.
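A minimal sketch of the hybrid, assuming a single application instance. `HybridCounter` and the `flush_to_db` callback are illustrative names; the database write is left as a callback so the request path stays I/O-free:

```python
import threading
from collections import defaultdict

class HybridCounter:
    """In-memory counters, flushed to a durable store on an interval."""

    def __init__(self, flush_to_db):
        self._counts = defaultdict(int)  # e.g. "proj_abc:2024-03-01:recall" -> 847
        self._lock = threading.Lock()
        self._flush_to_db = flush_to_db  # callback receiving {key: count}

    def increment(self, key):
        # Fast path: no I/O on the request path.
        with self._lock:
            self._counts[key] += 1
            return self._counts[key]

    def allow(self, key, limit):
        return self.increment(key) <= limit

    def flush(self):
        # Periodic: write absolute counts. On restart, increments since
        # the last flush are lost -- the tenant gets a small discount.
        with self._lock:
            snapshot = dict(self._counts)
        self._flush_to_db(snapshot)
```

In practice you would schedule `flush()` from a background timer (e.g. `threading.Timer` or your framework's scheduler) every 30 seconds and on graceful shutdown.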
IV – Burst Limits
Daily limits prevent exhaustion over time. Burst limits prevent exhaustion over a short window.
A reasonable burst limit: N requests per minute per project.
Implement with a sliding window using a Redis sorted set:
```python
import time
import uuid

key = f"burst:{project_id}"
now = time.time()
window_start = now - 60  # 60-second window

pipe = redis.pipeline()
# Add this request (unique member, so same-timestamp requests don't collide)
pipe.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
# Remove timestamps outside the window
pipe.zremrangebyscore(key, 0, window_start)
# Count requests in window
pipe.zcard(key)
# Set TTL to auto-expire idle keys
pipe.expire(key, 120)
_, _, count, _ = pipe.execute()

if count > burst_limit:
    return 429
```
This is a sliding window counter. Every request adds a timestamp. Requests older than 60 seconds are removed. The count reflects requests in the last 60 seconds exactly.
Redis burst limiting requires Redis. If you're not running Redis yet, start with a simpler fixed-window approach: count requests in the current minute bucket, block if > limit. Fixed windows allow a burst of 2x the limit at window boundaries (end of minute N + start of minute N+1), which is usually fine.
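A fixed-window limiter is small enough to sketch in full. Everything here (the class name, the injectable clock) is illustrative:

```python
import time

class FixedWindowLimiter:
    """Fixed one-minute windows: the count resets at each minute boundary.

    Allows up to 2x the limit across a boundary (end of one window plus
    start of the next), which is usually acceptable.
    """

    def __init__(self, limit_per_minute, clock=time.time):
        self.limit = limit_per_minute
        self.clock = clock                # injectable for testing
        self._windows = {}                # project_id -> (minute_bucket, count)

    def allow(self, project_id):
        bucket = int(self.clock()) // 60  # current minute bucket
        window, count = self._windows.get(project_id, (bucket, 0))
        if window != bucket:
            count = 0                     # new minute: reset the counter
        count += 1
        self._windows[project_id] = (bucket, count)
        return count <= self.limit
```

No Redis, no sorted sets, and state per project is a single tuple; the price is the boundary burst described above.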
V – Response Headers That Developers Can Use
The most undervalued part of rate limiting: telling developers what's happening.
Standard headers:
```
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1709251200
Retry-After: 3600
```

- `X-RateLimit-Limit`: the total limit for the current period.
- `X-RateLimit-Remaining`: remaining requests before the limit is hit.
- `X-RateLimit-Reset`: Unix timestamp when the counter resets.
- `Retry-After`: seconds until the client can retry (only on 429 responses).
These headers allow developers to build adaptive clients that slow down before hitting limits rather than experiencing 429s. A client that checks X-RateLimit-Remaining and introduces backoff at 20% remaining will never hit a 429 in normal operation.
Return these headers on every response, not just on 429s. If a developer only sees rate limit headers when they're blocked, they can't build proactive rate limiting into their integration.
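Client-side, that backoff policy might look like the following sketch; the 20% threshold and 5-second cap are illustrative choices, not part of any spec:

```python
import time

def adaptive_delay(headers, backoff_threshold=0.2, max_delay=5.0):
    """Compute a pre-request delay from rate-limit response headers.

    Above the threshold, no delay. Below it, spread the remaining
    requests evenly over the time left until the counter resets.
    """
    limit = int(headers.get("X-RateLimit-Limit", 0))
    remaining = int(headers.get("X-RateLimit-Remaining", 0))
    reset_at = float(headers.get("X-RateLimit-Reset", 0))
    if limit <= 0 or remaining <= 0:
        return max_delay  # out of quota, or no headers: wait before retrying
    if remaining / limit > backoff_threshold:
        return 0.0        # plenty of quota left: no delay
    seconds_left = max(reset_at - time.time(), 0.0)
    return min(seconds_left / remaining, max_delay)
```

A client that sleeps for `adaptive_delay(response.headers)` before each call degrades gracefully instead of slamming into 429s.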
VI – Usage Metering Data Model
Limiting is one thing. Metering is another. Metering answers "how much has this tenant used?" Limiting answers "should I block this request?"
They share data but serve different purposes. The metering data model needs to support:
- Per-operation breakdown (recall vs. remember vs. context)
- Time-range aggregations (daily, monthly)
- Billing events (for paid tiers)
```sql
CREATE TABLE usage_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    project_id TEXT NOT NULL,
    operation TEXT NOT NULL,
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    units INTEGER NOT NULL DEFAULT 1,
    metadata JSONB
);

-- Aggregation view
CREATE MATERIALIZED VIEW usage_daily_summary AS
SELECT
    project_id,
    operation,
    DATE(occurred_at) AS date,
    SUM(units) AS total_units
FROM usage_events
GROUP BY project_id, operation, DATE(occurred_at);
```
Usage events are immutable. You don't update them; you insert them. This makes the table an audit log of resource consumption, not just a counter.
The materialized view provides fast aggregation for dashboards and quota checks without scanning the full event table. Note that Postgres materialized views are not updated automatically: schedule a periodic `REFRESH MATERIALIZED VIEW usage_daily_summary` (with `CONCURRENTLY` and a unique index on the view, refreshes won't block reads).
VII – What Breaks First
Counter resets at wrong boundaries. Daily limits reset at midnight. But midnight in which timezone? If your counters reset at midnight UTC and your customer is in UTC-8, their limit resets at 4pm local time. They get two limit windows per business day. A competitor resets at midnight in the customer's timezone and gets one. This isn't a security issue, but it's confusing enough to generate support tickets. Pick a reset policy, document it, and apply it consistently.
Inconsistent limits across replicas. In-memory counters on two application instances are independent. Instance A has counted 800 requests. Instance B has counted 820 requests. Neither knows about the other. The tenant is at 820+800 = 1620 total requests against a limit of 1000. If you're using in-memory counters without synchronization, this is your production behavior. Either use synchronized counters (Redis) or accept the over-allowance and tune your limits conservatively.
Hidden quota exhaustion. A user hits their daily limit at 2pm. They don't get rate limit headers (you didn't implement them). They get a 500 error (you didn't implement the 429 properly). They spend an hour debugging their integration before filing a support ticket. Every minute they're blocked and don't know it is a minute of frustration. Headers and clear error messages are not optional.
Limiter Architecture Decision Table
| Constraint | Recommended Approach |
|---|---|
| < 100 req/s, single region | Database-backed daily counter |
| < 1,000 req/s, burst control needed | In-memory + database flush + Redis burst |
| > 1,000 req/s, multi-region | Distributed counter (Redis Cluster or similar) |
| Per-user burst within a tenant | Sliding window per user_id |
| Hard billing cap | Database counter (must be durable, not approximate) |
Quota Communication Spec
Every 429 response must include:

- HTTP status 429
- `Retry-After` header with seconds until reset
- `X-RateLimit-Reset` header with Unix timestamp
- Body: `{"error": "rate_limit_exceeded", "limit": N, "reset_at": "ISO8601"}`

Every non-429 response must include:

- `X-RateLimit-Limit`
- `X-RateLimit-Remaining`
- `X-RateLimit-Reset`
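A framework-agnostic helper satisfying this spec might look like the following sketch (function names are illustrative):

```python
import json
import time

def rate_limit_headers(limit, remaining, reset_at):
    """Headers to attach to every response, 429 or not."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(remaining, 0)),
        "X-RateLimit-Reset": str(int(reset_at)),
    }

def too_many_requests(limit, reset_at, now=None):
    """Build the (status, headers, body) triple for a 429 response."""
    now = time.time() if now is None else now
    headers = rate_limit_headers(limit, 0, reset_at)
    headers["Retry-After"] = str(max(int(reset_at - now), 0))
    body = json.dumps({
        "error": "rate_limit_exceeded",
        "limit": limit,
        "reset_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(reset_at)),
    })
    return 429, headers, body
```

Keeping this in one place guarantees the headers and the body never drift apart, and that a blocked user never sees a bare 500.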
Build it simple first. Operate it. Learn the actual failure patterns from actual production traffic. Then add complexity only where the patterns demand it.