API Keys Are Not Passwords. Stop Designing Them Like They Are.
A production-grade model for key format, storage, scope, and rotation
Most teams treat API keys like passwords. Store the hash. Compare on login. Done.
That's half the model. And the half they miss is what turns a leaked key into a production incident.
A password identifies a human at a browser. An API key identifies a service, a CI job, or an integration — it carries scope, it controls blast radius, and it needs to be rotatable without downtime. Designing it like a password ignores every one of those requirements.
This post is about what a production-grade API key system actually looks like — from format to storage to scope to rotation to the playbook you run when one leaks.
I – The Threat Model You're Not Thinking About
Before anything else, name what you're defending against.
Leakage in logs. API keys in HTTP headers or query parameters end up in access logs, error logs, load balancer logs, and browser history. If the key is the only thing needed to authenticate, log leakage is a breach.
Leakage in source history. Keys committed to git. Even after removal, git history is permanent. Every fork, clone, and CI runner that ran against that history has had access.
Overpowered keys. A single global key with access to all projects and all operations. One leak, full access. The blast radius is bounded only by your imagination.
Rotation breaking automation. A CI pipeline configured with a static key. The key leaks. You revoke it. The pipeline breaks. The team scrambles. The key gets re-added to production config without going through secret management. Repeat.
Each failure mode has a design solution. None of them are addressed by "hash the key and check it on login."
II – Key Format and Entropy
The format of an API key is not cosmetic. It's a security and operational decision.
A production-quality key format looks like this:
cvk_live_a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6q7r8s9t0
Three parts: prefix, environment, random payload.
Prefix (cvk_): Identifies the issuer. Enables secret scanning tools (GitHub secret scanning, TruffleHog, Gitleaks) to detect leaks automatically. This is table stakes. Register your prefix with GitHub's secret scanning partner program. The setup effort is small, and it will catch leaks you would otherwise miss.
Environment (live vs test): Live keys access real data. Test keys access sandboxed data. Conflating them causes real consequences when a developer accidentally runs a test script against production.
Random payload: Minimum 128 bits of cryptographically secure randomness. Use crypto/rand or equivalent. Not math/rand. Not UUID v4 (only 122 random bits, which is below the 128-bit floor, and not every UUID library draws from a cryptographically secure source). Not a sequential ID.
The full key is never stored. A short leading slice of the key (the lookup prefix) is stored in plaintext so the record can be found. Only a hash of the full key is stored for verification.
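A minimal sketch of minting a key in this format, using only the Go standard library. The hex encoding, the 20-byte payload, the 13-character lookup prefix, and the names MintKey and KeyPattern are assumptions made for illustration; the example key above uses a wider alphabet.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"regexp"
)

const (
	issuerPrefix    = "cvk" // issuer tag taken from the example key above
	payloadBytes    = 20    // assumed: 160 random bits, above the 128-bit floor
	lookupPrefixLen = 13    // assumed: "cvk_live_" plus 4 payload characters
)

// KeyPattern is a hypothetical detection regex for this sketch's hex format,
// the kind of pattern a secret scanner could be configured with.
var KeyPattern = regexp.MustCompile(`^cvk_(live|test)_[0-9a-f]{40}$`)

// MintKey returns a new raw key plus the plaintext lookup prefix that would
// be stored next to the key's hash. The raw key itself is never persisted.
func MintKey(env string) (rawKey, lookupPrefix string, err error) {
	if env != "live" && env != "test" {
		return "", "", fmt.Errorf("unknown environment %q", env)
	}
	payload := make([]byte, payloadBytes)
	// crypto/rand reads from the operating system's secure random source.
	if _, readErr := rand.Read(payload); readErr != nil {
		return "", "", fmt.Errorf("reading random bytes: %w", readErr)
	}
	rawKey = issuerPrefix + "_" + env + "_" + hex.EncodeToString(payload)
	return rawKey, rawKey[:lookupPrefixLen], nil
}
```

Calling MintKey("live") yields a 49-character key and its 13-character lookup prefix. In this sketch the prefix reveals 4 hex characters (16 bits) of the payload, which leaves 144 secret bits, still above the 128-bit minimum.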
III – Prefix Lookup and Hash Verification
This is the piece most implementations get wrong.
You cannot hash-compare an API key without first finding the record to compare against. Finding the record requires either scanning all keys (catastrophically slow) or storing something in plaintext for indexing.
The solution is prefix indexing.
The first N characters of the key (8-16 characters, fixed) are stored in plaintext as a lookup field. When a key arrives:
- Extract the prefix from the incoming key
- Run SELECT * FROM api_keys WHERE prefix = $prefix AND status IN ('active', 'rotating') (keys in a rotation overlap window are still valid, so they must stay in the candidate set)
- For each candidate (usually 1), compute argon2.Verify(storedHash, incomingKey)
- If verified, the key is valid
This keeps verification cost bounded. The lookup is an index lookup that returns a tiny candidate set. The hash verify runs once per candidate, which in practice means once.
prefix: "cvk_live_a1b2" (stored plaintext, indexed)
hash: "$argon2id$..." (stored, never retrievable as plaintext)
The full key is gone after minting. It cannot be recovered. If a user loses it, they rotate.
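The lookup-then-verify flow can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: Store is an in-memory stand-in for the indexed api_keys table, the HashVerifier type is a hypothetical seam where an Argon2id verifier would be plugged in, and demoHash/demoVerify use SHA-256 only to keep the example inside the standard library. SHA-256 is not a substitute for a slow password hash.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
)

const lookupPrefixLen = 13 // assumed fixed prefix length, e.g. "cvk_live_a1b2"

// KeyRecord mirrors the stored fields: a plaintext prefix and a one-way hash.
type KeyRecord struct {
	ID     string
	Prefix string // stored plaintext, indexed
	Hash   string // stored, never retrievable as plaintext
	Status string // "active", "rotating", or "revoked"
}

// HashVerifier reports whether rawKey matches storedHash. In production this
// would wrap an Argon2id verification; the signature is an assumption.
type HashVerifier func(storedHash, rawKey string) bool

// Store is an in-memory stand-in for a table indexed on the prefix column.
type Store struct {
	byPrefix map[string][]KeyRecord
}

func NewStore() *Store { return &Store{byPrefix: make(map[string][]KeyRecord)} }

func (s *Store) Add(r KeyRecord) {
	s.byPrefix[r.Prefix] = append(s.byPrefix[r.Prefix], r)
}

// Authenticate extracts the prefix, fetches the candidate records, and runs
// the expensive verify only against that small candidate set.
func (s *Store) Authenticate(rawKey string, verify HashVerifier) (KeyRecord, bool) {
	if len(rawKey) <= lookupPrefixLen {
		return KeyRecord{}, false
	}
	for _, rec := range s.byPrefix[rawKey[:lookupPrefixLen]] {
		// Keys in a rotation overlap window remain valid; revoked keys do not.
		if rec.Status != "active" && rec.Status != "rotating" {
			continue
		}
		if verify(rec.Hash, rawKey) {
			return rec, true
		}
	}
	return KeyRecord{}, false
}

// demoHash is a stdlib-only stand-in for the slow hash (NOT Argon2id).
func demoHash(rawKey string) string {
	sum := sha256.Sum256([]byte(rawKey))
	return hex.EncodeToString(sum[:])
}

// demoVerify compares in constant time to avoid leaking match position.
func demoVerify(storedHash, rawKey string) bool {
	return subtle.ConstantTimeCompare([]byte(storedHash), []byte(demoHash(rawKey))) == 1
}
```

Passing the verifier as a function keeps the lookup logic independent of the hashing library, so the stand-in can be swapped for a real Argon2id check without touching Authenticate.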
IV – bcrypt vs Argon2: The Real Tradeoff
bcrypt is fine. Argon2 is better. Here's why it matters for API keys specifically.
bcrypt's cost factor is single-dimensional (time). Argon2 lets you tune time, memory, and parallelism independently. For an authentication system under load, this matters — you can tune Argon2 to be fast enough to not bottleneck your verification path while still being expensive enough to resist offline cracking.
The practical recommendation: Argon2id with memory=64MB, iterations=3, parallelism=2. Expect a cost on the order of 200ms per verification, though the exact figure depends on your hardware and should be measured. For a system handling thousands of API calls per second, that's too slow to run on every request.
The solution is short-lived verification caching. After a successful verify, cache the result for 60 seconds keyed on SHA256(rawKey). You get near-zero overhead for hot paths and full verification on cold requests.
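Do not cache the raw key. Key the cache on the hash of the key and store only the verification result, so a compromised cache does not expose the raw key. Evict the entry when a key is revoked; otherwise a revoked key keeps working until its cache entry expires.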
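One way such a cache might look, as a hedged sketch. VerifyCache, its methods, and the injectable clock are hypothetical names chosen for this example; timestamps are plain Unix seconds to keep the clock easy to fake. Entries are keyed on SHA-256 of the raw key and hold only the key ID and an expiry.

```go
package main

import (
	"crypto/sha256"
	"sync"
	"time"
)

type cacheEntry struct {
	keyID     string
	expiresAt int64 // Unix seconds
}

// VerifyCache remembers successful verifications for a short TTL.
// It never stores the raw key, only its SHA-256 digest.
type VerifyCache struct {
	mu         sync.Mutex
	ttlSeconds int64
	now        func() int64 // injectable clock, returns Unix seconds
	entries    map[[sha256.Size]byte]cacheEntry
}

// SystemClock is the clock a real service would pass to NewVerifyCache.
func SystemClock() int64 { return time.Now().Unix() }

func NewVerifyCache(ttlSeconds int64, now func() int64) *VerifyCache {
	return &VerifyCache{
		ttlSeconds: ttlSeconds,
		now:        now,
		entries:    make(map[[sha256.Size]byte]cacheEntry),
	}
}

// Put records that rawKey was just verified as belonging to keyID.
func (c *VerifyCache) Put(rawKey, keyID string) {
	digest := sha256.Sum256([]byte(rawKey))
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[digest] = cacheEntry{keyID: keyID, expiresAt: c.now() + c.ttlSeconds}
}

// Get returns the cached key ID if rawKey was verified within the TTL.
func (c *VerifyCache) Get(rawKey string) (string, bool) {
	digest := sha256.Sum256([]byte(rawKey))
	c.mu.Lock()
	defer c.mu.Unlock()
	entry, ok := c.entries[digest]
	if !ok {
		return "", false
	}
	if c.now() >= entry.expiresAt {
		delete(c.entries, digest) // expired: force a full verification
		return "", false
	}
	return entry.keyID, true
}

// EvictKeyID drops every entry for keyID; call it when the key is revoked.
func (c *VerifyCache) EvictKeyID(keyID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for digest, entry := range c.entries {
		if entry.keyID == keyID {
			delete(c.entries, digest)
		}
	}
}
```

A request handler would call Get first, fall back to the full hash verification on a miss, and call Put only after that verification succeeds. With NewVerifyCache(60, SystemClock) the TTL matches the 60 seconds suggested above.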
V – Scope Modeling
Every API key should carry a scope. A scope answers two questions: what projects can this key access, and what can it do?
The minimal scope model:
{
  "allowed_projects": ["proj_abc123", "proj_def456"],
  "capabilities": ["read", "write"],
  "expires_at": null
}
allowed_projects: [] means no access. allowed_projects: ["*"] means all projects the owning account can access — not all projects in the system.
Capabilities gate operation types. A CI key might have ["read"]. A monitoring integration might have ["read", "list"]. Only human users doing setup need ["read", "write", "admin"].
This isn't just security theater. When a key leaks, scope determines blast radius. A read-only key can't write. A single-project key can't reach other projects. Scoped keys turn incidents into containable events.
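A scope check along these lines could be sketched as below. The Scope struct follows the JSON fields above; ParseScope, Allows, Expired, and the ownerProjects parameter (the projects the owning account can reach) are assumptions introduced for the example.

```go
package main

import (
	"encoding/json"
	"time"
)

// Scope mirrors the minimal scope model from the JSON example.
type Scope struct {
	AllowedProjects []string   `json:"allowed_projects"`
	Capabilities    []string   `json:"capabilities"`
	ExpiresAt       *time.Time `json:"expires_at"` // nil means no expiry
}

// ParseScope decodes a scope document; a JSON null expiry becomes nil.
func ParseScope(raw string) (Scope, error) {
	var s Scope
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return Scope{}, err
	}
	return s, nil
}

func contains(list []string, want string) bool {
	for _, item := range list {
		if item == want {
			return true
		}
	}
	return false
}

// Allows reports whether the scope permits capability on project.
// ownerProjects lists what the owning account can access, so "*" can never
// reach beyond the account's own projects.
func (s Scope) Allows(project, capability string, ownerProjects []string) bool {
	if !contains(s.Capabilities, capability) {
		return false
	}
	if !contains(ownerProjects, project) {
		return false
	}
	return contains(s.AllowedProjects, "*") || contains(s.AllowedProjects, project)
}

// Expired reports whether the scope's expiry, if any, has passed.
func (s Scope) Expired(now time.Time) bool {
	return s.ExpiresAt != nil && !now.Before(*s.ExpiresAt)
}
```

An empty allowed_projects list falls through every branch and denies access, which matches the rule that [] means no access.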
VI – Rotation Without Downtime
The rotation failure pattern: revoke key A, configure key B, discover that three other services were also using key A, scramble.
Rotation that works:
- Overlap windows. Key A and key B are both valid simultaneously. The new key is configured. Services are migrated. Only then is the old key revoked.
- Rotation in the API, not by re-registration. A /rotate endpoint mints a new key and returns it, while marking the old key as "rotating" (still valid for N minutes).
- Automated dependency discovery. Before rotation, query last-used telemetry to find all services using key A. The migration list is the last-used set.
The mental model: rotation is a migration, not a replacement. Treat it accordingly.
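The overlap window can be sketched as a small state change on the old key plus a time check at verification. This is a simplified illustration: the Key struct, Rotate, and Usable are hypothetical, times are Unix seconds, and minting the new key's secret is left out.

```go
package main

import "errors"

// Key holds only the fields this sketch needs.
type Key struct {
	ID            string
	Status        string // "active", "rotating", or "revoked"
	OverlapEndsAt int64  // Unix seconds; meaningful only while rotating
}

// Rotate marks the old key as rotating, valid until now+overlapSeconds, and
// returns the replacement key record. newID is supplied by the caller.
func Rotate(old *Key, newID string, now, overlapSeconds int64) (Key, error) {
	if old.Status != "active" {
		return Key{}, errors.New("only an active key can be rotated")
	}
	old.Status = "rotating"
	old.OverlapEndsAt = now + overlapSeconds
	return Key{ID: newID, Status: "active"}, nil
}

// Usable reports whether the key may still authenticate at time now.
func Usable(k Key, now int64) bool {
	switch k.Status {
	case "active":
		return true
	case "rotating":
		return now < k.OverlapEndsAt // old key survives the overlap window
	default:
		return false
	}
}
```

During the window both keys pass Usable, which is what lets consuming services migrate one at a time instead of all at once.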
VII – Telemetry That Matters
Every key should track:
- last_used_at: updated on every successful verification
- last_used_ip: for anomaly detection
- last_used_user_agent: for client identification
- verification_count: total call volume
With these four fields, you can:
- Detect keys being used from unexpected IPs (a possible sign that the key has leaked)
- Identify unused keys (rotation and cleanup candidates)
- Build anomaly alerts ("this key suddenly went from 10 req/day to 10,000 req/hour")
- Provide usage visibility to customers so they can self-serve rotation decisions
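A rough sketch of how those four fields support two of the uses above: flagging an IP change and listing idle keys. KeyTelemetry, Record, and IdleKeys are hypothetical names, timestamps are Unix seconds, and a real anomaly detector would need more than a single previous IP.

```go
package main

import "sort"

// KeyTelemetry holds the four per-key fields listed above.
type KeyTelemetry struct {
	LastUsedAt        int64 // Unix seconds of the last successful verification
	LastUsedIP        string
	LastUsedUserAgent string
	VerificationCount int64
}

// Record updates the fields after a successful verification. It returns true
// when the caller's IP differs from the previously recorded one, which a
// caller could feed into an alerting pipeline.
func (t *KeyTelemetry) Record(now int64, ip, userAgent string) (ipChanged bool) {
	ipChanged = t.VerificationCount > 0 && t.LastUsedIP != ip
	t.LastUsedAt = now
	t.LastUsedIP = ip
	t.LastUsedUserAgent = userAgent
	t.VerificationCount++
	return ipChanged
}

// IdleKeys returns, in sorted order, the IDs of keys not used for at least
// idleSeconds. These are candidates for rotation or cleanup.
func IdleKeys(byKeyID map[string]*KeyTelemetry, now, idleSeconds int64) []string {
	var idle []string
	for id, t := range byKeyID {
		if now-t.LastUsedAt >= idleSeconds {
			idle = append(idle, id)
		}
	}
	sort.Strings(idle)
	return idle
}
```

Volume-based alerts (such as the jump from 10 requests a day to 10,000 an hour) would additionally need per-window counters, which this sketch omits.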
VIII – What Breaks First
Three things will go wrong before your key system matures.
A key appears in a GitHub commit. If your prefix is registered with secret scanning, GitHub alerts you within minutes. If it isn't, you find out when a customer tells you. The difference is whether you have 10 minutes of exposure or 10 days.
A global key gets used where a scoped key was expected. An integration built on a global key gets handed off between teams. Nobody knows what it accesses. Nobody can scope it down without breaking the integration. You're now paralyzed — you can't revoke without knowing the blast radius. Scoped keys from the start prevent this.
Rotation breaks a CI pipeline. The pipeline was configured with a key that expired. The alert fires at 2am. The fix is a 5-minute config change, but somebody has to be awake to do it. Pre-announcement + overlap windows prevent this.
Key Lifecycle State Machine
ACTIVE → ROTATING (overlap window active)
ROTATING → REVOKED (old key invalidated)
ACTIVE → REVOKED (immediate revocation on compromise)
REVOKED → (terminal, no recovery)
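The transitions above can be encoded as a lookup table so that illegal moves are rejected in one place. A minimal sketch; the State type and the CanTransition and Transition helpers are names invented for this example.

```go
package main

import "fmt"

// State is a key lifecycle state.
type State string

const (
	Active   State = "ACTIVE"
	Rotating State = "ROTATING"
	Revoked  State = "REVOKED"
)

// allowed lists every legal transition. Revoked is terminal, so it maps to
// an empty set.
var allowed = map[State][]State{
	Active:   {Rotating, Revoked},
	Rotating: {Revoked},
	Revoked:  {},
}

// CanTransition reports whether moving from one state to another is legal.
func CanTransition(from, to State) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

// Transition returns the new state, or an error for an illegal move.
func Transition(from, to State) (State, error) {
	if !CanTransition(from, to) {
		return from, fmt.Errorf("illegal transition %s -> %s", from, to)
	}
	return to, nil
}
```

Keeping the table as data means adding a state later is a one-line change, and anything not listed is denied by default.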
Rotation Checklist
- New key minted with same or reduced scope
- New key configured in all consuming services
- Old key verified unused (check last_used_at)
- Old key status set to REVOKED
- Audit log event emitted with rotation reason
- Anomaly alerts confirmed inactive for old key
Revocation SLA Policy
- Suspected compromise: revoke within 15 minutes
- Confirmed compromise: revoke immediately, rotate all keys in the same scope
- Planned rotation: 7-day overlap window, 24-hour pre-announcement
- Expired key cleanup: 30 days after expiry date, revoke if not already
These aren't suggestions. Define them before you need them. Improvising an SLA during an incident makes the incident worse.
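Writing the policy down as data is one way to make sure it exists before an incident. The sketch below encodes the four deadlines from the list above; the Trigger type, the constant names, and RevokeWithin are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Trigger names the event that starts a revocation clock.
type Trigger string

const (
	SuspectedCompromise Trigger = "suspected_compromise"
	ConfirmedCompromise Trigger = "confirmed_compromise"
	PlannedRotation     Trigger = "planned_rotation"
	ExpiredCleanup      Trigger = "expired_cleanup"
)

// PlannedRotationNotice is the pre-announcement lead time for planned rotation.
const PlannedRotationNotice = 24 * time.Hour

// revokeWithin maps each trigger to the maximum time until the key is revoked.
// Each clock starts at a different event: detection for compromises, new-key
// issuance for planned rotation, and the expiry date for cleanup.
var revokeWithin = map[Trigger]time.Duration{
	SuspectedCompromise: 15 * time.Minute,
	ConfirmedCompromise: 0, // revoke immediately
	PlannedRotation:     7 * 24 * time.Hour,
	ExpiredCleanup:      30 * 24 * time.Hour,
}

// RevokeWithin returns the SLA for a trigger, or an error if none is defined.
func RevokeWithin(t Trigger) (time.Duration, error) {
	d, ok := revokeWithin[t]
	if !ok {
		return 0, fmt.Errorf("no revocation SLA defined for trigger %q", t)
	}
	return d, nil
}
```

An unknown trigger returns an error rather than a default, so a new incident type cannot silently proceed without an agreed deadline.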