You Will Have a Credential Leak. The Question Is Whether You're Ready.
Credential leak response must be pre-designed. Improvisation guarantees longer exposure windows.
It is not a question of if. It is a question of when.
A developer will commit an API key to a public repository. A customer will post their key in a support ticket. A build log will leak credentials that were passed as environment variables. A contractor will include keys in documentation. A key will appear in a screenshot in a blog post.
These things happen to teams that are careful. They happen to teams with strong security cultures, secret scanning, and developer training. They happen because humans make mistakes and keys are designed to be easy to copy and paste.
What separates a 15-minute incident from a 3-day incident is not whether you had a security program. It's whether you had a pre-designed response.
Improvisation during an incident means decision-making under stress by people who don't have all the information and aren't sure what to do next. It means delayed revocation while the team debates severity. It means incomplete rotation because nobody knew what else was using the key. It means a postmortem that concludes "we need a runbook" when a tested runbook should already have existed.
Build the response before you need it. This is how.
I – Incident Classification
Not all credential leaks are the same severity. Classify before you act.
P0 — Active exploitation. Evidence of unauthorized use: unusual API traffic patterns, unexpected operations in audit logs, abuse reports from customers. Act immediately.
P1 — Confirmed leak, potential exposure. The key appears in a public repository, a public Slack message, a posted screenshot. No evidence of exploitation yet, but the window is open. Act within 15 minutes.
P2 — Suspected leak. The key may have been exposed in a build log, a shared document, or a communication that could have been viewed by unauthorized parties. No confirmation. Act within 1 hour, starting with investigation.
P3 — Internal leak. The key was shared internally beyond its intended audience (shared in a team chat instead of a secrets manager). Lower risk, but still requires rotation within 24 hours.
Classification determines urgency, not whether you act. Every classified incident gets resolved. P0 incidents get resolved faster.
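The four levels can be encoded so that triage is a lookup rather than a debate. The following is a minimal sketch: the `LeakReport` fields are hypothetical intake questions rather than part of any existing tool, and the deadlines are the ones stated above.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    P0 = "active exploitation"
    P1 = "confirmed leak, potential exposure"
    P2 = "suspected leak"
    P3 = "internal leak"


# Response deadlines taken from the classification above.
RESPONSE_DEADLINE = {
    Severity.P0: timedelta(0),            # act immediately
    Severity.P1: timedelta(minutes=15),
    Severity.P2: timedelta(hours=1),      # start with investigation
    Severity.P3: timedelta(hours=24),     # rotate within a day
}


@dataclass
class LeakReport:
    # Hypothetical fields; adapt them to whatever your intake form captures.
    evidence_of_unauthorized_use: bool
    publicly_visible: bool
    exposure_confirmed: bool


def classify(report: LeakReport) -> Severity:
    """Map an incoming report to a severity level."""
    if report.evidence_of_unauthorized_use:
        return Severity.P0
    if report.exposure_confirmed and report.publicly_visible:
        return Severity.P1
    if not report.exposure_confirmed:
        return Severity.P2
    # Confirmed, but only visible to an internal audience.
    return Severity.P3
```

Every branch returns a severity, which mirrors the rule that classification changes urgency, never whether you act.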
II – Detection Channels
Know where leaks will be reported from.
Automated secret scanning. GitHub secret scanning alerts when a registered key prefix appears in a public repository. Configure this. Register your key prefix with GitHub's secret scanning partner program. This is the fastest channel — alerts arrive within minutes of a commit.
Abuse spike detection. Monitoring that alerts when a single key's request volume spikes anomalously. A key that normally does 100 requests per day doing 10,000 in an hour is either a legitimate automation change or a compromised key. Alert on both and investigate.
Customer reports. A customer notices unexpected operations in their project. They open a support ticket. This is a slow channel (the key was likely in use before the customer noticed), but it's real and happens.
Internal reports. A developer realizes they committed a key. They report it through an internal security channel. This is the ideal channel — fast detection, clear information about where the key appeared.
Proactive scans. Regular automated scans of your own build logs, deployment artifacts, and exported data for key patterns. This catches leaks in environments where automated scanning isn't built in.
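Two of these channels are simple enough to sketch in a few lines. The key format below (a `cvk_` prefix followed by at least 16 characters) and the 10x threshold are assumptions chosen for illustration; substitute your real key pattern and a threshold tuned to your traffic.

```python
import re

# Assumed key format: "cvk_" followed by 16 or more letters, digits, or underscores.
KEY_PATTERN = re.compile(r"\bcvk_[A-Za-z0-9_]{16,}\b")


def find_key_candidates(text: str) -> list[str]:
    """Proactive scan: return strings in a log or artifact that look like keys."""
    return KEY_PATTERN.findall(text)


def is_spike(requests_last_hour: int, typical_daily_requests: int,
             factor: float = 10.0) -> bool:
    """Abuse spike: flag a key whose last hour exceeds `factor` times its usual hourly rate."""
    typical_hourly = max(typical_daily_requests / 24, 1.0)
    return requests_last_hour > factor * typical_hourly
```

For the example in the text, `is_spike(10_000, 100)` flags the key. A flag means "investigate", not "compromised", since a legitimate automation change looks the same.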
III – Immediate Containment (The First 15 Minutes)
The first action is always revocation. Before investigation. Before blast radius analysis. Before customer notification.
Revoke first. Investigate second.
This is counterintuitive. Teams often want to understand the full picture before acting. But every minute the key remains active is a minute an attacker can use it. Revocation is recoverable: you can always issue a new key. Unauthorized access to your data cannot be undone.
The revocation action:
# Emergency revoke
ctxvault keys revoke --key-id=key_abc123 --reason="security_incident" --immediate
This must be:
- A single command that any on-call engineer can run
- Available from any network (not behind VPN that might be down)
- Logged automatically with the reason and the actor who performed the revocation
- Immediately effective (not batched or delayed)
After revocation, the key is dead. Traffic using that key starts returning 401. The exposure window is now closed.
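What the service behind that command needs to guarantee can be sketched with an in-memory stand-in. `KeyStore` and its methods are hypothetical names for illustration, not the actual ctxvault implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class KeyStore:
    """In-memory stand-in for a real key service (hypothetical)."""
    revoked: dict = field(default_factory=dict)    # key_id -> revocation record
    audit_log: list = field(default_factory=list)

    def revoke(self, key_id: str, actor: str, reason: str) -> dict:
        record = {
            "key_id": key_id,
            "actor": actor,
            "reason": reason,
            "revoked_at": datetime.now(timezone.utc),
        }
        # Idempotent: a second call keeps the first revocation record.
        self.revoked.setdefault(key_id, record)
        # Every attempt is logged with the actor and the reason.
        self.audit_log.append({"event_type": "key.revoked", **record})
        return self.revoked[key_id]

    def authenticate(self, key_id: str) -> int:
        """Return the HTTP status a request with this key would receive."""
        return 401 if key_id in self.revoked else 200
```

The revocation takes effect on the very next `authenticate` call, with no batching or propagation delay, which is the property the list above asks for.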
The common objection: "But what if that key is being used by a production system and revoking it causes an outage?"
The answer: a production system using a compromised key is already in a worse state than an outage. A brief outage during key rotation is recoverable. Ongoing unauthorized access to production data is not.
IV – Blast Radius Analysis
After revocation, understand the damage.
The blast radius of a compromised key is determined by:
- Scope. What projects was the key allowed to access? What operations could it perform? A read-only key on one project has a much smaller blast radius than a global read-write key.
- Usage period. How long was the key exposed, from the earliest possible compromise to the moment of revocation? Query audit logs for all operations performed by this key in that period.
- What was accessed. For each operation in the audit log, what data was read or written? Was any sensitive data exposed?
SELECT
    event_type,
    target_id,
    target_type,
    occurred_at,
    metadata
FROM audit_events
WHERE actor_id = 'key_abc123'
  AND occurred_at BETWEEN $leak_start AND $revocation_time
ORDER BY occurred_at ASC;
This query tells you everything the key did. If the audit log is complete (it should be), this is a complete picture of the exposure.
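To show the query running end to end, here is a small sketch against an in-memory SQLite database. The table layout and sample rows are invented for illustration, and timestamps are assumed to be stored as UTC ISO 8601 strings so that string comparison matches chronological order.

```python
import sqlite3

QUERY = """
SELECT event_type, target_id, target_type, occurred_at, metadata
FROM audit_events
WHERE actor_id = :key_id
  AND occurred_at BETWEEN :leak_start AND :revocation_time
ORDER BY occurred_at ASC
"""


def key_activity(conn, key_id, leak_start, revocation_time):
    """Return every audit event for one key inside the exposure window."""
    params = {"key_id": key_id, "leak_start": leak_start,
              "revocation_time": revocation_time}
    return conn.execute(QUERY, params).fetchall()


# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_events (actor_id TEXT, event_type TEXT, target_id TEXT,"
    " target_type TEXT, occurred_at TEXT, metadata TEXT)"
)
conn.executemany(
    "INSERT INTO audit_events VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("key_abc123", "project.read", "proj_1", "project", "2024-01-01T23:00:00Z", "{}"),
        ("key_abc123", "document.export", "doc_9", "document", "2024-01-02T14:30:00Z", "{}"),
        ("key_abc123", "project.read", "proj_1", "project", "2024-01-02T09:00:00Z", "{}"),
        ("key_other", "project.read", "proj_2", "project", "2024-01-02T10:00:00Z", "{}"),
    ],
)
rows = key_activity(conn, "key_abc123", "2024-01-02T00:00:00Z", "2024-01-02T15:00:00Z")
```

Using bound parameters rather than string formatting keeps the query safe to run with values pasted in during an incident.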
Common findings:
- The key was never used after it leaked (attacker hasn't found it yet, or you found it before they did)
- The key was used from an unexpected IP range (attacker had it)
- The key was used to read data, and the affected customers must be notified
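The second finding, use from an unexpected IP range, can be checked mechanically. This sketch assumes each audit event carries a `source_ip` field (for example, parsed out of `metadata`); that field name is an assumption, not something the schema above guarantees.

```python
import ipaddress


def unexpected_sources(events, expected_networks):
    """Return the events whose source IP falls outside every expected network."""
    networks = [ipaddress.ip_network(n) for n in expected_networks]
    flagged = []
    for event in events:
        ip = ipaddress.ip_address(event["source_ip"])
        if not any(ip in net for net in networks):
            flagged.append(event)
    return flagged
```

An empty result does not prove the key was safe, since an attacker may operate from inside an expected range, but a non-empty result is strong evidence for escalating to P0.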
V – Rotation Workflow
Revocation stops the bleeding. Rotation rebuilds the wounded system.
The rotation workflow is about replacing the revoked key everywhere it was used:
- Inventory all consumers. Query last_used_ip and last_used_user_agent from the key's telemetry. Every distinct consumer of the key needs a new key.
- Issue new keys with reduced scope. If the compromised key was a global read-write key, issue replacement keys that are project-scoped and have only the minimum necessary permissions. Use the incident as an opportunity to scope down.
- Deploy new keys to each consumer. For CI pipelines, update the secrets manager configuration. For application services, update environment variables and restart. For human users, issue the new key directly.
- Verify each consumer is operational. After deploying a new key, verify the consumer is making successful API calls. Don't assume rotation succeeded — check.
- Confirm the revoked key is no longer in use. Monitor for any remaining traffic using the old key ID. There should be none. If there is, you've missed a consumer.
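The first and last steps of the workflow lend themselves to a small completeness check. This is a sketch under assumptions: telemetry rows are dictionaries with the `last_used_ip` and `last_used_user_agent` fields named above, and `Consumer` and both functions are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Consumer:
    ip: str
    user_agent: str


def inventory_consumers(telemetry_rows):
    """Collapse telemetry rows into the set of distinct consumers of the key."""
    return {Consumer(r["last_used_ip"], r["last_used_user_agent"])
            for r in telemetry_rows}


def rotation_gaps(consumers, verified, old_key_requests):
    """Return (unverified consumers, whether the old key still sees traffic)."""
    return consumers - verified, old_key_requests > 0


def rotation_complete(consumers, verified, old_key_requests):
    missing, old_key_active = rotation_gaps(consumers, verified, old_key_requests)
    return not missing and not old_key_active
```

Treating rotation as complete only when both conditions hold catches the case where every known consumer is verified but the old key still receives traffic, which means a consumer was missed in the inventory.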
VI – Customer Communication
If the breach exposed customer data, customers need to know.
The communication timeline:
- T+0: Revoke the key
- T+15m: Determine blast radius
- T+30m: Notify affected customers if their data was accessed
- T+24h: Publish a public incident report (if the breach was known publicly)
The notification must include:
- What happened (in plain language, not security jargon)
- When it happened
- What data was affected (specific, not vague)
- What you've done to contain it
- What the customer should do (usually nothing, but be explicit)
- A contact point for questions
What you should not do: wait until you have the full picture before notifying. Customers deserve early notification even if the investigation is ongoing. Update them as you learn more.
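During an incident, relative offsets like T+30m are easy to lose track of. One way to make them concrete is to compute wall-clock deadlines from the moment of revocation; the sketch below uses the timeline above, and the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Offsets from the communication timeline above.
TIMELINE = [
    ("Revoke the key", timedelta(0)),
    ("Determine blast radius", timedelta(minutes=15)),
    ("Notify affected customers if their data was accessed", timedelta(minutes=30)),
    ("Publish a public incident report (if publicly known)", timedelta(hours=24)),
]


def deadlines(t0: datetime):
    """Turn the relative timeline into absolute deadlines starting at t0."""
    return [(step, t0 + offset) for step, offset in TIMELINE]


if __name__ == "__main__":
    for step, due in deadlines(datetime.now(timezone.utc)):
        print(f"{due:%Y-%m-%d %H:%M} UTC  {step}")
```

Posting the resulting list in the incident channel gives everyone the same clock to work against.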
VII – What Breaks First
Delayed revocation due to missing tooling. The on-call engineer knows the key needs to be revoked. They can't find the revocation command. They try to find someone who knows how to do it. 45 minutes pass. Fix: make revocation a single command that every engineer on the team has run before. Practice it. Put it in the runbook.
Incomplete dependency rotation. The key was rotated in CI. It was rotated in the primary application service. It was not rotated in the data export job, the monitoring integration, and the developer's local environment. Those three consumers are still using the old key, which is revoked. They break silently or generate errors. Fix: before rotation, enumerate all consumers from the key's telemetry. Rotation is complete when every consumer is operational on the new key.
Lack of audit trail for impact analysis. The breach is confirmed. You query the audit log to determine what was accessed. The audit log is missing operations from the suspected exposure window — the audit log was only recently implemented and the breach predates it. You cannot accurately assess the blast radius. Fix: comprehensive audit logging from day one. If you don't have it, the impact analysis is incomplete and any customer notification will be vague.
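The audit gap in the third failure mode is worth checking explicitly before trusting any blast radius result. A minimal sketch, assuming you know when audit logging was enabled and the earliest time the key could have leaked (both parameter names are illustrative):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def audit_coverage_gap(audit_log_start: datetime,
                       leak_start: datetime) -> Optional[timedelta]:
    """Return how much of the exposure window predates the audit log, or None if fully covered."""
    if audit_log_start <= leak_start:
        return None
    return audit_log_start - leak_start
```

A non-None result means the impact analysis has a blind spot of that length, and the customer notification should say so rather than imply full knowledge.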
60-Minute Containment Checklist
| Minute | Action |
|---|---|
| 0-5 | Classify severity. Revoke compromised key immediately. |
| 5-15 | Query audit logs for all operations since potential compromise. |
| 15-30 | Assess blast radius: what data was accessed, by whom, from where. |
| 30-45 | Issue replacement keys with reduced scope to all known consumers. |
| 45-55 | Verify all consumers are operational on new keys. |
| 55-60 | Confirm revoked key is generating zero traffic. Incident contained. |
Customer Communication Template
Subject: Security Notice: API Credential Exposure
We're writing to inform you of a security incident affecting your account.
What happened: An API key associated with your account [key prefix: cvk_xyz_] was exposed in [describe context] on [date].
When: The key was active from [creation date] to [revocation timestamp].
What was accessed: During the exposure window, the key was used to [read/write] [describe what]. Specifically: [list specific data types or confirm no access occurred].
What we did: We revoked the key at [time]. We have confirmed no access has occurred since revocation.
What you should do: We have issued a replacement key. Please update your integrations with: [new key or link to key management UI]. Review any operations in your audit log from the exposure window.
If you have questions, reply to this email or contact security@[domain].
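A template is only useful if it cannot go out with a blank left unfilled. The sketch below renders a shortened version of the notice with Python's `string.Template`, whose `substitute` method raises `KeyError` when a field is missing. The field names are illustrative choices, not part of the template above.

```python
from string import Template

NOTICE = Template(
    "Subject: Security Notice: API Credential Exposure\n"
    "\n"
    "What happened: An API key associated with your account "
    "[key prefix: $key_prefix] was exposed in $context on $date.\n"
    "When: The key was active from $created_at to $revoked_at.\n"
    "What was accessed: $access_summary\n"
    "What we did: We revoked the key at $revoked_at.\n"
    "What you should do: $customer_action\n"
)


def render_notice(**fields: str) -> str:
    """Fill in the notice; raises KeyError if any required field is missing."""
    return NOTICE.substitute(**fields)
```

Failing loudly on a missing field is the point of the design: a notice that says "exposed in on" is worse than one that is delayed by a minute while someone fills in the context.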
The response you build before the incident is the response that works. The one you write at 2am under pressure is the one that misses things.