The Cheapest Breach Is the One That Never Happened
How to build data loss prevention into your API's write path before secrets reach your database
At some point, a user will send you an AWS access key.
Not because they're attacking you. Because they're storing context about their own infrastructure. Because they pasted the wrong thing. Because they're testing your API and used a real credential by accident.
If you accept it, you own the problem. The credential is in your database. It will appear in backups, in exports, in support queries, in log lines. It will be there the next time a security researcher queries your API with an over-permissioned key. It will be there when you eventually get breached.
If you reject it at the write path, the problem never existed. The cheapest breach is the one that never happened.
Data loss prevention at ingestion is not complex. But most teams don't build it until after the first incident. This post is about building it first.
I – The Ingestion Threat Model
Text APIs that accept user-supplied content have a different threat model than most engineers think about.
The obvious threat: a malicious user exfiltrating secrets, with your platform as the vector. Store a secret in API X, retrieve it later from API X, and bypass their own company's DLP tools.
The non-obvious threat: accidental persistence. A developer integrating with your API pastes a config block that contains a database URL. A user summarizing their work notes includes a password they typed earlier. These users do not intend to store credentials. They will not notice it happened. But the credential is now in your database, and your system's security posture becomes entangled with theirs.
The third threat: your own infrastructure. What happens when you process and log user payloads for debugging? If the payload contains a secret and you log it, you've just exfiltrated a credential into your own log infrastructure.
All three of these are addressed by the same control: detect and block secrets at the write path, before they reach persistent storage.
II – Pattern Detection Tiers
Not all detection is equal. Build it in tiers, matched to confidence level.
Tier 1: High-confidence patterns. Known secret formats from major providers. These can be detected reliably with regex and should always be blocked.
- AWS Access Keys: `AKIA[0-9A-Z]{16}`
- GitHub Personal Access Tokens: `ghp_[a-zA-Z0-9]{36}`
- Stripe Secret Keys: `sk_(live|test)_[a-zA-Z0-9]{24,}`
- Private Keys: `-----BEGIN (RSA |EC |OPENSSH )?PRIVATE KEY-----`
- Generic high-entropy strings (64+ hex chars, 40+ base64 chars in credential-like contexts)
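These Tier 1 patterns wire into a scanner in a few lines. A minimal sketch in Python (the `scan_tier1` helper and the pattern subset are illustrative; a production pattern set should come from a maintained source):

```python
import re

# Illustrative Tier 1 subset -- production sets should be sourced from
# maintained lists rather than hand-rolled.
TIER1_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_pat": re.compile(r"ghp_[a-zA-Z0-9]{36}"),
    "stripe_secret_key": re.compile(r"sk_(live|test)_[a-zA-Z0-9]{24,}"),
    "private_key": re.compile(r"-----BEGIN (RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}

def scan_tier1(text: str) -> list[dict]:
    """Return match type and position only -- never the raw secret."""
    matches = []
    for name, pattern in TIER1_PATTERNS.items():
        for m in pattern.finditer(text):
            matches.append({"type": name, "start": m.start(), "end": m.end()})
    return matches
```

Because only the type and offsets are returned, nothing downstream of the scanner ever handles the matched secret itself.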
If your API issues its own tokens, register their formats with GitHub's secret scanning partner program as a secondary detection layer. It costs you nothing and catches leaks in git history.
Tier 2: Heuristic patterns. Things that look like secrets but require context. Connection strings (postgres://user:password@host), JWT tokens (three base64 segments joined by dots), bearer tokens in header-like formatting. These have a higher false-positive rate. Block or warn depending on your policy.
Tier 3: Entropy analysis. High-entropy strings above a threshold that don't match known patterns. Useful as a supplementary signal, not a primary blocker. Too many false positives on legitimate high-entropy data (UUIDs, hashes, base64-encoded content) to use as a standalone control.
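Entropy scoring is straightforward Shannon entropy over character frequencies. A sketch (the 4.5 bits/char threshold and 20-character minimum are hypothetical starting points, not tuned values):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character, estimated from character frequencies."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def is_high_entropy(token: str, threshold: float = 4.5) -> bool:
    """Supplementary signal only -- UUIDs and hashes will trip this too."""
    return len(token) >= 20 and shannon_entropy(token) > threshold
```

Note that a UUID scores around 3.5-4 bits/char and a random base64 token closer to 6, which is exactly why this works as a supplementary signal but not a standalone blocker.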
III – Redacted Feedback That Developers Can Actually Use
The wrong response to a detected secret:
```json
{
  "error": "Content rejected by security policy"
}
```
This tells the developer nothing. They will file a support ticket. They will not understand what triggered the policy. They will be frustrated.
The right response:
```json
{
  "error": "content_rejected_dlp",
  "message": "Content contains potential secrets that cannot be stored.",
  "matches": [
    {
      "type": "aws_access_key",
      "redacted_match": "AKIA****EXAMPLE",
      "position": {"start": 142, "end": 162}
    }
  ],
  "action": "remove_or_redact_secrets_before_submitting"
}
```
The match is shown, redacted. The type is named. The position in the payload is given. The developer can find the offending content, remove it, and resubmit. No support ticket. No confusion.
The redaction is important. You are telling the developer where the secret is without echoing the secret back. Do not return the raw matched string. Return the type, a redacted preview, and the position.
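A redaction helper makes this concrete. A sketch assuming a fixed prefix/suffix scheme, which yields previews like AKIA****EXAMPLE (`redact` and `dlp_rejection_response` are illustrative names, not a prescribed API):

```python
def redact(match_text: str, keep_prefix: int = 4, keep_suffix: int = 7) -> str:
    """Show just enough of the match to locate it, never the full secret."""
    if len(match_text) <= keep_prefix + keep_suffix:
        return "*" * len(match_text)
    return match_text[:keep_prefix] + "****" + match_text[-keep_suffix:]

def dlp_rejection_response(matches: list[dict]) -> dict:
    """Build the 422 body from scanner matches ({'type', 'text', 'start', 'end'})."""
    return {
        "error": "content_rejected_dlp",
        "message": "Content contains potential secrets that cannot be stored.",
        "matches": [
            {
                "type": m["type"],
                "redacted_match": redact(m["text"]),
                "position": {"start": m["start"], "end": m["end"]},
            }
            for m in matches
        ],
        "action": "remove_or_redact_secrets_before_submitting",
    }
```

Short matches are fully masked rather than partially revealed, since a four-character prefix of a short secret may be most of the secret.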
IV – Policy Modes
A hard block is not always the right call. Build three modes.
Block. The write is rejected. A 422 is returned with the match details. The content is never persisted. Use this for Tier 1 high-confidence patterns in production.
Warn. The write is accepted but a warning is returned in the response headers or body. The content is persisted with a flag. Use this for Tier 2 heuristic patterns, or for Tier 1 patterns in sandbox/test environments where developers are intentionally using fake credentials.
Quarantine. The write is accepted, the content is stored, but it is flagged as quarantined and not returned in recall or retrieval until reviewed. Use this for patterns you're not confident enough to block but don't want in the active recall pool.
The policy mode should be configurable per pattern tier and per environment. A test environment running block on JWT tokens will drive developers insane. Production should be stricter than staging. Give teams control without removing the control entirely.
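One way to sketch this configurability is a per-tier, per-environment lookup table with a fail-closed default for combinations nobody thought to configure (the table values here are illustrative, not prescriptive):

```python
from enum import Enum

class PolicyMode(Enum):
    BLOCK = "block"
    WARN = "warn"
    QUARANTINE = "quarantine"
    LOG_ONLY = "log_only"

# Hypothetical (tier, environment) -> mode table.
POLICY = {
    ("tier1", "production"): PolicyMode.BLOCK,
    ("tier1", "test"): PolicyMode.WARN,
    ("tier2", "production"): PolicyMode.WARN,
    ("tier2", "test"): PolicyMode.LOG_ONLY,
    ("tier3", "production"): PolicyMode.LOG_ONLY,
    ("tier3", "test"): PolicyMode.LOG_ONLY,
}

def mode_for(tier: str, env: str) -> PolicyMode:
    # Fail closed: an unconfigured combination blocks rather than passes.
    return POLICY.get((tier, env), PolicyMode.BLOCK)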
V – The Logging Problem
Here is a failure mode that ruins the whole system: you detect a secret at ingestion, log the detection event, and include the raw payload in the log.
Now the secret is in your logs. The detection worked and you still have a leak.
The logging rules are simple:
- Never log raw payloads containing detected secrets
- Log the detection event with: timestamp, actor, pattern type, redacted match, policy action taken
- Log the request metadata (endpoint, project, payload size) without the payload
- If you need payload logging for debugging, gate it behind an explicit opt-in flag that is off by default
GOOD: "dlp_match event=content_rejected type=aws_access_key actor=proj_abc redacted_match=AKIA****EXAMPLE"
BAD: "dlp_match payload='...AKIAIOSFODNN7EXAMPLE...' rejected"
The same rule applies to error reporting. If a handler panics while processing a payload containing a secret, the stack trace and context should not include the raw payload.
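These rules can be enforced by construction: give the code a logging helper whose signature simply cannot receive the payload. A sketch (`detection_log_line` is a hypothetical helper, not a real library call):

```python
import logging

logger = logging.getLogger("dlp")

def detection_log_line(event: str, pattern_type: str, actor: str,
                       redacted_match: str, action: str,
                       payload_bytes: int) -> str:
    """Format and emit a detection event. The raw payload has no parameter
    here, so it cannot leak into the log by accident."""
    line = (
        f"dlp_match event={event} type={pattern_type} actor={actor} "
        f"redacted_match={redacted_match} action={action} "
        f"payload_bytes={payload_bytes}"
    )
    logger.info(line)
    return line
```

The point is the signature, not the formatting: metadata fields in, one structured line out, and no code path through which the payload can reach the logger.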
VI – Body Size and Memory Controls
The secondary attack vector is not secrets — it's resource exhaustion.
An API that accepts arbitrary-length text payloads can be given a 500MB string. The DLP scanner will try to process it. It will consume memory. It will time out. In the best case, the request is slow. In the worst case, the service is degraded.
Set hard body size limits before the request ever reaches your handlers. Enforce them at the reverse proxy (nginx, Caddy, HAProxy), before the request body is read into memory, not in application code.
```nginx
client_max_body_size 1m;
```
In application code, enforce a tighter limit appropriate to your content type. A memory system ingesting text notes doesn't need to accept 1MB payloads. 32KB is likely sufficient. 64KB is generous.
Large payload support, if you need it, is a separate chunked upload flow — not a larger limit on your standard write endpoint.
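At the application layer, the tighter limit is a one-line check before any parsing or scanning happens. A sketch with an assumed 32KB cap (the exception name is illustrative; it would map to an HTTP 413 in your handler):

```python
class PayloadTooLarge(Exception):
    """Maps to HTTP 413 in the request handler."""

# Tighter than the proxy's limit; sized to the content type, not the maximum
# the infrastructure can survive.
MAX_BODY_BYTES = 32 * 1024

def enforce_body_limit(body: bytes) -> bytes:
    """Reject oversized bodies before the DLP scanner ever sees them."""
    if len(body) > MAX_BODY_BYTES:
        raise PayloadTooLarge(
            f"payload is {len(body)} bytes; limit is {MAX_BODY_BYTES}"
        )
    return body
```

Running this before the scanner means the DLP regexes only ever process bounded input, which also bounds their worst-case cost.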
VII – What Breaks First
Weak pattern set. A pattern that matches AKIA followed by 16 alphanumeric characters catches standard AWS access keys. But AWS has issued keys with slightly different formats in specific services. Gitleaks and Truffle Security (TruffleHog) maintain up-to-date pattern sets. Use them rather than maintaining your own.
Overblocking. A heuristic pattern that matches UUIDs blocks users trying to store configuration that contains UUIDs. False positives erode trust in the system. Track false positive rates per pattern. Tune or demote patterns that trigger too often on legitimate content.
Detector bypass via encoding. A base64-encoded credential does not match a plaintext regex. An attacker who knows you run DLP can trivially encode their payload. For threat-actor scenarios this matters. For accidental persistence (the more common case) it doesn't — users don't accidentally base64-encode their secrets before pasting them. Calibrate your model accordingly.
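If encoded payloads do matter for your threat model, a best-effort mitigation is to decode long base64 runs and feed the results back through the plaintext scanners. A sketch (the 24-character minimum run length is an arbitrary illustrative choice):

```python
import base64
import re

# Runs of base64 alphabet long enough to plausibly hide a credential.
B64_RUN = re.compile(r"[A-Za-z0-9+/=]{24,}")

def decoded_candidates(text: str) -> list[str]:
    """Decode long base64 runs; printable results get rescanned as plaintext."""
    out = []
    for m in B64_RUN.finditer(text):
        try:
            decoded = base64.b64decode(m.group(), validate=True).decode("ascii")
        except Exception:
            continue  # not valid base64, or not text once decoded
        if decoded.isprintable():
            out.append(decoded)
    return out
```

This only defeats single-layer encoding; a determined attacker can nest encodings. It is a cost raiser, not a guarantee, which is consistent with calibrating for the accidental-persistence case.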
DLP Policy Matrix
| Pattern Tier | Environment | Policy Mode | Alert? |
|---|---|---|---|
| Tier 1 (high-confidence) | Production | Block | Yes |
| Tier 1 (high-confidence) | Test/Sandbox | Warn | No |
| Tier 2 (heuristic) | Production | Warn | Yes (if repeated) |
| Tier 2 (heuristic) | Test/Sandbox | Log only | No |
| Tier 3 (entropy) | All | Log only | No |
Incident Escalation Thresholds
- 1 detection in 24h: Log, no alert
- 5 detections from same actor in 24h: Soft alert to security channel
- 20 detections from same actor in 24h: Hard alert, flag actor for review
- Any Tier 1 detection of a live production credential (verified active): Immediate alert, notify affected credential owner if identifiable
These thresholds are starting points. Run the system in warn-only mode for two weeks, then tune them against your actual traffic patterns before switching to block.
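The per-actor counting behind these thresholds can be done with a sliding window. A sketch (class name and thresholds are illustrative, matching the starting points above):

```python
import time
from collections import defaultdict, deque

class EscalationTracker:
    """Count DLP detections per actor over a 24h sliding window."""

    WINDOW_SECONDS = 24 * 3600
    SOFT_THRESHOLD = 5    # soft alert to security channel
    HARD_THRESHOLD = 20   # hard alert, flag actor for review

    def __init__(self):
        self._events = defaultdict(deque)  # actor -> detection timestamps

    def record(self, actor: str, now=None) -> str:
        now = time.time() if now is None else now
        q = self._events[actor]
        q.append(now)
        # Evict timestamps that have aged out of the window.
        while q and q[0] < now - self.WINDOW_SECONDS:
            q.popleft()
        if len(q) >= self.HARD_THRESHOLD:
            return "hard_alert"
        if len(q) >= self.SOFT_THRESHOLD:
            return "soft_alert"
        return "log_only"
```

An in-memory deque is enough for a single process; behind a load balancer the same window would live in shared storage such as Redis.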
The rule is simple. Reject what you don't need to store. You can't leak what you never accepted.