Your Application Code Is Fine. Your Edge Is Not.
Why most production outages in small API platforms happen at the edge layer, not in business logic
Post-mortems from small API platforms tell the same story over and over.
The application code didn't fail. The database didn't fail. The background jobs ran fine. The incident was at the edge — a reverse proxy misconfiguration, a stale DNS cache, a TLS certificate renewal that failed silently — and the application was unreachable for 40 minutes while the team debugged the wrong thing.
Engineers who spend most of their time in application code tend to underestimate the edge layer. It's infrastructure. Someone else set it up. It runs without attention for months. And then it fails in ways that are silent, confusing, and difficult to debug without specific knowledge of how that layer works.
This post covers what the edge layer is responsible for, where it fails, and how to operate it safely.
I – What the Reverse Proxy Is Actually Doing
A reverse proxy sits in front of your application server. Clients talk to the proxy. The proxy talks to your application. The proxy is responsible for far more than people realize.
TLS termination. HTTPS connections end at the proxy. The proxy decrypts them and forwards plaintext to the application. The application doesn't need to handle TLS.
Request routing. Path-based or subdomain-based routing to the right backend service. /api/* routes to the API service. / routes to the web frontend.
Header forwarding. The proxy must forward the original client IP and protocol in headers the application can read. X-Forwarded-For: 203.0.113.42 tells the application the real client IP. X-Forwarded-Proto: https tells the application the original protocol.
Rate limiting and request filtering. Coarse-grained rate limiting at the network layer, before requests hit application code.
HTTP-to-HTTPS redirect. Plain HTTP requests are redirected to HTTPS before they reach the application.
If the reverse proxy doesn't forward headers correctly, your application will see the proxy's IP as the client IP. Rate limiting by IP will rate-limit the proxy, not the real client. Access logs will show one IP for all requests. Authentication systems that use IP as a secondary factor will fail. These bugs are easy to introduce and difficult to notice until something goes wrong.
II – Header Forwarding Correctness
The four headers that matter:
# nginx configuration
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header X-Forwarded-Host $host;
proxy_set_header Host $host;
X-Forwarded-For is a comma-separated list. If the client is 203.0.113.42 and the proxy is 10.0.0.1, the header value is 203.0.113.42, 10.0.0.1. By convention the first IP in the list is the original client, but as the next paragraph explains, trusting it blindly is unsafe.
$proxy_add_x_forwarded_for in nginx appends the client IP to any existing X-Forwarded-For header from an upstream proxy. This is correct for a single-proxy setup. For multi-layer proxy setups, understand the full IP chain before configuring this.
The security implication: X-Forwarded-For can be spoofed by the client. A client can send X-Forwarded-For: 1.2.3.4 and the proxy will append the real client IP: 1.2.3.4, 203.0.113.42. If your application reads X-Forwarded-For and uses the first IP for authentication or access control, it's using a client-controlled value.
The safe pattern: trust X-Forwarded-For only when all entries in the list come from trusted proxies. For a single-proxy setup, the application should read X-Real-IP, which the proxy sets from the direct TCP connection (proxy_set_header X-Real-IP $remote_addr; in nginx), rather than the full chain.
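As a sketch, the safe pattern might look like this in application code. The header names match the discussion above; `trusted_proxies` is a hypothetical allowlist of your own proxy IPs, and the fallback walk from the right end of the chain is the standard hardening for multi-proxy setups, not something prescribed by nginx itself:

```python
def client_ip(headers: dict, trusted_proxies: set) -> str:
    """Resolve the real client IP from proxy headers.

    Prefers X-Real-IP, which the proxy sets from the direct TCP
    connection and which the client therefore cannot spoof. Falls
    back to walking X-Forwarded-For from the right, skipping
    trusted proxy hops, so client-prepended entries are ignored.
    """
    real_ip = headers.get("X-Real-IP")
    if real_ip:
        return real_ip.strip()

    # Rightmost entry not belonging to a trusted proxy is the client;
    # anything to its left is client-controlled and untrustworthy.
    xff = headers.get("X-Forwarded-For", "")
    hops = [h.strip() for h in xff.split(",") if h.strip()]
    for hop in reversed(hops):
        if hop not in trusted_proxies:
            return hop
    raise ValueError("no untrusted hop found in X-Forwarded-For")
```

With the spoofing example from above, the client-sent 1.2.3.4 is ignored and the proxy-appended real IP wins: `client_ip({"X-Forwarded-For": "1.2.3.4, 203.0.113.42, 10.0.0.1"}, {"10.0.0.1"})` returns `203.0.113.42`.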
III – Path-Based vs Subdomain Routing
Two routing strategies. Different operational characteristics.
Path-based routing: api.example.com/v1/memories and api.example.com/v1/projects route to the same service. The path distinguishes the resource.
Subdomain routing: api.example.com routes to the API service. cdn.example.com routes to the CDN. Subdomains separate services.
For most API platforms, subdomain routing is cleaner. It separates concerns at the DNS level, which means you can route different services to different infrastructure without changing application routing logic. api.example.com and web.example.com can run on completely different servers with different TLS certificates and different reverse proxy configurations.
Path-based routing for the same service makes sense when you're running multiple API versions: api.example.com/v1/ and api.example.com/v2/ can route to different application versions during a migration.
The practical rule: use subdomains to separate distinct services. Use paths to version or namespace within a service.
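The rule can be sketched as a small dispatch table: host first, then longest matching path prefix. The hostnames and backend names here are illustrative, not tied to any particular proxy:

```python
# Hypothetical routing table: subdomains separate services,
# path prefixes version within a service.
ROUTES = {
    "api.example.com": {
        "/v1/": "api-v1-backend",
        "/v2/": "api-v2-backend",
    },
    "web.example.com": {
        "/": "web-backend",
    },
}

def pick_backend(host: str, path: str) -> str:
    """Choose a backend by host first, then longest matching path prefix."""
    paths = ROUTES.get(host)
    if paths is None:
        raise LookupError(f"no service for host {host}")
    # Longest prefix wins, so /v2/ beats / when both match.
    for prefix in sorted(paths, key=len, reverse=True):
        if path.startswith(prefix):
            return paths[prefix]
    raise LookupError(f"no route for {host}{path}")
```

Note that the host lookup happens before any path matching: moving api.example.com to different infrastructure never touches the web.example.com entry, which is the decoupling argument made above.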
IV – DNS Design
DNS changes propagate slowly and cannot be undone quickly.
Two principles that save outages:
Low TTL before changes, high TTL for stable records. A DNS record with a 3600-second TTL will be cached for an hour after you change it. Clients that cached the old record will continue hitting the old server for up to an hour. Before making a DNS change, lower the TTL to 60 seconds and wait for the old TTL to expire. Then make the change. Now caches expire in 60 seconds instead of 3600.
After the change is stable, raise the TTL back to 3600. Low TTL means more DNS queries, which means more DNS cost and slightly higher latency on first connection. High TTL for stable records is the right default.
Separation of API and web DNS. Your API DNS (api.example.com) and your web DNS (www.example.com) should be independently changeable. If they share an IP address or a CNAME chain, changing one changes both. That's a coupling that will cause you to make mistakes when you need to move one service without moving the other.
V – TLS Automation and Renewal Safety
TLS certificates expire. If the certificate expires and isn't renewed, your API is unreachable for clients that enforce certificate validity (all of them).
Let's Encrypt with certbot (or equivalent) automates certificate issuance and renewal. The automation is reliable — until it isn't.
The failure modes of certificate automation:
Renewal race with web server. certbot's HTTP-01 challenge requires serving a file at /.well-known/acme-challenge/. If your nginx config doesn't have a rule for this path, the challenge fails, the renewal fails, and the certificate expires.
# Required for Let's Encrypt HTTP-01 challenge
location /.well-known/acme-challenge/ {
root /var/www/certbot;
}
Silent renewal failure. certbot runs in a cron job. The cron job fails silently (wrong user, wrong path, Python error). Nobody notices until the certificate expires and alerts fire. Monitor the certificate expiry date externally: set a monitoring check that alerts at 14 days before expiry. This gives you two weeks to investigate and fix a silent renewal failure before it becomes an outage.
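A minimal external expiry check might look like the following, using only Python's standard library. The hostname is illustrative and the 14-day threshold matches the alert window above; a real deployment would run this from a monitoring system, not the server being monitored:

```python
import socket
import ssl
import time

ALERT_THRESHOLD_DAYS = 14  # alert two weeks before expiry

def fetch_not_after(host: str, port: int = 443) -> str:
    """Fetch the notAfter field of a host's live certificate."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]

def days_until_expiry(not_after: str) -> int:
    """Days from now until the cert expires; negative if already expired.

    not_after is in getpeercert() format, e.g. 'Jun 1 12:00:00 2025 GMT'.
    """
    expiry_ts = ssl.cert_time_to_seconds(not_after)
    return int((expiry_ts - time.time()) // 86400)

# Usage (requires network):
#   if days_until_expiry(fetch_not_after("api.example.com")) < ALERT_THRESHOLD_DAYS:
#       alert on-call
```

The key property is that this check handshakes with the live endpoint, so it catches every renewal failure mode, including the cron job that dies silently.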
.dev domain implications. .dev domains are HSTS-preloaded: browsers enforce HTTPS for them with no exceptions, so nothing can be served to users over plain HTTP. Note that the ACME validation servers are not browsers and do not honor HSTS, so HTTP-01 can still succeed if your server listens on port 80; but many .dev setups do not serve HTTP at all. In that case, use the DNS-01 challenge instead. DNS-01 proves domain ownership by creating a DNS TXT record, which doesn't require your web server to be reachable over HTTP.
VI – HSTS and Its Permanence
HSTS (HTTP Strict Transport Security) tells browsers to always use HTTPS for your domain, even if the user types http://. This is a security improvement. It's also a commitment with consequences.
The max-age directive controls how long browsers remember the HSTS policy:
Strict-Transport-Security: max-age=31536000; includeSubDomains
max-age=31536000 is one year. If you serve this header and then your HTTPS stops working (certificate expires, TLS misconfiguration), users cannot reach your site over HTTP as a fallback. For the max-age duration, they will only attempt HTTPS.
Start with a short max-age (86400, one day). Verify your HTTPS configuration is stable. Then increase to 31536000.
includeSubDomains means the policy applies to all subdomains. Don't add this until you're confident every subdomain supports HTTPS correctly.
HSTS preloading is an additional step where your domain is hardcoded into browsers. This is permanent — removing your domain from the preload list takes months and doesn't affect users who already have the cached list. Only submit for preloading once you are certain you will always want to enforce HTTPS.
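The ramp-up can be made explicit in code instead of hand-edited headers. A small sketch; the constants mirror the values discussed above:

```python
# Hypothetical HSTS ramp: one day first, then a year once HTTPS is stable.
RAMP = [86400, 31536000]

def hsts_header(max_age: int, include_subdomains: bool = False) -> str:
    """Build a Strict-Transport-Security header value."""
    value = f"max-age={max_age}"
    if include_subdomains:
        # Only safe once every subdomain serves HTTPS correctly.
        value += "; includeSubDomains"
    return value
```

For example, `hsts_header(86400)` yields `max-age=86400` for the trial period, and `hsts_header(31536000, include_subdomains=True)` yields the full one-year policy.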
VII – What Breaks First
DNS misrouting and stale caches. You changed the DNS record for api.example.com to point to the new server. Requests are failing. The TTL was 3600 seconds and you changed it 10 minutes ago. Clients that cached the record are still hitting the old server. Fix: lower TTL before changes, verify cache expiry before changing, keep the old server alive until the TTL has fully expired.
Cert issuance timing failures. You're setting up a new subdomain and provisioning a certificate. The DNS change to point the subdomain to your server hasn't fully propagated yet. The Let's Encrypt HTTP-01 challenge fails because the subdomain still points to the old server. Fix: verify DNS propagation before triggering certificate issuance. dig api.example.com from multiple resolvers. Wait until all resolvers agree before initiating.
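The "wait until all resolvers agree" step is easy to script. Here is a sketch of the comparison only; gathering the answers (for example with dig against each public resolver) is left outside the function:

```python
def resolvers_agree(answers: dict) -> bool:
    """True when every resolver returned the same, non-empty record set.

    `answers` maps resolver name -> set of A-record IPs it returned,
    e.g. collected with `dig @8.8.8.8 api.example.com +short`.
    """
    record_sets = list(answers.values())
    if not record_sets or not record_sets[0]:
        return False
    return all(rs == record_sets[0] for rs in record_sets)
```

Gate certificate issuance on this returning True: only trigger the HTTP-01 challenge once every resolver you check returns the new address.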
Misconfigured proxy headers breaking auth or URLs. The reverse proxy is not forwarding X-Forwarded-Proto. The application constructs callback URLs and generates http:// links instead of https://. OAuth redirects fail. Webhook delivery fails. Fix: test header forwarding explicitly. A test endpoint that echoes back headers is useful during setup. GET /echo-headers → response body contains all received headers.
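A throwaway echo endpoint of the kind described is a few lines with Python's standard library. Run it as the upstream during setup, then curl it through the proxy; the /echo-headers path matches the example above:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class EchoHeaders(BaseHTTPRequestHandler):
    """Echo received headers back as JSON, to verify proxy forwarding."""

    def do_GET(self):
        if self.path != "/echo-headers":
            self.send_error(404)
            return
        body = json.dumps(dict(self.headers)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# Usage: HTTPServer(("127.0.0.1", 8080), EchoHeaders).serve_forever()
# then through the proxy: curl https://api.example.com/echo-headers
```

If X-Forwarded-Proto or X-Forwarded-For is missing from the echoed JSON, the proxy config is wrong; no need to reproduce the bug through OAuth or webhooks first.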
Edge Change Checklist
- DNS TTL lowered to 60s at least 1 TTL cycle before any DNS change
- X-Forwarded-For, X-Forwarded-Proto, X-Forwarded-Host forwarded correctly
- /.well-known/acme-challenge/ served correctly for Let's Encrypt
- Certificate expiry monitored with 14-day alert
- HSTS max-age started at 86400 before increasing
- Old routes kept alive for 24h after DNS cutover
DNS Cutover Playbook
- Lower TTL to 60s on source record. Wait for old_ttl seconds.
- Add new DNS record pointing to new target.
- Verify new record resolves from multiple DNS resolvers.
- Monitor error rates on old target — should approach zero as caches expire.
- After 10x the new TTL with zero traffic on old target: decommission old target.
- Raise TTL to 3600s.
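The waiting periods in the playbook are simple arithmetic, and encoding them keeps everyone honest about when each step is actually safe. A sketch:

```python
from datetime import datetime, timedelta

def cutover_milestones(start: datetime, old_ttl: int, new_ttl: int = 60) -> dict:
    """Key moments in a DNS cutover, following the playbook above.

    start    -- when the TTL is lowered on the source record
    old_ttl  -- the TTL in effect before lowering (seconds)
    new_ttl  -- the lowered TTL (seconds)
    """
    # Resolvers that cached the record just before the TTL was lowered
    # keep it for the *old* TTL, so the record must not change before this.
    change = start + timedelta(seconds=old_ttl)
    # After the change, all caches of the old answer expire within new_ttl.
    caches_clear = change + timedelta(seconds=new_ttl)
    # Decommission only after 10x the new TTL with zero old-target traffic.
    decommission = change + timedelta(seconds=10 * new_ttl)
    return {"change": change, "caches_clear": caches_clear,
            "decommission": decommission}
```

For a record with a 3600s TTL lowered at noon, the earliest safe change is 13:00 and the earliest decommission is 13:10.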
The edge layer is the part of your infrastructure that users see when everything else is working. Operate it carefully.