
You Can't Investigate What You Didn't Record

Audit and observability are data models first, dashboards second

At some point, you will need to answer a question about something that happened in the past.

A customer asks why they can't find a memory they stored three days ago. A security review asks who accessed a set of records between Tuesday and Thursday. An incident postmortem asks: what sequence of events led to this failure, and when did each one happen?

If you designed your logging and audit systems correctly, these questions take minutes to answer. You query the audit log, filter by actor and time range, and reconstruct the sequence of events.

If you didn't, these questions take hours. You're grepping through unstructured application logs, piecing together partial information from multiple sources, and you still can't be certain your reconstruction is complete.

The difference between these two outcomes is not how clever your incident response playbook is. It's whether you built the right data model before anything happened.

Audit and observability are data models first. Dashboards are secondary. This post is about the data model.

I – Audit vs Logs: Two Different Systems

Teams conflate audit events and application logs. They are different systems with different purposes, different retention policies, and different query patterns.

Application logs are operational: "the service started," "the request completed in 234ms," "the query returned 47 rows." They're high-volume, short-retention, and useful for debugging current behavior. They answer: what is happening right now?

Audit events are historical: "user X created record Y," "API key Z was revoked by admin A," "project P was accessed from IP Q." They're lower-volume, long-retention, and useful for forensics. They answer: what happened in the past, and who did it?

The test for whether something belongs in the audit log is this: does this event have a named actor who took a named action on a named resource, and do I need to be able to prove that it happened? If yes, it's an audit event.

Writes always belong in the audit log. Reads often belong in the audit log for sensitive resources. Infrastructure events (service restart, deployment) belong in application logs, not audit logs.

II – Actor Integrity

The most common audit system failure is actor spoofing.

The failure pattern: an audit event is created with the actor ID taken from the request body or URL parameters. A client that controls those values can set the actor to any ID they choose. The audit log shows user A performing an action, when user B actually performed it.

The rule: the actor in an audit event must come from the authenticated identity, not from any client-supplied field.

// Wrong
func handleDeleteMemory(ctx Context, req DeleteRequest) {
    db.CreateAuditEvent(AuditEvent{
        Actor:  req.Body.ActorId,  // client-controlled! Never do this
        Action: "memory.delete",
        Target: req.Params.MemoryId,
    })
}

// Correct
func handleDeleteMemory(ctx Context, req DeleteRequest) {
    db.CreateAuditEvent(AuditEvent{
        Actor:  ctx.Identity.Id,   // from authenticated token, not client input
        Action: "memory.delete",
        Target: req.Params.MemoryId,
    })
}

The ctx.Identity is set by authentication middleware, not by the request. It cannot be spoofed by a client. It is the only valid source for the actor field in an audit event.
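The same rule can be enforced structurally rather than by convention: give the audit constructor no actor parameter at all, so a caller cannot supply one even by mistake. A minimal sketch, reusing the AuditEvent shape from the snippet above (NewAuditEvent and the Identity struct are hypothetical names, not part of any prescribed API):

```go
package main

import "fmt"

// Identity is what authentication middleware resolves from the
// bearer token. Handlers receive it; they never construct it.
type Identity struct {
	ID   string
	Type string // "user" or "api_key"
}

type AuditEvent struct {
	Actor     string
	ActorType string
	Action    string
	Target    string
}

// NewAuditEvent takes the actor exclusively from the authenticated
// identity. There is no string parameter a caller could spoof.
func NewAuditEvent(id Identity, action, target string) AuditEvent {
	return AuditEvent{
		Actor:     id.ID,
		ActorType: id.Type,
		Action:    action,
		Target:    target,
	}
}

func main() {
	id := Identity{ID: "key_def456", Type: "api_key"}
	evt := NewAuditEvent(id, "memory.delete", "mem_jkl012")
	fmt.Println(evt.Actor) // key_def456
}
```

With this shape, "actor comes from the authenticated identity" stops being a code-review checklist item and becomes a property of the type signature.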

III – Audit Event Schema

Consistency in the audit event schema is what enables querying across time and across event types.

{
  "event_id":    "evt_abc123",
  "event_type":  "memory.created",
  "actor_id":    "key_def456",
  "actor_type":  "api_key",
  "project_id":  "proj_ghi789",
  "target_id":   "mem_jkl012",
  "target_type": "memory",
  "occurred_at": "2024-03-01T14:22:31.456Z",
  "request_id":  "req_mno345",
  "metadata": {
    "content_length": 247,
    "importance": "high",
    "source_ip": "203.0.113.42"
  }
}

Every field serves a purpose.

event_id: globally unique. Enables deduplication.

event_type: namespaced with dot notation (resource.action), e.g. memory.created, key.revoked, project.deleted. The namespace makes it easy to query all events for a resource type: WHERE event_type LIKE 'key.%'.

actor_id + actor_type: the combination identifies the actor completely. A user and an API key might share an ID in some systems. The type disambiguates.

target_id + target_type: the resource acted upon.

occurred_at: with millisecond precision. Timezone-aware (UTC).

request_id: ties the audit event to the corresponding application log entries. This is how you correlate "the audit event shows a delete" with "the application log shows a 403 before the delete succeeded."

metadata: flexible JSON for event-specific context. Does not include raw request bodies. Does not include credentials. Does not include PII beyond what's necessary for the audit purpose.

IV – Structured Logging

Application logs that are not structured are nearly useless at scale.

A plain-text log entry like:

2024-03-01 14:22:31 ERROR Failed to process recall request for project proj_ghi789: timeout after 5000ms

The useful information is all there, but it's buried in a string. You can't efficiently filter it, aggregate it, or join it with other data.

The same event, structured:

{
  "level":       "error",
  "message":     "recall request failed",
  "timestamp":   "2024-03-01T14:22:31.456Z",
  "request_id":  "req_mno345",
  "project_id":  "proj_ghi789",
  "operation":   "recall",
  "error":       "timeout",
  "duration_ms": 5000
}

Every field is a structured key-value pair. You can filter by project_id, group by operation, alert on error = "timeout", and join with audit events by request_id.

Use slog (Go), structlog (Python), or whatever JSON-first logger your language offers. The key fields that must appear in every log entry:

  • timestamp (UTC, millisecond precision)
  • level (error/warn/info/debug)
  • request_id (propagated from the request context)
  • project_id (if the log is in the context of a project operation)
  • message (human-readable, brief)

Additional fields are context-dependent. A database query log includes query_duration_ms and table. An authentication log includes actor_type and auth_method.

V – Correlation IDs and Request Tracing

Every request that enters the system gets a unique request ID. That ID travels through every log entry, every audit event, and every downstream call made during that request.

func RequestIDMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        requestID := r.Header.Get("X-Request-ID")
        if requestID == "" {
            requestID = generateRequestID()
        }

        ctx := context.WithValue(r.Context(), "request_id", requestID)
        w.Header().Set("X-Request-ID", requestID)

        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

The request ID is:

  • Accepted from incoming X-Request-ID header (if the client generated one)
  • Generated server-side if not provided
  • Attached to the response as X-Request-ID
  • Included in every log entry generated during the request
  • Included in every audit event generated during the request

When a customer reports an issue with their request, they can include the X-Request-ID from the response headers. You can query all logs and audit events for that ID instantly.

VI – Health Endpoints and What They Actually Tell You

A health endpoint that always returns 200 is theater.

The health endpoint's job is to tell a load balancer, a deployment system, or an on-call engineer whether this service is actually capable of serving requests right now.

A health endpoint that does that:

{
  "status": "healthy",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 3
    },
    "vector_index": {
      "status": "degraded",
      "latency_ms": 847,
      "message": "Vector query latency above SLO threshold (200ms)"
    },
    "embedding_provider": {
      "status": "healthy",
      "latency_ms": 42
    }
  },
  "version": "1.4.2",
  "uptime_seconds": 84721
}

The overall status is degraded if any check is degraded, unhealthy if any check is unhealthy. A load balancer can use degraded as a signal to stop routing new requests while existing ones drain.

The check includes latency. A dependency that responds in 5 seconds is not "healthy" — it's causing every request to be 5 seconds slower. The health check surfaces that.
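The rollup itself is a few lines: unhealthy beats degraded, degraded beats healthy. A sketch, assuming the per-dependency statuses have already been collected into a map (overallStatus is an illustrative name):

```go
package main

import "fmt"

// overallStatus aggregates per-dependency check results. Any
// unhealthy check makes the service unhealthy; otherwise any
// degraded check makes it degraded.
func overallStatus(checks map[string]string) string {
	overall := "healthy"
	for _, status := range checks {
		if status == "unhealthy" {
			return "unhealthy"
		}
		if status == "degraded" {
			overall = "degraded"
		}
	}
	return overall
}

func main() {
	checks := map[string]string{
		"database":           "healthy",
		"vector_index":       "degraded",
		"embedding_provider": "healthy",
	}
	fmt.Println(overallStatus(checks)) // degraded
}
```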

VII – What Breaks First

Actor spoofing in audit records. An integration that sets the actor from request body data has audit events that can be manipulated by any caller. This means your audit log is not actually auditable — it's unverified claims. Audit integrity requires server-side actor resolution, always.

Logs with missing context fields. A service is deployed. A bug occurs. The logs are searched. The relevant log entries don't have request_id or project_id. They can't be correlated with anything. The incident takes four hours longer to resolve than it should. Fix: define the required field set for every log level, and add a lint check that fails if a log statement is missing required fields.

Health checks that mask partial outages. The health endpoint checks database connectivity but not vector index performance. Vector queries start degrading at 2 second latency. The health check stays green. The load balancer keeps routing traffic. Customers experience 2 second latency on recalls. The SLO is being violated. Nobody knows. Fix: health checks must cover all components that affect user-facing SLOs.

Audit Event Catalog

Event Type         Trigger                      Actor Source
memory.created     POST /memories               ctx.Identity
memory.deleted     DELETE /memories/:id         ctx.Identity
key.created        POST /keys                   ctx.Identity
key.revoked        DELETE /keys/:id             ctx.Identity
project.accessed   Any authenticated request    ctx.Identity
auth.failed        Failed authentication        IP + attempted key prefix

Forensics Checklist

When an incident requires investigation:

  1. Identify the time window from the first anomaly signal
  2. Query audit events by occurred_at in the time window
  3. Filter by relevant target_id or actor_id
  4. For each audit event, query application logs by request_id
  5. Reconstruct the timeline from the combined events
  6. Identify the first event that was unexpected
  7. Trace backward from that event to understand what caused it
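Step 4, the join between audit events and application logs, is just a match on request_id. A sketch with trimmed types (real entries carry the full field sets from the earlier sections):

```go
package main

import "fmt"

// LogEntry and AuditEvent are reduced to the fields needed for the
// join; the full schemas appear earlier in the post.
type LogEntry struct {
	RequestID string
	Message   string
}

type AuditEvent struct {
	RequestID string
	EventType string
}

// logsForEvent returns every application log entry that shares the
// audit event's request_id, preserving log order.
func logsForEvent(evt AuditEvent, logs []LogEntry) []LogEntry {
	var out []LogEntry
	for _, l := range logs {
		if l.RequestID == evt.RequestID {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	logs := []LogEntry{
		{RequestID: "req_1", Message: "auth ok"},
		{RequestID: "req_2", Message: "recall started"},
		{RequestID: "req_1", Message: "memory deleted"},
	}
	evt := AuditEvent{RequestID: "req_1", EventType: "memory.deleted"}
	for _, l := range logsForEvent(evt, logs) {
		fmt.Println(l.Message)
	}
}
```

In practice this join runs as a query against your log store rather than in memory, but the shape is the same: one audit event fans out to the log entries that explain it.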

The quality of this reconstruction is entirely determined by the quality of the data model you built before the incident happened. Build it as if you'll need it. You will.
