Tags: storage, backend, architecture, cost

You Are Storing the Same File Dozens of Times and Don't Know It

Content-addressable deduplication and integrity checks for API platforms using object storage

Every API platform that accepts file or blob uploads has the same problem.

Users upload the same content repeatedly. A document gets re-uploaded with minor formatting changes. An image is attached to multiple records. A shared configuration file gets stored once per project that references it. Each upload is a new object in S3. Each object has a storage cost. The costs are small individually. They compound.

After six months, 30-40% of your stored objects are duplicates. Your storage bill is substantially higher than it needs to be. Worse, if any of those objects need to be updated — say, you discover a security issue in a file type you were storing — you have to find and update every copy, rather than one canonical version.

Content-addressable storage solves this. One canonical copy. One key. Every reference points to it. Deduplication is automatic. Integrity verification is free. Storage costs reflect actual unique content, not upload count.

This post covers how to build it correctly.

I – The Core Idea: Content as Key

Traditional object storage uses user-specified keys: uploads/user_abc/document.pdf. The key is arbitrary. Two identical files get two different keys. No connection between them.

Content-addressable storage uses the content hash as the key: sha256:a3f1b2c3d4e5f6.... Two identical files produce the same hash, and therefore the same key. One storage operation for many uploads.

The hash serves two purposes: deduplication (same hash = same object, no need to store again) and integrity verification (after retrieval, recompute the hash and compare to verify the object hasn't been corrupted or tampered with).

The hash is computed once, at upload time. It is stored in your database as part of the artifact record. It never changes.
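As a minimal sketch (the `contentKey` helper name is mine, not from any particular codebase), deriving the key is a one-liner over the raw bytes:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// contentKey derives the storage key from the raw content bytes.
// Identical content always yields the same key, regardless of filename.
func contentKey(content []byte) string {
	sum := sha256.Sum256(content)
	return fmt.Sprintf("sha256:%x", sum)
}

func main() {
	a := contentKey([]byte("hello"))
	b := contentKey([]byte("hello"))
	fmt.Println(a == b) // true: same content, same key
	fmt.Println(a)      // sha256:2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
}
```

Note that the filename never enters the computation: uploading `report.pdf` and `report_final.pdf` with identical bytes yields one key.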

II – The Deduplication Flow

The dedup check happens before the upload, not after.

func storeArtifact(content []byte, metadata ArtifactMeta) (ArtifactRecord, error) {
    hash := sha256.Sum256(content)
    key := fmt.Sprintf("sha256:%x", hash)

    // Check if this content already exists
    existing, err := db.GetArtifactByHash(key)
    if err == nil {
        // Already stored. Create a new database record pointing to the existing storage object.
        return db.CreateArtifactReference(existing.StorageKey, metadata)
    }
    if !errors.Is(err, ErrNotFound) {
        // A real database error — don't mistake it for "content not stored yet".
        // (ErrNotFound stands in for whatever sentinel your data layer returns for a missing row.)
        return ArtifactRecord{}, err
    }

    // New content. Upload to storage.
    if err := storage.Put(key, content); err != nil {
        return ArtifactRecord{}, err
    }

    return db.CreateArtifactRecord(key, len(content), metadata)
}

The flow is: hash the content, check the database, upload only if the hash is new. The database check is fast (indexed lookup on the hash column). The upload is skipped for duplicate content.

The artifact database record stores the storage key, the original filename, the content type, the size in bytes, the hash, and the upload timestamp. The storage key is the hash-based identifier. The original filename is preserved for display purposes only — it does not affect how the object is keyed or retrieved.
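The record described above can be sketched as a struct — field names here are illustrative, not prescriptive:

```go
package main

import (
	"fmt"
	"time"
)

// ArtifactRecord mirrors the database row described above.
// The storage key is the identifier; the filename is display-only.
type ArtifactRecord struct {
	StorageKey   string    // hash-based identifier, e.g. "sha256:..."
	OriginalName string    // preserved for display, never used for lookup
	ContentType  string    // e.g. "application/pdf"
	SizeBytes    int64
	SHA256       string
	UploadedAt   time.Time
}

func main() {
	rec := ArtifactRecord{
		StorageKey:   "sha256:2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824",
		OriginalName: "notes.txt",
		ContentType:  "text/plain",
		SizeBytes:    5,
	}
	fmt.Println(rec.StorageKey)
}
```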

III – Race Condition: Double Upload

The obvious race condition: two requests arrive simultaneously with the same content. Both hash the content. Both check the database — neither exists yet. Both upload. Both create database records. You now have two identical objects in storage and two database records pointing to the same hash.

The fix: handle the collision at the database level, not the application level.

The storage object table — one row per unique hash, shown in the next section — has a unique constraint on the storage key. The second insert fails with a unique constraint violation. Catch that error, query the existing record, and proceed.

ALTER TABLE storage_objects ADD CONSTRAINT uq_storage_objects_storage_key UNIQUE (storage_key);

In application code:

err = db.CreateArtifactRecord(key, size, metadata)
if isUniqueConstraintViolation(err) {
    existing = db.GetArtifactByHash(key)
    return db.CreateArtifactReference(existing.StorageKey, metadata)
}

The object was uploaded twice — there's a brief moment where the storage has a duplicate. The cleanup job (covered below) handles this. The database, however, is consistent: one record per unique hash.

IV – The Metadata Model

The storage layer has two distinct record types: the storage object itself, and the references to it.

-- One record per unique content hash
CREATE TABLE storage_objects (
  id          UUID PRIMARY KEY,
  storage_key TEXT UNIQUE NOT NULL,  -- sha256:abcdef...
  size_bytes  BIGINT NOT NULL,       -- BIGINT, not INTEGER: files can exceed 2 GB
  sha256_hash TEXT NOT NULL,
  uploaded_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  ref_count   INTEGER NOT NULL DEFAULT 0  -- denormalized reference count
);

-- One record per artifact (multiple can reference same storage_object)
CREATE TABLE artifacts (
  id              UUID PRIMARY KEY,
  storage_key     TEXT NOT NULL REFERENCES storage_objects(storage_key),
  project_id      TEXT NOT NULL,
  original_name   TEXT,
  content_type    TEXT,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  deleted_at      TIMESTAMPTZ
);

ref_count tracks how many artifact records point to each storage object. This is a denormalized counter maintained by triggers or application code. When ref_count reaches 0, the storage object is a candidate for deletion.

Never delete a storage object with ref_count > 0 — it is still referenced by at least one artifact. The order is: delete the artifact record first, then check whether ref_count has reached 0, and only then delete the storage object.
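The invariant — decrement and the "candidate for deletion" decision happen atomically — can be modeled in a few lines. This is an in-memory stand-in for the ref_count column (a mutex playing the role of the database transaction), not production code:

```go
package main

import (
	"fmt"
	"sync"
)

// refCounter is a toy model of the ref_count column: the decrement
// and the deletion-candidate check happen under one lock, just as
// they would happen inside one database transaction.
type refCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func (r *refCounter) addRef(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counts[key]++
}

// dropRef removes one reference and reports whether the object is
// now a deletion candidate (ref_count reached 0).
func (r *refCounter) dropRef(key string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counts[key]--
	return r.counts[key] == 0
}

func main() {
	rc := &refCounter{counts: map[string]int{}}
	rc.addRef("sha256:abc")
	rc.addRef("sha256:abc")               // two artifacts share one object
	fmt.Println(rc.dropRef("sha256:abc")) // false: still referenced
	fmt.Println(rc.dropRef("sha256:abc")) // true: candidate for deletion
}
```

In the real system the equivalent is an `UPDATE ... SET ref_count = ref_count - 1 RETURNING ref_count` executed in the same transaction as the artifact delete.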

V – Lifecycle Policies and Garbage Collection

Artifacts get deleted. Projects get deleted. When all artifact records pointing to a storage object have been deleted, the object in S3 should be deleted too.

This cleanup should not be done synchronously in the delete request handler. The S3 delete cannot participate in the database transaction, and a concurrent upload may be about to create a new reference to the same hash. The cleanup belongs in a background job.

The orphan cleanup job:

SELECT so.storage_key
FROM storage_objects so
WHERE so.ref_count = 0
  AND so.uploaded_at < NOW() - INTERVAL '1 hour'  -- grace period

The one-hour grace period matters. A storage object with ref_count = 0 might be in the process of having a new artifact reference created (the upload happened, the artifact record hasn't been committed yet due to a transaction in progress). The grace period prevents premature deletion. Note that it keys off uploaded_at, so it only protects recently uploaded objects; for older objects that drop to zero, re-check ref_count inside the deleting transaction before issuing the S3 delete.

For the actual S3 deletion, use lifecycle policies as a secondary mechanism. S3 supports transitioning objects to cheaper storage tiers (Intelligent Tiering, Glacier) after a configurable period. This is not a substitute for application-level cleanup — an object in Glacier still has a storage cost — but it reduces the cost of objects that your cleanup job missed.

VI – Integrity Verification

At rest, objects can be corrupted. S3 has an extremely low corruption rate, but not zero. For content that matters, verify integrity at retrieval time.

The verification flow:

func getArtifact(storageKey string) ([]byte, error) {
    content, err := storage.Get(storageKey)
    if err != nil {
        return nil, err
    }

    // Verify integrity
    hash := fmt.Sprintf("sha256:%x", sha256.Sum256(content))
    if hash != storageKey {
        emit("storage_integrity_failure", fields{"expected": storageKey, "actual": hash})
        return nil, errors.New("storage integrity check failed")
    }

    return content, nil
}

The storage key is the hash. Recompute the hash after retrieval. If they don't match, the object has been corrupted or tampered with. Log it. Alert. Do not return the corrupted content.

For performance-sensitive paths, verification can be done asynchronously (retrieve, return to user, verify in background). For security-sensitive content (anything you're making access control decisions based on), verification must be synchronous.

VII – What Breaks First

Duplicate writes due to race conditions. Two concurrent uploads of the same content. If you don't have a unique constraint on the storage key, you get duplicate database records and possibly duplicate storage objects. The unique constraint is not optional — it's the dedup guarantee.

Corrupted metadata links. An artifact record points to a storage key that no longer exists in storage. This happens when: the object was deleted but the artifact record wasn't, or the upload failed partway and the database record was created before confirming successful storage. Fix: always write the storage object before writing the artifact record. Never write the artifact record if the storage upload fails. And periodically run a consistency check: SELECT * FROM artifacts WHERE storage_key NOT IN (SELECT storage_key FROM storage_objects).

Orphan growth over time. Artifact records are deleted. The ref_count update fails silently. The garbage collection job doesn't find the orphaned objects because the ref_count didn't reach 0. After a year, you have tens of thousands of storage objects with ref_count = 1 pointing to soft-deleted artifacts. Fix: ref_count updates should be done in the same database transaction as the artifact deletion. If the transaction rolls back, the ref_count is unchanged.

Storage Key Naming Convention

sha256:{64-character-lowercase-hex-hash}

Example:

sha256:a3b4c5d6e7f8a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a1b2c3d4e5f6a7b8

Use this exact format. The sha256: prefix makes the hash algorithm explicit, making it safe to upgrade to SHA-3 or other algorithms in the future without ambiguity.

Never use the original filename as a storage key. Filenames are not unique, not immutable, and not integrity-verifiable. They belong in your database record as display metadata, not in your storage layer as identifiers.
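A cheap guard against filename-shaped keys leaking into the storage layer is to validate the convention at the boundary — a sketch using a regular expression:

```go
package main

import (
	"fmt"
	"regexp"
)

// keyPattern matches the exact convention: the algorithm prefix,
// a colon, then 64 lowercase hex characters.
var keyPattern = regexp.MustCompile(`^sha256:[0-9a-f]{64}$`)

func isValidStorageKey(key string) bool {
	return keyPattern.MatchString(key)
}

func main() {
	fmt.Println(isValidStorageKey("sha256:a3b4c5d6e7f8a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a1b2c3d4e5f6a7b8")) // true
	fmt.Println(isValidStorageKey("uploads/user_abc/document.pdf")) // false: a filename, not a content address
	fmt.Println(isValidStorageKey("SHA256:A3B4"))                   // false: wrong case and truncated
}
```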

Content-addressable storage is one of those architectural decisions that gets cheaper over time, not more expensive. Every duplicate upload you avoid is a cost you never pay. Every integrity violation you catch is data corruption you prevented. Build it correctly once.
