Content-Defined Deduplication

DodaZIP’s killer feature is content-defined deduplication — the ability to store identical chunks of data only once, even if they appear in different files.

How It Works

DodaZIP uses FastCDC (Fast Content-Defined Chunking), a chunking algorithm that runs a rolling hash (the Gear hash) over the data and splits files into variable-sized chunks wherever the hash meets a boundary condition. This means:

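Chunk boundaries depend on content, not file offsets. Inserting a byte at the start of a file changes only the first chunk (occasionally the first few). Later boundaries fall at the same content positions, so the remaining chunks are byte-identical and hash to the same values.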
Identical chunks are detected across files. If two files contain the same data (e.g., shared libraries, repeated log entries, duplicate blocks in VM images), those chunks are stored once.
Sub-file deduplication. Unlike file-level dedup, chunk-level dedup catches redundancy even within a single file (e.g., repeated sections, embedded resources).
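The boundary behavior described above can be illustrated with a small sketch. This is not DodaZIP's implementation and not full FastCDC (it omits normalized chunking); it is a simplified gear-hash chunker with made-up parameters (`MIN_SIZE`, `MAX_SIZE`, `MASK`), and it uses BLAKE2b from the Python standard library as a stand-in for BLAKE3. It compares how many chunks survive a one-byte insertion under content-defined versus fixed-size chunking.

```python
import hashlib
import random

# Illustrative parameters only; DodaZIP's actual chunk sizes are not stated here.
MIN_SIZE = 64          # never cut a chunk shorter than this
MAX_SIZE = 1024        # force a cut at this length
MASK = (1 << 8) - 1    # cut when the low 8 hash bits are all zero

# Gear table: one fixed pseudo-random 64-bit value per possible byte value.
_gear_rng = random.Random(0)
GEAR = [_gear_rng.getrandbits(64) for _ in range(256)]


def cdc_chunks(data: bytes) -> list:
    """Split data where a gear rolling hash of the most recent bytes hits the mask."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Each step shifts old bytes out of the low bits, so the low 8 bits
        # depend only on the last 8 bytes of content, not on the file offset.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def fixed_chunks(data: bytes, size: int = 256) -> list:
    """Offset-based chunking, shown only for contrast."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digest(chunk: bytes) -> str:
    # Stand-in hash: BLAKE2b (stdlib). DodaZIP itself uses BLAKE3.
    return hashlib.blake2b(chunk, digest_size=32).hexdigest()


def shared_chunks(a: list, b: list) -> int:
    """Number of distinct chunk hashes present in both chunk lists."""
    return len({digest(c) for c in a} & {digest(c) for c in b})


if __name__ == "__main__":
    rng = random.Random(1)
    data = bytes(rng.getrandbits(8) for _ in range(1 << 16))
    shifted = b"\x00" + data  # insert one byte at the start

    cdc_a, cdc_b = cdc_chunks(data), cdc_chunks(shifted)
    print("content-defined:", shared_chunks(cdc_a, cdc_b), "of", len(cdc_a), "chunks shared")
    fix_a, fix_b = fixed_chunks(data), fixed_chunks(shifted)
    print("fixed-size:     ", shared_chunks(fix_a, fix_b), "of", len(fix_a), "chunks shared")
```

With content-defined boundaries, nearly all chunks should be shared after the insertion, while fixed-size chunking shifts every block and shares essentially none.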

The Chunk Table

The chunk table is a hash map from BLAKE3 hash to chunk metadata. Each file is then recorded as an ordered list of chunk hashes that point into the table:

{
  "chunks": [
    {
      "hash": "b3a1f2...",
      "offset": 4096,
      "compressed_size": 2048,
      "uncompressed_size": 8192,
      "codec": "zstd",
      "refcount": 3
    }
  ],
  "files": [
    {
      "path": "backup-2024-01-01.sql",
      "chunks": ["b3a1f2...", "c7d8e9..."]
    }
  ]
}
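As a rough illustration of how such a table could be populated, the sketch below stores each unique chunk once and bumps `refcount` on every further reference. The `ChunkStore` class and its methods are hypothetical names, not DodaZIP's API; BLAKE2b and zlib from the Python standard library stand in for BLAKE3 and zstd, and the input is assumed to be already split into chunks.

```python
import hashlib
import zlib


def chunk_id(chunk: bytes) -> str:
    # Stand-in hash: BLAKE2b (stdlib). DodaZIP itself uses BLAKE3.
    return hashlib.blake2b(chunk, digest_size=32).hexdigest()


class ChunkStore:
    """Hypothetical in-memory archive: chunk table, payload blob, file manifests."""

    def __init__(self):
        self.table = {}          # hash -> chunk metadata
        self.blob = bytearray()  # concatenated compressed chunk payloads
        self.files = []          # one manifest per file

    def add_file(self, path: str, chunks: list) -> None:
        ids = []
        for chunk in chunks:
            cid = chunk_id(chunk)
            entry = self.table.get(cid)
            if entry is None:
                # First time this content is seen: compress and store it.
                payload = zlib.compress(chunk)  # stand-in for zstd
                entry = {
                    "hash": cid,
                    "offset": len(self.blob),
                    "compressed_size": len(payload),
                    "uncompressed_size": len(chunk),
                    "codec": "zlib",
                    "refcount": 0,
                }
                self.blob += payload
                self.table[cid] = entry
            entry["refcount"] += 1  # duplicates only add a reference
            ids.append(cid)
        self.files.append({"path": path, "chunks": ids})

    def read_file(self, index: int) -> bytes:
        """Rebuild a file by following its chunk list through the table."""
        out = bytearray()
        for cid in self.files[index]["chunks"]:
            e = self.table[cid]
            payload = bytes(self.blob[e["offset"]:e["offset"] + e["compressed_size"]])
            out += zlib.decompress(payload)
        return bytes(out)


if __name__ == "__main__":
    a, b, c = b"A" * 100, b"B" * 100, b"C" * 100
    store = ChunkStore()
    store.add_file("one.bin", [a, b])
    store.add_file("two.bin", [b, c, b])
    print(len(store.table), "unique chunks;",
          "refcount of shared chunk:", store.table[chunk_id(b)]["refcount"])
```

In this example the chunk `b` appears three times across two files but is compressed and stored once, which is the situation the `"refcount": 3` entry above describes.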

When Dedup Shines

Scenario	Typical Savings
VM images (multiple snapshots)	80–95%
Source code repositories	50–70%
Database backups (daily dumps)	90–99%
Log files with repeated entries	60–80%
Mixed binary/text collections	30–50%

Limitations

Deduplication is memory-intensive for large archives. The chunk table is held in RAM during compression, so its size grows with the number of unique chunks. For archives with billions of chunks, DodaZIP provides a streaming mode with bounded memory usage.
Dedup rarely helps with already-compressed data. Compressed files (JPEG, MP4, ZIP) have high entropy, and a small change in the source produces very different compressed bytes, so they share few chunks. DodaZIP detects this automatically and skips dedup for incompressible content.
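To get a feel for the memory cost, a back-of-the-envelope estimate helps. The figures below are assumptions for illustration, not DodaZIP measurements: an 8 KiB average chunk size (matching the `uncompressed_size` in the example above) and 64 bytes per table entry (a 32-byte hash plus roughly 32 bytes of metadata), with every chunk assumed unique.

```python
KiB, GiB, TiB = 1 << 10, 1 << 30, 1 << 40


def chunk_table_bytes(archive_bytes: int,
                      avg_chunk_bytes: int = 8 * KiB,  # assumed average chunk size
                      entry_bytes: int = 64) -> int:   # assumed per-entry footprint
    """Worst-case table size: every chunk unique, so one entry per chunk."""
    n_chunks = -(-archive_bytes // avg_chunk_bytes)  # ceiling division
    return n_chunks * entry_bytes


if __name__ == "__main__":
    size = chunk_table_bytes(1 * TiB)
    print(f"1 TiB of unique data -> {size / GiB:.0f} GiB of chunk table")
    # prints: 1 TiB of unique data -> 8 GiB of chunk table
```

Under these assumptions, a billion chunks would need on the order of 64 GiB, which is the regime where a bounded-memory streaming mode becomes necessary.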

Verification

Every chunk is verified by BLAKE3 hash on extraction. If ECC is enabled, corrupted chunks can be repaired automatically.
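A minimal sketch of the per-chunk check, assuming the expected hash comes from the chunk table: recompute the hash of the decompressed chunk and compare. The names `verify_chunk` and `ChunkCorrupted` are hypothetical, BLAKE2b stands in for BLAKE3 (which is not in the Python standard library), and ECC repair is not shown.

```python
import hashlib


class ChunkCorrupted(Exception):
    """Raised when a chunk's content does not match its recorded hash."""


def chunk_hash(data: bytes) -> str:
    # Stand-in hash: BLAKE2b (stdlib). DodaZIP itself uses BLAKE3.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def verify_chunk(expected_hash: str, data: bytes) -> bytes:
    """Return data unchanged if its hash matches, otherwise raise ChunkCorrupted."""
    actual = chunk_hash(data)
    if actual != expected_hash:
        raise ChunkCorrupted(f"expected {expected_hash[:12]}, got {actual[:12]}")
    return data


if __name__ == "__main__":
    good = b"some chunk payload"
    recorded = chunk_hash(good)
    verify_chunk(recorded, good)  # passes silently

    damaged = b"some chunk pay1oad"  # one byte changed
    try:
        verify_chunk(recorded, damaged)
    except ChunkCorrupted as err:
        print("corruption detected:", err)
```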