Deduplication
Content-Defined Deduplication
DodaZIP’s killer feature is content-defined deduplication: identical chunks of data are stored only once, even when they appear in different files.
How It Works
DodaZIP uses FastCDC (Fast Content-Defined Chunking), which uses a rolling hash to split files into variable-sized chunks at content-defined boundaries. This means:
- Chunk boundaries depend on content, not file offsets. Inserting a byte at the start of a file typically changes only the first chunk. Later boundaries fall at the same positions in the content, so the remaining chunks are byte-identical and hash to the same values.
- Identical chunks are detected across files. If two files contain the same data (e.g., shared libraries, repeated log entries, duplicate blocks in VM images), those chunks are stored once.
- Sub-file deduplication. Unlike file-level dedup, chunk-level dedup catches redundancy even within a single file (e.g., repeated sections, embedded resources).
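The boundary-stability property above can be illustrated with a minimal sketch. This is not DodaZIP's implementation and not full FastCDC (it omits normalized chunking, and the sizes are arbitrary); it only assumes a Gear-style rolling hash, the kind of hash FastCDC builds on. The names `chunk`, `GEAR`, and all parameters are hypothetical.

```python
import random

# Gear table: one pseudo-random 64-bit value per byte value (seeded for reproducibility).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def chunk(data: bytes, min_size: int = 64, avg_bits: int = 8, max_size: int = 1024) -> list:
    """Split data at content-defined boundaries using a Gear rolling hash.

    Illustrative only: a cut is made when the top `avg_bits` bits of the
    hash are zero, subject to minimum and maximum chunk sizes.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Each shift pushes older bytes out, so h depends on the last 64 bytes only.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        at_boundary = (h >> (64 - avg_bits)) == 0
        if (length >= min_size and at_boundary) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


# Demo: insert one byte at the start and count how many chunks are unchanged.
data_rng = random.Random(1)
original = bytes(data_rng.getrandbits(8) for _ in range(50_000))
edited = b"\x00" + original

before, after = chunk(original), chunk(edited)
shared = set(before) & set(after)
print(f"{len(shared)} of {len(set(before))} distinct chunks are unchanged")
```

With fixed-offset chunking, the same one-byte insertion would shift every block and leave nothing to deduplicate. Here the chunker resynchronizes as soon as both versions cut at the same position in the content.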
The Chunk Table
The chunk table is a hash map from BLAKE3 hash → chunk metadata. Each file entry then lists its chunks by hash:
{
  "chunks": [
    {
      "hash": "b3a1f2...",
      "offset": 4096,
      "compressed_size": 2048,
      "uncompressed_size": 8192,
      "codec": "zstd",
      "refcount": 3
    }
  ],
  "files": [
    {
      "path": "backup-2024-01-01.sql",
      "chunks": ["b3a1f2...", "c7d8e9..."]
    }
  ]
}
When Dedup Shines
| Scenario | Typical Savings |
|---|---|
| VM images (multiple snapshots) | 80–95% |
| Source code repositories | 50–70% |
| Database backups (daily dumps) | 90–99% |
| Log files with repeated entries | 60–80% |
| Mixed binary/text collections | 30–50% |
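Savings of this kind follow from the reference counts in the chunk table. The sketch below is a rough illustration, not DodaZIP's code: it reuses the `refcount` and `uncompressed_size` field names from the example table, substitutes BLAKE2b from the Python standard library for BLAKE3, and uses hypothetical helper names (`add_file`, `savings`) and made-up file contents.

```python
import hashlib


def add_file(table: dict, files: dict, path: str, chunks: list) -> None:
    """Record a file's chunks, storing each distinct chunk only once."""
    ids = []
    for c in chunks:
        # Stand-in digest: BLAKE3 is not in the Python standard library.
        digest = hashlib.blake2b(c, digest_size=32).hexdigest()
        entry = table.setdefault(digest, {"uncompressed_size": len(c), "refcount": 0})
        entry["refcount"] += 1
        ids.append(digest)
    files[path] = ids


def savings(table: dict) -> float:
    """Fraction of logical bytes that dedup avoided storing (before compression)."""
    logical = sum(e["uncompressed_size"] * e["refcount"] for e in table.values())
    stored = sum(e["uncompressed_size"] for e in table.values())
    return 1 - stored / logical if logical else 0.0


# Two hypothetical daily dumps that differ in one of four 100-byte chunks.
table, files = {}, {}
day1 = [b"A" * 100, b"B" * 100, b"C" * 100, b"D" * 100]
day2 = [b"A" * 100, b"B" * 100, b"C" * 100, b"E" * 100]
add_file(table, files, "backup-2024-01-01.sql", day1)
add_file(table, files, "backup-2024-01-02.sql", day2)

ratio = savings(table)
print(f"{len(table)} chunks stored, savings {ratio:.1%}")  # 5 chunks stored, savings 37.5%
```

With only two snapshots the saving is modest. The ratio rises as more mostly identical snapshots are added, which is why repeated backups and VM snapshots sit at the top of the table.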
Limitations
- Deduplication is memory-intensive for large archives. By default, the chunk table must fit in RAM during compression. For archives with billions of chunks, DodaZIP provides a streaming mode with bounded memory usage.
- Dedup does not help with already-compressed data. Compressed files (JPEG, MP4, ZIP) have high entropy and few duplicate chunks. DodaZIP detects this automatically and skips dedup for incompressible content.
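How DodaZIP detects incompressible content is not described here. One common heuristic, shown purely as an assumption, is to estimate the Shannon entropy of a sample's byte histogram and treat values close to 8 bits per byte as already compressed. The function names and the 7.5 threshold are hypothetical.

```python
import math
import os
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def looks_incompressible(sample: bytes, threshold: float = 7.5) -> bool:
    """Heuristic: near-maximal entropy suggests compressed or encrypted input."""
    return byte_entropy(sample) >= threshold


text_sample = b"2024-01-01 12:00:00 INFO request served\n" * 200
random_sample = os.urandom(65536)  # stands in for JPEG/MP4/ZIP payload bytes

print(looks_incompressible(text_sample))    # False
print(looks_incompressible(random_sample))  # True
```

A byte histogram is cheap but crude: it ignores the order of bytes, so a real detector might instead trial-compress a small sample.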
Verification
Every chunk is verified by BLAKE3 hash on extraction. If ECC is enabled, corrupted chunks can be repaired automatically.
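The verification step can be sketched as recomputing each chunk's digest and comparing it with the hash recorded in the chunk table. This is a minimal illustration under stated assumptions: BLAKE2b from the Python standard library stands in for BLAKE3, the names `verify_chunk` and `ChunkCorruptedError` are hypothetical, and ECC repair is not modeled.

```python
import hashlib


class ChunkCorruptedError(Exception):
    """Raised when a chunk's content does not match its recorded hash."""


def chunk_digest(chunk: bytes) -> str:
    # Stand-in for BLAKE3, which is not in the Python standard library.
    return hashlib.blake2b(chunk, digest_size=32).hexdigest()


def verify_chunk(chunk: bytes, expected_hash: str) -> bytes:
    """Return the chunk unchanged if its digest matches, otherwise raise."""
    actual = chunk_digest(chunk)
    if actual != expected_hash:
        raise ChunkCorruptedError(f"expected {expected_hash[:12]}, got {actual[:12]}")
    return chunk


good = b"INSERT INTO users VALUES (1, 'alice');"
recorded = chunk_digest(good)           # hash stored at compression time
verify_chunk(good, recorded)            # passes silently

damaged = b"INSERT INTO users VALUES (1, 'alicf');"  # one corrupted byte
try:
    verify_chunk(damaged, recorded)
    detected = False
except ChunkCorruptedError:
    detected = True
print("corruption detected:", detected)  # corruption detected: True
```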