Dupemerge
Authored by Zygo Blaxell
Background
Filesystems often contain identical copies of files. On Unix filesystems,
the storage space consumed by such redundancy can be reclaimed by replacing
the identical copies with hardlinks to a single file.
Here are four scripts which all do more or less the same thing: scan
a filesystem for identical files and replace them with hardlinks to
the same file.
Issues
There are some issues which all programs of this type must deal with.
File Attributes
What if two files have the same content, but different ownership or
access rights?
I have used two solutions to this problem. The first was to calculate
ownership and access rights that make sense for the merged file: usually
the union of the read and execute permissions and the intersection
(possibly empty) of the write permissions. A later solution was simply to
consider the ownership and access rights, in addition to the contents of
the file, when deciding whether two files are identical, thereby avoiding
hardlinking files with distinct access rights or ownership in the first
place.
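As a rough illustration of the first approach (the merged_mode helper and
the octal masks are mine, not code taken from any of the scripts below),
the merged permissions can be computed from the two stat() modes like this:

    #!/usr/bin/perl
    # Sketch: merge the permission bits of two files that are about to
    # share one inode.  Read and execute bits are unioned so nobody loses
    # access; write bits are intersected (possibly down to nothing) so
    # nobody gains write access they did not already have.
    use strict;
    use warnings;

    die "usage: $0 FILE1 FILE2\n" unless @ARGV == 2;

    sub merged_mode {
        my ($mode_a, $mode_b) = @_;
        my $rx = ($mode_a | $mode_b) & 0555;   # union of read/execute bits
        my $w  = ($mode_a & $mode_b) & 0222;   # intersection of write bits
        return $rx | $w;
    }

    my ($mode_a, $mode_b) = map { (stat $_)[2] & 07777 } @ARGV;
    printf "merged mode: %04o\n", merged_mode($mode_a, $mode_b);
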
Timestamps
What if two files have the same content but different timestamps?
This was never a problem for me until I started using rsync.
A major performance optimization in rsync is to ignore files which have
the same size and timestamp at both ends of the rsync transfer.
If a hardlinked file's timestamp changes on the receiver, rsync will
transfer that file again and break the hardlink in the process.
My solution to this problem is to consider the timestamp as well as the
contents of the file when deciding whether two files are identical.
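Taken together with the attribute check above, a sketch of what "identical"
ends up meaning (the file_key helper is hypothetical, and Digest::MD5 is
used only for brevity; the scripts below differ in how they key their data):

    # Sketch: include permissions, ownership and mtime in the key used to
    # decide whether two files are identical, so files that differ only in
    # metadata are never candidates for merging in the first place.
    use strict;
    use warnings;
    use Digest::MD5;

    sub file_key {
        my ($path) = @_;
        my ($mode, $uid, $gid, $size, $mtime) = (stat $path)[2, 4, 5, 7, 9];
        open my $fh, '<', $path or die "$path: $!";
        binmode $fh;
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        return join ':', $size, sprintf('%04o', $mode & 07777),
                         $uid, $gid, $mtime, $digest;
    }
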
Modification
A major issue to consider is that a write, mmap, or
truncate on one of many identical copies of a file will modify
only the file that was written to. If the identical copies are replaced
with a single hardlink, then there are no longer several independent
copies, and one of these system calls will modify a single file shared
among many hardlinks.
In short, if this might happen, you can't hardlink your duplicate files.
Examples of such programs are most Unix programs that write directly to
files, e.g. most text editors, cp, dd, and so on.
You can use programs which modify files by creating new ones and renaming
them over the old ones. This will modify directory entries pointing
to the files, not the files themselves. Examples of such programs are
version control systems, rsync, and several GNU utilities with
certain option flags such as cp --remove-destination or cpio
-u.
Beware! cvs used to work this way, but newer versions
do not!
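A minimal sketch of the create-and-rename pattern described above (the
replace_file helper is my own illustration, not part of any of the
scripts): the old inode, and every other hardlink to it, is left untouched;
only this directory entry is changed.

    # Sketch: modify a possibly-hardlinked file safely by writing a new
    # file and renaming it over the old name.
    use strict;
    use warnings;
    use File::Basename qw(dirname);
    use File::Temp qw(tempfile);

    sub replace_file {
        my ($path, $new_contents) = @_;
        # Create the temporary file in the same directory so that rename()
        # stays on one filesystem and remains atomic.
        my ($fh, $tmp) = tempfile(DIR => dirname($path));
        print {$fh} $new_contents;
        close $fh or die "close $tmp: $!";
        # (A real tool would also restore the original permissions and
        # ownership on $tmp before this point.)
        rename $tmp, $path or die "rename $tmp -> $path: $!";
    }
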
There are special cases, such as CD-ROM images, where modifications are
irrelevant.
Algorithm
The duplicate merging algorithm is O(n^2) in the worst case. In practice,
the implementations differ in performance according to their data
structures and the constant factors in their running time.
Some of the implementations try to merge large files first, which means
no space saving for the first part of execution, then rapid space saving,
then diminishing space saving until the program terminates. I refer to
this as the "Big Bang Style."
Other implementations attempt to merge files as they are discovered,
which results in continuous space saving at unpredictable rates throughout
execution, depending on which files are encountered. I refer to this
as the "Slow Trickle Style."
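A skeleton of the Big Bang style as described above (this is my own
illustration, not a transcription of any of the scripts, and
merge_candidates is only a placeholder): nothing is merged until the scan
is complete, and the largest size groups are handled first.

    # Sketch: Big Bang style.  Scan everything, then work through the size
    # groups from largest to smallest, so the big space savings arrive
    # early and taper off toward the end of the run.
    use strict;
    use warnings;
    use File::Find;

    die "usage: $0 DIR...\n" unless @ARGV;

    my %by_size;
    find(sub { push @{ $by_size{ -s _ } }, $File::Find::name if -f }, @ARGV);

    for my $size (sort { $b <=> $a } keys %by_size) {   # largest files first
        my @group = @{ $by_size{$size} };
        next if @group < 2;        # a unique size cannot have duplicates
        merge_candidates($size, @group);
    }

    sub merge_candidates {
        my ($size, @paths) = @_;
        # Placeholder: a real implementation compares or hashes the files
        # here and hardlinks the ones that turn out to be identical.
        print "size $size: @paths\n";
    }

A Slow Trickle implementation would instead do the comparison and linking
inside the find() callback, consulting persistent storage as each file is
discovered.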
Hashing
Hashing files (to convert O(n^2) pairwise file compares into O(n) hash
computations by exploiting a strong hash algorithm) turns out not to be
such a good idea when you have many files that are large and of identical
size, but differ within the first few blocks. This is common when files
are prepared for tasks such as DVD authoring: 30% of the files are exactly
the same size, and that size is about 1GB, but differences can be
established by comparing only a handful of bytes at the beginning of each
file, eliminating the need to read each file in its entirety.
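A sketch of that cheap test (the heads_differ helper and the 64KB figure
are mine, not taken from the scripts): before hashing or fully comparing
two same-sized files, compare their first block and stop early if they
already differ.

    # Sketch: compare the first block of two same-sized files before
    # paying for a full hash or a full byte-by-byte comparison.
    use strict;
    use warnings;

    sub heads_differ {
        my ($path_a, $path_b, $bytes) = @_;
        $bytes ||= 65536;                  # 64KB is usually plenty
        open my $fa, '<', $path_a or die "$path_a: $!";
        open my $fb, '<', $path_b or die "$path_b: $!";
        binmode $_ for $fa, $fb;
        read $fa, my $head_a, $bytes;
        read $fb, my $head_b, $bytes;
        return $head_a ne $head_b;         # true: no need to read the rest
    }
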
One of the implementations allows you to turn off hashing (so you only do
byte-by-byte comparison) or turn off comparison (so you only do hashing).
Disabling both at the same time is probably not a good idea.
Implementations
In my humble opinion the last implementation
(faster-dupemerge) is the best one.
The rest are here to document some of the strategies that failed.

- Date: 1994-1997
- Scalability: 10K files
- Implementation: Perl + find + md5sum
- Persistent Storage: Flat text file
- Temporary Storage: Perl hash and arrays
- Style: Big Bang
- Strengths:
  - Uses straight perl (maybe even just perl4) and shell, no modules required (not even core modules)
  - Persistent storage of md5 hashes in a flat text file
  - Tries to sensibly handle clobbered ownership and permissions
  - Avoids unnecessary hashing of files by selecting based on size first
  - Access to find command line
- Weaknesses:
  - Limited by usage of RAM (entire file database must fit in RAM before processing)
  - Persistent storage in a flat text file
  - Persistent storage inside the filesystem to be merged
  - Links created at top level directory and renamed to destination directory (this breaks on the Linux MS-DOS/VFAT filesystem implementation)
  - Clobbers timestamps, ownership, permissions
  - No concurrent access to persistent storage
  - Persistent storage accessed atomically (no incremental update or useful concurrent access)

- Date: 2001
- Scalability: 100K files
- Implementation: Perl (Fcntl) + find + md5sum
- Persistent Storage: Hardlink tree inside filesystem
- Temporary Storage: Perl hash (dev+inode)
- Style: Slow Trickle
- Strengths:
  - Uses modern Perl and shell, no non-core modules required
  - Incremental use of persistent storage
  - Concurrent access to persistent storage
- Weaknesses:
  - Limited by usage of RAM
  - Persistent storage inside the filesystem to be merged
  - Requires free space in the target filesystem
  - Hashes all files, even files that are unique
  - Cannot cope with hash collisions
  - Makes hardlinks to all files, whether they will ultimately be merged or not (breaks NFS)
  - Leaves hardlinks after original files unlinked (can't delete anything)
  - Performance dependent on filesystem (ext3 is very slow)
  - Links created across directories (this breaks on the Linux MS-DOS/VFAT filesystem implementation)

- Date: 2001-2002
- Scalability: 1M files
- Implementation: Tcl + Berkeley DB + cmp + ln + md5sum
- Persistent Storage: Berkeley DB environment
- Temporary Storage: Small Tcl arrays
- Style: Slow Trickle
- Strengths:
  - Persistent storage outside of hardlinked filesystem
  - Incremental persistent storage
  - Concurrent access to persistent storage
  - Avoids unnecessary hashing of files until two files of the same size seen
  - Low RAM requirement
- Weaknesses:
  - Requires disk space (but trivially modified to use only RAM)
  - All Berkeley DB errors must be handled by killing all processes accessing the DB environment, running recovery, and restarting
  - Accesses permanent disk space randomly; most I/O time is spent in Berkeley DB, not on merged files
  - Relies on ln -vf (and has the same race condition; a race-free alternative is sketched after this list)
  - No find command line or equivalent options
  - Contrary to its name, this is the slowest of the implementations
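
For reference, here is a sketch of a link-and-rename replacement that
avoids the ln -f race mentioned in the weakness above (the temporary-name
scheme and the replace_with_link helper are illustrative; this is not a
description of what faster-dupemerge actually does).

    # Sketch: replace a duplicate with a hardlink without the window that
    # "ln -f" leaves between unlinking the old name and creating the new
    # link.  The link is made under a temporary name in the same directory
    # and then rename()d over the duplicate; rename() is atomic.
    use strict;
    use warnings;
    use File::Basename qw(basename dirname);

    sub replace_with_link {
        my ($keep, $dup) = @_;
        my $tmp = dirname($dup) . '/.dupemerge-tmp.' . basename($dup) . ".$$";
        link $keep, $tmp  or die "link $keep -> $tmp: $!";
        rename $tmp, $dup or do { unlink $tmp; die "rename $tmp -> $dup: $!" };
    }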

- Date: 2002-2010
- Scalability: 1000M files
- Implementation: Perl (File::Compare, File::Temp, Fcntl, Digest::SHA1 or external md5sum program) + sort + find
- Persistent Storage: None
- Temporary Storage: Whatever sort needs (usually /tmp), small Perl arrays
- Style: Big Bang
- git repo: git://git.hungrycats.org/~zblaxell/projects/dupemerge
- Strengths:
  - Low RAM requirement
  - Avoids unnecessary hashing of files until two files of the same size seen
  - Supports compare-by-hash and compare-byte-by-byte modes
  - Space and time-efficient sorting of files before processing
  - Full access to find and sort program command line arguments for high flexibility
  - Built-in fcntl locking to exclude concurrent access (this was required by one of the applications...)
  - Uses no external programs per file if all Perl modules installed, but can use an external md5sum program
  - Uses stronger SHA1 hash by default
  - Copes better (but still not very well) with files in non-writable directories
- Weaknesses:
  - Limited by sort program (fortunately the limit is very large)
  - Needs disk space for large sets of files (used by sort)
  - No persistent storage
  - Less portable (needs GNU find, GNU sort, many core Perl modules)