Dupemerge
Authored by Zygo Blaxell
Background
Filesystems often contain identical copies of files. On Unix filesystems,
the storage space consumed by such redundancy can be reclaimed by replacing
the identical copies with hardlinks to a single file.
Here are four scripts which all do more or less the same thing: scan
a filesystem for identical files and replace them with hardlinks to
the same file.
Issues
There are some issues which all programs of this type must deal with.
File Attributes
What if two files have the same content, but different ownership or
access rights?
I have used two solutions to this problem. The first was to calculate
ownership and access rights that make sense for the merged file: usually
the union of the read and execute permissions and the intersection
(possibly empty) of the write permissions. A later solution was simply to
consider the ownership and access rights, in addition to the contents of
the file, when deciding whether two files are identical, thereby avoiding
hardlinking files with distinct access rights or ownership in the first
place.
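As a rough illustration of the first approach (the merged_mode helper and
the octal masks are mine, not code taken from any of the scripts below),
the merged permissions can be computed from the two stat() modes like this:

    #!/usr/bin/perl
    # Sketch: merge the permission bits of two files that are about to
    # share one inode.  Read and execute bits are unioned so nobody loses
    # access; write bits are intersected (possibly down to nothing) so
    # nobody gains write access they did not already have.
    use strict;
    use warnings;

    die "usage: $0 FILE1 FILE2\n" unless @ARGV == 2;

    sub merged_mode {
        my ($mode_a, $mode_b) = @_;
        my $rx = ($mode_a | $mode_b) & 0555;   # union of read/execute bits
        my $w  = ($mode_a & $mode_b) & 0222;   # intersection of write bits
        return $rx | $w;
    }

    my ($mode_a, $mode_b) = map { (stat $_)[2] & 07777 } @ARGV;
    printf "merged mode: %04o\n", merged_mode($mode_a, $mode_b);
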
Timestamps
What if two files have the same content but different timestamps?
This was never a problem for me until I started using rsync.
A major performance optimization in rsync is to ignore files which have
the same size and timestamp at both ends of the rsync transfer.
If a hardlinked file's timestamp changes on the receiver, rsync will
transfer that file again and break the hardlink in the process.
My solution to this problem is to consider the timestamp as well as the
contents of the file when deciding whether two files are identical.
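Taken together with the attribute check above, a sketch of what "identical"
ends up meaning (the file_key helper is hypothetical, and Digest::MD5 is
used only for brevity; the scripts below differ in how they key their data):

    # Sketch: include permissions, ownership and mtime in the key used to
    # decide whether two files are identical, so files that differ only in
    # metadata are never candidates for merging in the first place.
    use strict;
    use warnings;
    use Digest::MD5;

    sub file_key {
        my ($path) = @_;
        my ($mode, $uid, $gid, $size, $mtime) = (stat $path)[2, 4, 5, 7, 9];
        open my $fh, '<', $path or die "$path: $!";
        binmode $fh;
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        return join ':', $size, sprintf('%04o', $mode & 07777),
                         $uid, $gid, $mtime, $digest;
    }
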
Modification
A major issue to consider is that a write, mmap, or
truncate on one of many identical copies of a file will modify
only the file that was written to. If the identical copies are replaced
with a single hardlink, then there are no longer several independent
copies, and one of these system calls will modify a single file shared
among many hardlinks.
In short, if this might happen, you can't hardlink your duplicate files.
Examples of such programs are most Unix programs that write directly to
files, e.g. most text editors, cp, dd, and so on.
You can use programs which modify files by creating new ones and renaming
them over the old ones. This will modify directory entries pointing
to the files, not the files themselves. Examples of such programs are
version control systems, rsync, and several GNU utilities with
certain option flags such as cp --remove-destination or cpio
-u.
Beware! cvs used to work this way, but newer versions
do not!
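A minimal sketch of the create-and-rename pattern described above (the
replace_file helper is my own illustration, not part of any of the
scripts): the old inode, and every other hardlink to it, is left untouched;
only this directory entry is changed.

    # Sketch: modify a possibly-hardlinked file safely by writing a new
    # file and renaming it over the old name.
    use strict;
    use warnings;
    use File::Basename qw(dirname);
    use File::Temp qw(tempfile);

    sub replace_file {
        my ($path, $new_contents) = @_;
        # Create the temporary file in the same directory so that rename()
        # stays on one filesystem and remains atomic.
        my ($fh, $tmp) = tempfile(DIR => dirname($path));
        print {$fh} $new_contents;
        close $fh or die "close $tmp: $!";
        # (A real tool would also restore the original permissions and
        # ownership on $tmp before this point.)
        rename $tmp, $path or die "rename $tmp -> $path: $!";
    }
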
There are special cases, such as CD-ROM images, where modifications are
irrelevant.
Algorithm
The duplicate merging algorithm is O(n^2) in the worst case. In practice,
the implementations differ in performance according to their data
structures and the constant factors in their running time.
Some of the implementations try to merge large files first, which means
no space saving for the first part of execution, then rapid space saving,
then diminishing space saving until the program terminates. I refer to
this as the "Big Bang Style."
Other implementations attempt to merge files as they are discovered,
which results in continuous space saving at unpredictable rates throughout
execution, depending on which files are encountered. I refer to this
as the "Slow Trickle Style."
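A skeleton of the Big Bang style as described above (this is my own
illustration, not a transcription of any of the scripts, and
merge_candidates is only a placeholder): nothing is merged until the scan
is complete, and the largest size groups are handled first.

    # Sketch: Big Bang style.  Scan everything, then work through the size
    # groups from largest to smallest, so the big space savings arrive
    # early and taper off toward the end of the run.
    use strict;
    use warnings;
    use File::Find;

    die "usage: $0 DIR...\n" unless @ARGV;

    my %by_size;
    find(sub { push @{ $by_size{ -s _ } }, $File::Find::name if -f }, @ARGV);

    for my $size (sort { $b <=> $a } keys %by_size) {   # largest files first
        my @group = @{ $by_size{$size} };
        next if @group < 2;        # a unique size cannot have duplicates
        merge_candidates($size, @group);
    }

    sub merge_candidates {
        my ($size, @paths) = @_;
        # Placeholder: a real implementation compares or hashes the files
        # here and hardlinks the ones that turn out to be identical.
        print "size $size: @paths\n";
    }

A Slow Trickle implementation would instead do the comparison and linking
inside the find() callback, consulting persistent storage as each file is
discovered.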
Hashing
Hashing files (to convert O(n^2) pairwise file compares into O(n) hash
computations by exploiting a strong hash algorithm) turns out not to be
such a good idea when you have many files that are large and of identical
size, but differ within the first few blocks. This is common when files
are prepared for tasks such as DVD authoring: 30% of the files are exactly
the same size, and that size is about 1GB, but differences can be
established by comparing only a handful of bytes at the beginning of each
file, eliminating the need to read each file in its entirety.
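A sketch of that cheap test (the heads_differ helper and the 64KB figure
are mine, not taken from the scripts): before hashing or fully comparing
two same-sized files, compare their first block and stop early if they
already differ.

    # Sketch: compare the first block of two same-sized files before
    # paying for a full hash or a full byte-by-byte comparison.
    use strict;
    use warnings;

    sub heads_differ {
        my ($path_a, $path_b, $bytes) = @_;
        $bytes ||= 65536;                  # 64KB is usually plenty
        open my $fa, '<', $path_a or die "$path_a: $!";
        open my $fb, '<', $path_b or die "$path_b: $!";
        binmode $_ for $fa, $fb;
        read $fa, my $head_a, $bytes;
        read $fb, my $head_b, $bytes;
        return $head_a ne $head_b;         # true: no need to read the rest
    }
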
One of the implementations allows you to turn off hashing (so you only do
byte-by-byte comparison) or turn off comparison (so you only do hashing).
Disabling both at the same time is probably not a good idea.
Implementations
In my humble opinion the last implementation
(faster-dupemerge) is the best one.
The rest are here to document some of the strategies that failed.

- Date: 1994-1997
- Scalability: 10K files
- Implementation: Perl + find + md5sum
- Persistent Storage: Flat text file
- Temporary Storage: Perl hash and arrays
- Style: Big Bang
- Strengths:
  - Uses straight perl (maybe even just perl4) and shell, no modules required (not even core modules)
  - Persistent storage of md5 hashes in a flat text file
  - Tries to sensibly handle clobbered ownership and permissions
  - Avoids unnecessary hashing of files by selecting based on size first
  - Access to find command line
- Weaknesses:
  - Limited by usage of RAM (entire file database must fit in RAM before processing)
  - Persistent storage in a flat text file
  - Persistent storage inside the filesystem to be merged
  - Links created at top level directory and renamed to destination directory (this breaks on the Linux MS-DOS/VFAT filesystem implementation)
  - Clobbers timestamps, ownership, permissions
  - No concurrent access to persistent storage
  - Persistent storage accessed atomically (no incremental update or useful concurrent access)

- Date: 2001
- Scalability: 100K files
- Implementation: Perl (Fcntl) + find + md5sum
- Persistent Storage: Hardlink tree inside filesystem
- Temporary Storage: Perl hash (dev+inode)
- Style: Slow Trickle
- Strengths:
  - Uses modern Perl and shell, no non-core modules required
  - Incremental use of persistent storage
  - Concurrent access to persistent storage
- Weaknesses:
  - Limited by usage of RAM
  - Persistent storage inside the filesystem to be merged
  - Requires free space in the target filesystem
  - Hashes all files, even files that are unique
  - Cannot cope with hash collisions
  - Makes hardlinks to all files, whether they will ultimately be merged or not (breaks NFS)
  - Leaves hardlinks after original files unlinked (can't delete anything)
  - Performance dependent on filesystem (ext3 is very slow)
  - Links created across directories (this breaks on the Linux MS-DOS/VFAT filesystem implementation)

- Date: 2001-2002
- Scalability: 1M files
- Implementation: Tcl + Berkeley DB + cmp + ln + md5sum
- Persistent Storage: Berkeley DB environment
- Temporary Storage: Small Tcl arrays
- Style: Slow Trickle
- Strengths:
  - Persistent storage outside of hardlinked filesystem
  - Incremental persistent storage
  - Concurrent access to persistent storage
  - Avoids unnecessary hashing of files until two files of the same size seen
  - Low RAM requirement
- Weaknesses:
  - Requires disk space (but trivially modified to use only RAM)
  - All Berkeley DB errors must be handled by killing all processes accessing the DB environment, running recovery, and restarting
  - Accesses permanent disk space randomly; most I/O time is spent in Berkeley DB, not on merged files
  - Relies on ln -vf (and has the same race condition; a race-free alternative is sketched after this list)
  - No find command line or equivalent options
  - Contrary to its name, this is the slowest of the implementations
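
For reference, here is a sketch of a link-and-rename replacement that
avoids the ln -f race mentioned in the weakness above (the temporary-name
scheme and the replace_with_link helper are illustrative; this is not a
description of what faster-dupemerge actually does).

    # Sketch: replace a duplicate with a hardlink without the window that
    # "ln -f" leaves between unlinking the old name and creating the new
    # link.  The link is made under a temporary name in the same directory
    # and then rename()d over the duplicate; rename() is atomic.
    use strict;
    use warnings;
    use File::Basename qw(basename dirname);

    sub replace_with_link {
        my ($keep, $dup) = @_;
        my $tmp = dirname($dup) . '/.dupemerge-tmp.' . basename($dup) . ".$$";
        link $keep, $tmp  or die "link $keep -> $tmp: $!";
        rename $tmp, $dup or do { unlink $tmp; die "rename $tmp -> $dup: $!" };
    }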

- Date: 2002-2010
- Scalability: 1000M files
- Implementation: Perl (File::Compare, File::Temp, Fcntl, Digest::SHA1 or external md5sum program) + sort + find
- Persistent Storage: None
- Temporary Storage: Whatever sort needs (usually /tmp), small Perl arrays
- Style: Big Bang
- git repo: git://git.hungrycats.org/~zblaxell/projects/dupemerge
- Strengths:
  - Low RAM requirement
  - Avoids unnecessary hashing of files until two files of the same size seen
  - Supports compare-by-hash and compare-byte-by-byte modes
  - Space and time-efficient sorting of files before processing
  - Full access to find and sort program command line arguments for high flexibility
  - Built-in fcntl locking to exclude concurrent access (this was required by one of the applications...)
  - Uses no external programs per file if all Perl modules installed, but can use an external md5sum program
  - Uses stronger SHA1 hash by default
  - Copes better (but still not very well) with files in non-writable directories
- Weaknesses:
  - Limited by sort program (fortunately the limit is very large)
  - Needs disk space for large sets of files (used by sort)
  - No persistent storage
  - Less portable (needs GNU find, GNU sort, many core Perl modules)