Dupemerge

Authored by Zygo Blaxell

Background

Filesystems often contain identical copies of files. On Unix filesystems, the storage space wasted by such redundancy can be reclaimed by replacing the identical copies with hardlinks to a single file.

Here are four scripts which all do more or less the same thing: scan a filesystem for identical files and replace them with hardlinks to the same file.
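
All four share the same core step: once two files are known to be identical (and safe to merge), one of the names is redirected to the other file's inode. Here is a minimal sketch of that step in Perl, with hypothetical file names and none of the safety checks the real scripts perform:

    #!/usr/bin/perl
    # Sketch only: replace the name $dup with a hardlink to $keep.  The
    # link is created under a temporary name first, so rename() swaps it
    # in atomically and the name $dup never disappears.
    use strict;
    use warnings;

    sub merge_into_hardlink {
        my ($keep, $dup) = @_;
        my $tmp = "$dup.dupemerge-tmp.$$";    # temporary name, same directory
        link($keep, $tmp) or die "link $keep -> $tmp: $!";
        unless (rename($tmp, $dup)) {
            my $err = $!;
            unlink($tmp);                     # clean up the extra link
            die "rename $tmp -> $dup: $err";
        }
    }

    merge_into_hardlink(@ARGV) if @ARGV == 2; # e.g. merge.pl keep.txt dup.txt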

Issues

There are some issues which all programs of this type must deal with.

File Attributes

What if two files have the same content, but different ownership or access rights?

I used two solutions to this problem. The first was to calculate ownership and access rights that make sense for the merged file: usually the union of the read and execute permissions and the intersection (possibly empty) of the write permissions. A later solution was simply to consider ownership and access rights, in addition to the contents, when deciding whether two files are identical, so that files with distinct access rights or ownership are never hardlinked in the first place.
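
The permission calculation in the first approach amounts to a few bitwise operations on the mode bits. A sketch of the idea (ownership changes and the setuid/setgid/sticky bits are ignored here, and this is not code from any of the scripts):

    #!/usr/bin/perl
    # Sketch: merge two files' permission bits as described above.  Read
    # and execute bits are the union of the two modes, write bits are the
    # intersection (possibly empty).
    use strict;
    use warnings;

    sub merged_mode {
        my ($mode_a, $mode_b) = @_;
        my $rx = ($mode_a | $mode_b) & 0555;  # union of read/execute bits
        my $w  = ($mode_a & $mode_b) & 0222;  # intersection of write bits
        return $rx | $w;
    }

    printf "%04o\n", merged_mode(0644, 0755); # prints 0755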

Timestamps

What if two files have the same content but different timestamps?

This was never a problem for me until I started using rsync. A major performance optimization in rsync is to ignore files which have the same size and timestamp at both ends of the rsync transfer. If the file timestamps change on the receiver, the sender will transmit the file again, and break the hardlink in the process.

My solution to this problem is to consider the timestamp, as well as the contents of the file, when deciding whether two files are identical.
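
Put another way, two files only become candidates for merging if they agree on a key built from their attributes as well as their contents. A sketch of such a key (the exact set of fields is illustrative, not taken from any of the scripts):

    #!/usr/bin/perl
    # Sketch: an "identity key" built from file attributes.  Files whose
    # keys differ are never compared or hardlinked, so differing size,
    # mtime, ownership or mode keeps them apart.
    use strict;
    use warnings;

    sub identity_key {
        my ($path) = @_;
        my ($dev, $ino, $mode, $nlink, $uid, $gid, $rdev, $size,
            $atime, $mtime) = stat($path) or die "stat $path: $!";
        return join(":", $size, $mtime, $uid, $gid,
                    sprintf("%04o", $mode & 07777));
    }

    print identity_key($_), "\t$_\n" for @ARGV;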

Modification

A major issue to consider is that a write, mmap, or truncate on one of several identical copies of a file modifies only the copy that was written to. If the identical copies are replaced with hardlinks to a single file, there are no longer several independent copies, and any of these system calls modifies the one file shared by all of the hardlinks.

In short, if this might happen, you can't hardlink your duplicate files. Most Unix programs that write directly to files fall into this category: most text editors, cp, dd, and so on.

You can use programs which modify files by creating new ones and renaming them over the old ones. This will modify directory entries pointing to the files, not the files themselves. Examples of such programs are version control systems, rsync, and several GNU utilities with certain option flags such as cp --remove-destination or cpio -u. Beware! cvs used to work this way, but newer versions do not!
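
The safe pattern looks roughly like this (a sketch, not code from any particular tool): the new data is written to a temporary file which is then renamed over the old name, so any other hardlinks keep pointing at the unmodified old inode.

    #!/usr/bin/perl
    # Sketch: update a file without disturbing other hardlinks to its
    # inode.  rename() replaces only the directory entry; other links
    # keep the old contents.
    use strict;
    use warnings;

    sub replace_contents {
        my ($path, $new_contents) = @_;
        my $tmp = "$path.new.$$";             # temporary name, same directory
        open(my $fh, ">", $tmp) or die "open $tmp: $!";
        print {$fh} $new_contents;
        close($fh) or die "close $tmp: $!";
        rename($tmp, $path) or die "rename $tmp -> $path: $!";
    }

    replace_contents("example.txt", "new contents\n");  # hypothetical file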

There are special cases, such as CD-ROM images, where modifications are irrelevant.

Algorithm

The duplicate merging algorithm has a worst case of O(n^2) in the number of files. In practice, different implementations have different performance, determined by their data structures and by the constant factors in their running time.

Some of the implementations try to merge large files first, which means no space saving for the first part of execution, then rapid space saving, then diminishing space saving until the program terminates. I refer to this as the "Big Bang Style."

Other implementations attempt to merge files as they are discovered, which results in continuous space saving at unpredictable rates throughout execution, depending on which files are encountered. I refer to this as the "Slow Trickle Style."
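
A minimal sketch of the Big Bang collection phase, assuming files are grouped by size and the largest groups are handled first (this is the shape of the idea, not the actual code of any implementation):

    #!/usr/bin/perl
    # Sketch of "Big Bang" ordering: collect every file first, group by
    # size, then work through the groups from the largest size down.
    # Only files of equal size are ever compared, which is what keeps the
    # O(n^2) worst case rare in practice.
    use strict;
    use warnings;
    use File::Find;

    @ARGV or die "usage: $0 directory...\n";

    my %by_size;                              # size => [ paths ]
    find(sub {
        my @st = lstat($_);
        return unless @st && -f _;            # plain files only, skip symlinks
        push @{ $by_size{ $st[7] } }, $File::Find::name;
    }, @ARGV);

    for my $size (sort { $b <=> $a } keys %by_size) {
        my @group = @{ $by_size{$size} };
        next if @group < 2;                   # nothing to merge here
        # ...compare the files in @group and hardlink the identical ones...
        print "candidate group: $size bytes, ", scalar(@group), " files\n";
    }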

Hashing

Hashing files (to convert O(n^2) pairwise file comparisons into O(n) hash computations by exploiting a strong hash algorithm) turns out not to be such a good idea when you have many large files of identical size that differ within the first few blocks. This is common when files are prepared for tasks such as DVD authoring: 30% of the files are exactly the same size, and that size is about 1GB, but differences can be established by comparing only a handful of bytes at the beginning of each file, eliminating the need to read each file in its entirety.

One of the implementations allows you to turn off hashing (so you only do byte-by-byte comparison) or turn off comparison (so you only do hashing). Disabling both at the same time is probably not a good idea.
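
A byte-by-byte comparison only has to read until the first difference. A sketch of that early-exit behaviour (faster-dupemerge presumably gets the same effect from File::Compare; this code is illustrative only):

    #!/usr/bin/perl
    # Sketch: compare two files block by block, stopping at the first
    # difference.  Two 1GB files that differ in the first block cost one
    # short read each, where hashing would read both files in full.
    use strict;
    use warnings;

    sub files_differ {
        my ($path_a, $path_b, $blocksize) = @_;
        $blocksize ||= 65536;
        open(my $fa, "<", $path_a) or die "open $path_a: $!";
        open(my $fb, "<", $path_b) or die "open $path_b: $!";
        binmode($fa);
        binmode($fb);
        while (1) {
            my $ra = read($fa, my $bufa, $blocksize);
            my $rb = read($fb, my $bufb, $blocksize);
            die "read error: $!" unless defined $ra && defined $rb;
            return 1 if $ra != $rb || $bufa ne $bufb;  # found a difference
            return 0 if $ra == 0;                      # both at EOF: identical
        }
    }

    die "usage: $0 file1 file2\n" unless @ARGV == 2;
    my $differ = files_differ(@ARGV);
    print $differ ? "differ\n" : "identical\n";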

Implementations

In my humble opinion the last implementation (faster-dupemerge) is the best one. The rest are here to document some of the strategies that failed.

dupescan

Date: 1994-1997
Scalability: 10K files
Implementation: Perl + find + md5sum
Persistent Storage: Flat text file
Temporary Storage: Perl hash and arrays
Style: Big Bang
Strengths:
Weaknesses:

dupemerge

Date: 2001
Scalability: 100K files
Implementation: Perl (Fcntl) + find + md5sum
Persistent Storage: Hardlink tree inside filesystem
Temporary Storage: Perl hash (dev+inode)
Style: Slow Trickle
Strengths:
Weaknesses:

fast-dupemerge

Date: 2001-2002
Scalability: 1M files
Implementation: Tcl + Berkeley DB + cmp + ln + md5sum
Persistent Storage: Berkeley DB environment
Temporary Storage: Small Tcl arrays
Style: Slow Trickle
Strengths:
Weaknesses:

faster-dupemerge

Date: 2002-2010
Scalability: 1000M files
Implementation: Perl (File::Compare, File::Temp, Fcntl, Digest::SHA1 or external md5sum program) + sort + find
Persistent Storage: None
Temporary Storage: Whatever sort needs (usually /tmp), small Perl arrays
Style: Big Bang
Git repo: git://git.hungrycats.org/~zblaxell/projects/dupemerge
Strengths:
Weaknesses: