Dupmerge and Rmdups
Phil Karn, KA9Q

These two command-line utilities are designed to help you clean out a file system that has become cluttered with duplicate files. If you're like me, you often make ad-hoc backups of a directory hierarchy with tar, rsync or cpio that tend to sit around forever. Or you keep much the same data on more than one system (with one typically being a laptop), and occasionally you want to consolidate the two collections.

These utilities can help you do that. There are two: dupmerge, which reclaims the space used by redundant copies, and rmdups, which eliminates redundant path names pointing to a common file.


dupmerge man page
dupmerge.c source code
dupmerge binary for OSX Snow Leopard (64 bit)
dupmerge binary for i386 Linux
rmdups man page
rmdups.c source code
rmdups binary for OSX Snow Leopard (64 bit)
rmdups binary for i386 Linux

Unix Hard Links

To use either command properly you must be familiar with the concept of the hard link, so I'll provide a quick refresher. It originated in UNIX, so it has long been part of UNIX clones like Linux, BSD, and Mac OS X. It may also have been picked up by other operating systems with which I am unfamiliar.

It might not seem like it, but UNIX files do not actually have names; internally they are uniquely identified by device and inode numbers. The user refers to them with path names, printable strings that define a series of directories delimited by slashes (/) -- a path -- that ultimately point to the inode and the file's metadata and contents. Every file (inode) has at least one path name pointing to it. When more than one path name points to the same inode, they are all equally valid; there is no preferred or primary path name. These path names are sometimes referred to as "hard links", as distinguished from the "soft" or "symbolic" link, a special kind of file that contains a path name (possibly one of several hard links!) pointing to another file.
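
To make this concrete, here is a minimal C sketch (not taken from dupmerge.c; the file names are made up) showing that two hard links are simply two path names for one inode:

/* hardlink_demo.c: two path names, one inode.  File names are hypothetical. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

int main(void)
{
    struct stat a, b;

    /* Create an ordinary file, then a second hard link to it */
    int fd = open("original.txt", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }
    close(fd);
    if (link("original.txt", "alias.txt") < 0) { perror("link"); return 1; }

    /* Both path names resolve to the same (device, inode) pair */
    stat("original.txt", &a);
    stat("alias.txt", &b);
    printf("same inode: %s (inode %llu, %lu links)\n",
           (a.st_dev == b.st_dev && a.st_ino == b.st_ino) ? "yes" : "no",
           (unsigned long long)a.st_ino, (unsigned long)a.st_nlink);
    return 0;
}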

The dupmerge command works by reading a list of path names, identifying those that point to distinct inodes with the same content, deleting all but one, and recreating the path names for the deleted copies as hard links to the remaining copy. When it finishes, all the path names given to the program still exist, and each provides the same content when read. However, the redundant copies have been deleted and their space returned to the file system's free list.
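
The heart of that operation can be pictured in a few lines of C. This is a simplified sketch, not the actual dupmerge code; the real program verifies that the two files are identical and on the same file system before doing anything like this:

/* Replace a redundant copy with a hard link to the surviving copy. */
#include <stdio.h>
#include <unistd.h>

int merge_into(const char *keep, const char *dup)
{
    if (unlink(dup) < 0) {          /* delete the redundant copy */
        perror("unlink");
        return -1;
    }
    if (link(keep, dup) < 0) {      /* recreate its path name as a hard link */
        perror("link");
        return -1;
    }
    return 0;                       /* dup still exists, but now shares keep's inode */
}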

Here is a simple shell command that invokes dupmerge on all the files in the current directory and under any and all subdirectories:

find . -print0 | dupmerge -0

dupmerge automatically ignores all directories, special device files, FIFOs -- everything that isn't an ordinary file -- so you don't have to bother with any fancy arguments to the find command. When it's done, every path name will still be present and yield the same contents when accessed. But any path names that had referred to separate copies of the same data are now hard links to a single copy.
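
The filtering itself is simple; something along these lines (an illustrative sketch, not the code from dupmerge.c) is all that is needed to keep only ordinary files:

/* Keep only ordinary files; skip directories, devices, FIFOs, sockets
   and symbolic links. */
#include <sys/stat.h>

int is_candidate(const char *path)
{
    struct stat st;
    if (lstat(path, &st) < 0)       /* unreadable or vanished: skip it */
        return 0;
    return S_ISREG(st.st_mode);     /* true only for regular files */
}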

Running dupmerge can still make some visible differences, and it is important to understand them before you use the program on an important file system, because there is, as yet, no way to reverse its effects. In UNIX, a file's metadata belongs to the inode, not the path name. The metadata includes the ownership and group membership of the file, its access permissions, and the times of its last modification, (if enabled) last access, and last inode change. Since only one inode in each group of inodes with the same contents will survive an execution of dupmerge, file ownership, permissions and modification timestamps may change for certain path names even though the contents remain unchanged. The metadata on the deleted copies is currently lost, though a future version of dupmerge may save it in an "undo" file so that its effects could be undone. Until then, it is best to limit dupmerge to directory hierarchies whose files have a single owner and group, and to anticipate any problems that might be caused by the loss of metadata on the redundant copies that will be deleted.

Dupmerge always keeps the oldest copy among a group of identical files, on the principle that the oldest timestamp is most likely the actual time the data was created, with the newer timestamps merely reflecting the time that each copy was made from the original.
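
For example, picking the survivor can be as simple as the following sketch, which assumes the modification time is the timestamp being compared (illustrative only, not the dupmerge source):

#include <sys/stat.h>

/* Given stat results for one group of identical files, return the index
   of the oldest copy, i.e. the one to keep.  Assumes st_mtime is the
   relevant timestamp. */
int pick_survivor(const struct stat *sts, int n)
{
    int oldest = 0;
    for (int i = 1; i < n; i++)
        if (sts[i].st_mtime < sts[oldest].st_mtime)
            oldest = i;
    return oldest;
}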

Because empty files are often used as system lock files, with significance in the inode metadata, dupmerge ignores them by default. (There's little to reclaim by deleting an empty file anyway.)

Dupmerge Options

-0

By default, each file name is terminated with either a newline or a null. This option specifies that each file name can only be terminated with a null. (Use find's -print0 option to generate such a list.)
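
Consuming such a list is straightforward; here is a sketch of one way to do it with POSIX getdelim() (illustrative, not the parser from dupmerge.c):

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *name = NULL;
    size_t cap = 0;
    ssize_t len;

    /* Read one NUL-terminated path name at a time, as produced by
       "find ... -print0".  getdelim() stores the '\0' delimiter, which
       also terminates the C string. */
    while ((len = getdelim(&name, &cap, '\0', stdin)) > 0)
        printf("path: %s\n", name);

    free(name);
    return 0;
}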

-z

By default, empty files are ignored. They are often used as locks, and disturbing them could cause problems. Besides, empty files have no assigned blocks, so there's not much to be gained by deleting them (other than the inode entry, of course). This option overrides the default; when set, all empty files will be linked together.

-s

By default, the list of files is sorted by decreasing size so that the largest files are examined first. This option reverses the sort, so the smallest files are examined first.

I often run dupmerge when a file system is out of (or is about to run out of) free space, so scanning the largest files first usually recovers space more quickly. I can't think of a reason to examine the smallest files first, but if you can, this option will do it.
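
The default ordering amounts to nothing more than a qsort() comparator like the one below (a sketch; the struct holding each file's path and stat data is hypothetical):

#include <stdlib.h>
#include <sys/stat.h>

struct entry {                      /* hypothetical per-file record */
    char *path;
    struct stat st;
};

/* Order by decreasing size so the largest files are examined first. */
int by_size_desc(const void *a, const void *b)
{
    const struct entry *x = a, *y = b;
    if (x->st.st_size > y->st.st_size) return -1;
    if (x->st.st_size < y->st.st_size) return  1;
    return 0;
}

/* usage: qsort(entries, nentries, sizeof(struct entry), by_size_desc); */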

-n

This flag tells dupmerge to conduct a dry run. It will appear to operate normally, but without actually unlinking or relinking any files.

-q

Suppress the progress messages that dupmerge normally generates, including the unlink/link commands generated for each duplicate file. Error messages are still displayed. This flag is forced off by the -n flag.

-f

Enable a shortcut comparison heuristic that can substantially speed up the program under certain special cases. When set, a pair of files with the same basename (the part of the path name after the last '/'), the same size, and the same modification date will be considered identical without actually comparing their contents. This can be an important special case as many duplicates are created by copying a directory hierarchy with a utility like rsync, tar or cpio that preserves modification times. The widely used rsync program uses this heuristic by default so it doesn't seem terribly unsafe, but keep in mind the (small) risk that two different files might be erroneously considered identical. For this reason I recommend avoiding this flag when possible; the default file comparison strategy is actually quite fast.

-t

The probability that two different files will have the same size decreases with increasing file size. To reduce the risk of false matches with the -f option, files less than a certain size will nonetheless be compared by their hash codes even when -f is selected. The default size threshold is 100,000 bytes; this option allows another threshold to be set.
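
Taken together, the -f and -t behavior boils down to a check like the following sketch (illustrative only; the buffer sizes and threshold variable are assumptions, not the dupmerge source):

#include <string.h>
#include <libgen.h>                 /* basename() */
#include <sys/stat.h>

static off_t threshold = 100000;    /* -t value; 100,000 bytes by default */

/* Return 1 if -f may treat the two files as identical without hashing:
   same size, same modification time, same basename, and above the size
   threshold.  Smaller files are always hashed. */
int fast_match(const char *pa, const struct stat *sa,
               const char *pb, const struct stat *sb)
{
    if (sa->st_size < threshold || sa->st_size != sb->st_size)
        return 0;
    if (sa->st_mtime != sb->st_mtime)
        return 0;

    /* POSIX basename() may modify its argument, so work on copies */
    char ca[4096], cb[4096];
    strncpy(ca, pa, sizeof ca - 1); ca[sizeof ca - 1] = '\0';
    strncpy(cb, pb, sizeof cb - 1); cb[sizeof cb - 1] = '\0';
    return strcmp(basename(ca), basename(cb)) == 0;
}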

Notes on dupmerge

My first version of this program circa 1993 worked by computing MD5 hashes of every file, sorting the hashes and then looking for duplicates. This worked but it was unnecessarily slow. One reason was that it computed a hash for every file, including those with unique sizes that couldn't possibly have any duplicates (duplicate files always have the same size!)

My second version circa 1999 unlinked the duplicates as a side effect of the sort comparison function.

I have since rewritten it again from scratch. It begins by sorting the file list by size. Files with unique sizes are ignored; two or more files having the same size are compared by their SHA-1 hashes. Comparing hashes, as opposed to actual file contents, has significant performance benefits in most real-world workloads. Each file can be read sequentially, without any disk seeks, and its hash cached so that it need only be read once no matter how many other files have the same size.

But there are nonetheless pathological situations where this algorithm performs much more slowly than direct file-to-file comparison. Consider a collection of a thousand unique files, each exactly 10 megabytes in size. The hash comparison strategy involves reading every file in its entirety while a direct file-to-file comparison can abort as soon as the first difference is detected. Most files, if different, differ near their beginnings.

To handle this possibility with reasonable efficiency, comparing a pair of files actually entails a series of steps. First, both files must have the same size and be on the same file system. (Hard links cannot extend across file systems.) Second, I compute and compare the SHA-1 hash of the first page (4 KB) of each file. This will catch most differing files. If, and only if, the first 4KB of each file have the same hash do I proceed to compute and compare the SHA-1 hash for each entire file. This performs well in practice because most files that differ at all will do so in the first 4KB. It is unusual for two files to have the same size and the same first page, yet differ beyond that. However, it is certainly possible, which is why it is still necessary to compare the complete hashes before declaring two files to be identical.

Every hash result (on the leading page or over the full file) is cached so it never has to be computed more than once. This improves performance substantially when there are many files of the same size.
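
In outline, the comparison looks something like the sketch below. It borrows OpenSSL's SHA1 routines purely for illustration and omits the hash cache; it is not the code in dupmerge.c:

#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

#define PAGE 4096

/* Hash the first "limit" bytes of a file; limit == 0 means the whole file. */
static int hash_file(const char *path, size_t limit,
                     unsigned char md[SHA_DIGEST_LENGTH])
{
    unsigned char buf[PAGE];
    SHA_CTX ctx;
    size_t total = 0, n;
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;
    SHA1_Init(&ctx);
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        SHA1_Update(&ctx, buf, n);
        total += n;
        if (limit && total >= limit)
            break;
    }
    fclose(fp);
    SHA1_Final(md, &ctx);
    return 0;
}

/* Decide whether two same-size files are identical. */
int same_contents(const char *a, const char *b)
{
    unsigned char ha[SHA_DIGEST_LENGTH], hb[SHA_DIGEST_LENGTH];

    /* Stage 1: hashes of the first page catch most differing files */
    if (hash_file(a, PAGE, ha) || hash_file(b, PAGE, hb)) return 0;
    if (memcmp(ha, hb, sizeof ha)) return 0;

    /* Stage 2: the full-file hashes must also agree */
    if (hash_file(a, 0, ha) || hash_file(b, 0, hb)) return 0;
    return memcmp(ha, hb, sizeof ha) == 0;
}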

The program displays copious statistics at the end of execution, such as the number of disk blocks reclaimed from duplicates, the number of first-page and complete-file hash function computations, the number of hash "hits" (references to hash values that have already been computed), and the number of same-size file pairs whose first page hashes match but differ in their full-file hashes.


Rmdups

(Documentation incomplete.) This is a companion to dupmerge that can be used to get rid of unwanted hard links.

Like dupmerge, rmdups also accepts a list of path names from standard input, e.g.,

find . -print0 | rmdups -0

It takes only one option, -0, which specifies that each input path name is null terminated. I strongly recommend using it, along with the -print0 option to find.

Hard links to common files are identified and grouped. If the input includes every hard link to some particular file, they are displayed and the user is prompted as to which he would like to keep; all the others are unlinked. But if the input contains only some of the path names for a given file, those that are cited are automatically unlinked without user intervention.

This feature seems more dangerous than it really is, but it's what makes rmdups useful when cleaning up directories of redundant entries. First of all, rmdups never, under any circumstances, unlinks the last remaining path name to a file (which is what actually deletes a file in a UNIX-like file system).
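
That safety rule is easy to express; conceptually it is just the following check (a sketch, not the rmdups source):

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

/* Unlink a path name only if at least one other hard link to its inode
   remains, so the file itself is never deleted. */
int safe_unlink(const char *path)
{
    struct stat st;
    if (lstat(path, &st) < 0)
        return -1;
    if (st.st_nlink < 2) {
        fprintf(stderr, "%s: last remaining link, keeping it\n", path);
        return 0;
    }
    return unlink(path);
}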

Consider two directories, 'work' and 'tmp'. Some time ago you created 'tmp' as a snapshot of 'work', and now you'd like to clean things up and merge everything back into the 'work' directory. You have already executed

find work tmp -print0 | dupmerge -0

so they contain links to a more or less common set of files. Now you'd like to get rid of the 'tmp' directory, but the latter also has a few unique files that you'd like to keep, so you don't want to issue a simple rm -rf tmp. The command find tmp -print0 | rmdups -0 automatically clears every path name under 'tmp' that points to a file also referred to from 'work', leaving any singly-linked files under 'tmp' untouched so you can go through them by hand.

Consider this sequence:

find . -print0 | dupmerge -0
find . -print0 | rmdups -0

This sequence first identifies duplicate files and reclaims the space in the redundant copies. Then it lists all the path names that point to each file, prompting the user to select one to keep.

rmdups is much less mature than dupmerge, and it needs more work. Right now, for example, there is no way to specify that all of the path names for a file should be kept without simply quitting the program.

Both dupmerge and rmdups, by their nature, are programs that can potentially do a lot of damage to a file system should there prove to be any bugs. With this in mind, I've included a lot of rather paranoid assert checks that will cause the program to abort should something not be as expected. I have used dupmerge for quite some time without any problems, but I wouldn't bet my life on there being none. Should you run into any bugs, please let me know ASAP so I can fix them.

Phil Karn, KA9Q, [email protected]

Last updated: 7 May 2010