an archive compare and cleanup tool

What is it?

When you produce letters, programs, or any other kind of manually created documents, it is advisable to make backups from time to time. Even when these backups are stored as compressed archive files, they quickly become cumbersome, and two successive archives almost certainly contain a large proportion of duplicated files.

The purpose of this program is to remove the duplicated files from the older archive by comparing them to the files in the newer archive.

How it works

First, we assume that each archive is an image of a directory tree, such as one generated with the tar command, which can easily be recreated by extraction. Since we need two archives to compare, we choose two adjacent ones, in order from older to newer. At the next comparison, the previous newer archive becomes the older one, and so on.

We now have two directory trees and can run the archcmp command. The program scans both directories and, for each one, builds a list of the files it contains (with the file name, file size and MD5 sum of each file), a list of symbolic links, and a list of the directories it contains.
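The scanning step can be illustrated with a short Python sketch. This is not the archcmp source code; the function names md5_of and scan_tree are hypothetical, and keying files by their path relative to the tree root is an assumption about what "file name" means here.

```python
import hashlib
import os


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan_tree(root):
    """Return (files, links, dirs) for the tree under root.

    files maps a relative path to (size, md5); links and dirs are
    sorted lists of relative paths. (Hypothetical helper, not archcmp code.)
    """
    files, links, dirs = {}, [], []
    # os.walk does not descend into symbolic links by default.
    for current, subdirs, names in os.walk(root):
        for name in subdirs + names:
            full = os.path.join(current, name)
            rel = os.path.relpath(full, root)
            if os.path.islink(full):
                links.append(rel)       # symbolic link, dead or alive
            elif os.path.isdir(full):
                dirs.append(rel)        # real directory
            elif os.path.isfile(full):
                files[rel] = (os.path.getsize(full), md5_of(full))
    return files, sorted(links), sorted(dirs)
```

Calling scan_tree once on the old tree and once on the new tree gives the two sets of lists that the comparison step works on.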

After this preparation, each file in the old list is looked up in the new list. If a file with the same file name, file size and MD5 sum is found in the new list, the old file is removed from disk. Then the old list is scanned again for symbolic link entries: if a link is dead, it is removed from disk. Finally, the directories are scanned and any empty ones are removed from disk. Each resulting list is written to a file.
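The three cleanup passes described above can be sketched in Python as follows. This only illustrates the logic and is not how archcmp itself is written; file_key and prune_old_tree are hypothetical names, files are matched by their path relative to each tree root (an assumption), and the sketch returns the removed paths instead of writing list files.

```python
import hashlib
import os


def file_key(path):
    """Return (size, md5) for a regular file."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


def prune_old_tree(old_root, new_root):
    """Remove from old_root whatever is unchanged in new_root.

    Returns the relative paths that were removed. (Hypothetical helper.)
    """
    removed = []
    # Pass 1: delete regular files with the same relative name, size and MD5.
    for current, _subdirs, names in os.walk(old_root):
        for name in names:
            old_path = os.path.join(current, name)
            rel = os.path.relpath(old_path, old_root)
            new_path = os.path.join(new_root, rel)
            if os.path.islink(old_path) or not os.path.isfile(old_path):
                continue
            if os.path.islink(new_path) or not os.path.isfile(new_path):
                continue
            if file_key(old_path) == file_key(new_path):
                os.remove(old_path)
                removed.append(rel)
    # Pass 2: delete symbolic links whose target no longer exists.
    for current, subdirs, names in os.walk(old_root):
        for name in subdirs + names:
            path = os.path.join(current, name)
            if os.path.islink(path) and not os.path.exists(path):
                os.remove(path)
                removed.append(os.path.relpath(path, old_root))
    # Pass 3: delete empty directories, deepest first.
    for current, _subdirs, _names in os.walk(old_root, topdown=False):
        if current != old_root and not os.listdir(current):
            os.rmdir(current)
            removed.append(os.path.relpath(current, old_root))
    return removed
```

Walking bottom-up in the last pass means a directory that becomes empty only after its subdirectories are removed is also caught.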

Here is an example command. The -v option makes the program log the operations it performs. It is up to the user to re-create the archive with their preferred archive tool.

archcmp -v  svg100114 svg100811 >& log

Archive sizes before and after running the command and re-archiving:

-rw-rw-r-- 1 hq hq 168636556 Jan 14  2010 svg100114.tar.xz  # before

-rw-rw-r-- 1 hq hq    673324 Sep  8 11:34 svg100114.tar.xz  # after
-rw-rw-r-- 1 hq hq 185471396 Aug 11 09:29 svg100811.tar.xz

The files remaining in the resulting archive are those that were deleted or changed between the two backups.
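Re-creating the archive from the pruned tree is left to the user, and any archive tool will do. As one possible approach, here is a minimal Python sketch using the standard tarfile module; the function name repack is hypothetical, and the xz compression is chosen only to match the .tar.xz files in the listing above.

```python
import os
import tarfile


def repack(tree_dir, archive_path):
    """Write tree_dir into an xz-compressed tar archive; return its size in bytes."""
    with tarfile.open(archive_path, "w:xz") as archive:
        # arcname keeps only the directory's own name as the top-level entry.
        top = os.path.basename(os.path.normpath(tree_dir))
        archive.add(tree_dir, arcname=top)
    return os.path.getsize(archive_path)
```

For example, repack("svg100114", "svg100114.tar.xz") would replace the old archive with the pruned one; keep a copy of the original until the result has been checked.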

Warning: the program removes files from disk, but as long as the original archive is kept, nothing is lost.

Compiling Archcmp

Only libc needs to be installed. Then run the following commands.

tar zxvf archcmp-yyyymmdd.tar.gz
cd arch-yyyymmdd

Then copy the binary into a directory in your path.


The Archcmp software is licensed under the terms of the GNU General Public License as published by the Free Software Foundation. See the file COPYING in the archive.