
September 7, 2014

Learnings From A Diffing Tool

TL;DR - Source code and screenshots.

Motivation

This post is about the things I learned in the course of developing a folder diffing tool, a recent pastime project. A couple of weeks back, I was wondering whether there was any tool I could use to diff two folders, find duplicate files, and delete them from either folder. I wanted to see if I had any duplicates in the files that I copied from my friend. So I started thinking about a diffing tool that offers these capabilities. I could think of BeyondCompare and a few other diffing tools, but off the top of my head I couldn't think of any tool that could diff folders and help me delete the duplicates from either of the folders.

I said to myself that this was the perfect opportunity for a project! So I went home that day and immediately wrote down my requirements for the diffing tool. I decided to call it 'FDiffDelete' - Folder Diff and Delete. The initial set of requirements that I came up with was:

  1. Compare two folders (a left folder and a right folder) and highlight the common/duplicate files (name and size match - later, use a hash).
  2. Give the user the choice to delete the common/duplicate files from either folder.
  3. Ver1.0: only single-level files in a folder. Later, recursive compare.
  4. UI stuff: a dialog box split into two halves. Each half has a text area to specify a folder path and a list view to display the files in the specified folder. Each half also has a delete button that enables the user to delete selected files (or all duplicate files) from that folder.
I later realized that I should compare files by their last-modified date as well, so I added that to the requirements list:
  • Duplicate files are determined by name, size, and last-modified date (later, or by using the file hash (SHA1)).

Nitty-gritty design details

With the requirements set, I started designing the tool. I needed a data structure to hold information about the files in a folder and to allow comparisons between files from the two folders. The very first thing that came to mind was a hash table to store the file information of all files in a folder.

Pros and cons of a hash table:

  • + Fast inserts and lookups.
  • + Pretty straightforward implementation in C. There is no need for complicated operations like the rebalancing a binary tree requires.
  • + I already had a hash table implementation in my CHelpLib project :-) .

  • - Not very space efficient when the size of the hash table is much larger than the actual number of items in the table, i.e., when it is sparse.
  • - Iterating through a sparse hash table wastes CPU cycles: you have to traverse all slots in the table to find the ones with valid entries.
  • - The efficiency of a hash table is only as good as its hash function. A bad one introduces too many key collisions and leads to bad performance (a sketch of a decent string hash follows this list).
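On that last point, here is a minimal sketch of the kind of string hash that works well for filename keys. This is the well-known djb2 hash, adapted for wide-character filenames; it is shown purely as an illustration and is not necessarily the hash function CHelpLib actually uses.

#include <windows.h> // for WCHAR

// djb2 string hash (Daniel J. Bernstein), adapted for wide-char
// filenames. Illustrative only - not necessarily CHelpLib's hash.
unsigned long HashFilename(const WCHAR *pszFilename, unsigned long nTableSize)
{
    unsigned long hash = 5381;
    WCHAR ch;
    while ((ch = *pszFilename++) != L'\0')
    {
        hash = ((hash << 5) + hash) + ch; // hash * 33 + ch
    }
    return hash % nTableSize;
}

The multiply-by-33-and-add step spreads typical filename strings well across the table's slots, which keeps key collisions - and therefore lookup chains - short.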

One other data structure to consider was a binary search tree. It offers O(log n) average insert and lookup times and is more space efficient than a hash table. However, there is also the implementation complexity of rebalancing a binary tree on inserts, and I did not have a binary tree implementation in C. So I finally decided that the hash table was the way to go: (key, value) = (filename, fileinfo-struct). This hash table would be associated with a folder object (DIRINFO) to hold all files within the folder.

The folder info (DIRINFO) and file info (FILEINFO) structures used in the code:

// Structure to hold the files that have same name within
// the same directory tree. This is required because the hashtable
// in DIRINFO uses filename as key, which means there will be name
// conflicts.
typedef struct _dupWithin
{
    int nCurFiles;
    int nCurSize;
    PFILEINFO *apFiles;
}DUPFILES_WITHIN, *PDUPFILES_WITHIN;

typedef struct _DirectoryInfo
{
    WCHAR pszPath[MAX_PATH];
    int nDirs;
    int nFiles;

    // Was hash compare used when this DIRINFO was built?
    BOOL fHashCompare;

    // Use a hashtable to store file list.
    // Key is the filename, value is a FILEINFO structure - if hash compare is turned OFF
    // Key is hash string, value is a PCHL_LLIST that is the
    // linked list of files with the same hash string - if hash compare is turned ON
    CHL_HTABLE *phtFiles;

    // List of FILEINFO of files that have the same name in 
    // the same dir tree. - if hash compare is turned OFF
    DUPFILES_WITHIN stDupFilesInTree;

}DIRINFO, *PDIRINFO;

// Structure to hold information about a file
typedef struct _FileInfo
{
    BOOL fIsDirectory;
    BOOL fAccessDenied;

    // This structure must know whether the file is a duplicate,
    // because otherwise the directory would have to hold an
    // additional list of duplicate files.
    BYTE bDupInfo;
    
    LARGE_INTEGER llFilesize;
    BYTE abHash[HASHLEN_SHA1];
    SYSTEMTIME stModifiedTime;
    FILETIME ftModifiedTime;

    WCHAR szPath[MAX_PATH];
    WCHAR szFilename[MAX_PATH];

}FILEINFO, *PFILEINFO;

And so I started writing the code and reached a good first version that could find duplicates based on a file's name, size, and last-modified date. I was even able to use it to find duplicates and delete them. Great! The tool served its initial purpose. Next, I wanted to implement recursive diff, i.e., diff a full folder tree and not just one level. Traversing the folder tree is easy: a BFS is quite trivial once you have a queue data structure. I did not have a queue implementation, but I did have a linked-list implementation, so I just added a queue to my CHelpLib project using the linked list as the underlying data structure.
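To illustrate, here is a minimal sketch of a BFS folder traversal driven by a bare-bones linked-list queue, built on the Win32 FindFirstFileW/FindNextFileW APIs. The queue is inlined here to keep the sketch self-contained (the tool itself uses CHelpLib's queue), and error handling is mostly elided; the actual traversal code may differ.

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

// A minimal FIFO node holding a folder path to visit.
typedef struct _QNODE
{
    WCHAR szPath[MAX_PATH];
    struct _QNODE *pNext;
} QNODE;

// Breadth-first traversal of a folder tree. Sketch only - FDiffDelete
// uses the queue from CHelpLib and builds FILEINFO entries instead of
// printing. (Allocation failures are not handled in this sketch.)
void TraverseBfs(const WCHAR *pszRoot)
{
    QNODE *pHead = NULL, *pTail = NULL, *pCur = NULL, *pNew = NULL;
    WCHAR szSearch[MAX_PATH];
    WIN32_FIND_DATAW fd;
    HANDLE hFind;

    // Seed the queue with the root folder.
    pNew = (QNODE*)calloc(1, sizeof(QNODE));
    wcscpy_s(pNew->szPath, MAX_PATH, pszRoot);
    pHead = pTail = pNew;

    while (pHead != NULL)
    {
        // Dequeue the next folder to scan.
        pCur = pHead;
        pHead = pHead->pNext;
        if (pHead == NULL) pTail = NULL;

        swprintf_s(szSearch, MAX_PATH, L"%s\\*", pCur->szPath);
        hFind = FindFirstFileW(szSearch, &fd);
        if (hFind != INVALID_HANDLE_VALUE)
        {
            do
            {
                if (wcscmp(fd.cFileName, L".") == 0 ||
                    wcscmp(fd.cFileName, L"..") == 0)
                {
                    continue;
                }
                if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
                {
                    // Enqueue sub-folders for a later visit.
                    pNew = (QNODE*)calloc(1, sizeof(QNODE));
                    swprintf_s(pNew->szPath, MAX_PATH, L"%s\\%s",
                               pCur->szPath, fd.cFileName);
                    if (pTail != NULL) { pTail->pNext = pNew; pTail = pNew; }
                    else { pHead = pTail = pNew; }
                }
                else
                {
                    // A file: this is where a FILEINFO would be built
                    // and inserted into the hash table.
                    wprintf(L"%s\\%s\n", pCur->szPath, fd.cFileName);
                }
            } while (FindNextFileW(hFind, &fd));
            FindClose(hFind);
        }
        free(pCur);
    }
}

With traversal out of the way, I had to tackle the problem of how to store the files' information when recursively traversing the folder tree. Two options came to my mind: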

  1. Use a tree of folder-info structs to hold information about all files in each folder of the tree, starting from the specified folder as the root. This brings the complication of finding duplicate files between two trees. I vaguely remembered a sub-graph comparison problem from my algorithms class. But wait, this wasn't even a sub-graph comparison problem. It was more trivial than that: how the heck do I search for a file by name across the two folder trees with an algorithm better than O(n*n)? Remember that a duplicate file could be in any folder in the tree rooted at the specified folder.
  2. Use only one hash table, as described earlier, to store the file information from all folders in the tree. This offers the same fast inserts, and finding duplicate files is trivial since the file name is the key.
I chose the second option. So, files from all folders in the tree go into a single hash table. I started writing the code and soon realized another issue to tackle: there could be files with the same name within one folder tree. This is a common occurrence - for example, music files named "Track 1", "Track 2". As we all know, a hash table cannot handle duplicate keys; depending on the implementation, it will either overwrite the existing value or fail to insert when the key already exists. So I had to come up with a way to store duplicate file names within a folder tree. I chose to put the duplicate files in a separate array of file-information structs (the DUPFILES_WITHIN struct shown earlier; a sketch of how it can be maintained follows below). Yes, lookups in it are slow - O(1) insert but O(n) worst-case lookup - but I did not expect many duplicate file names in a folder tree in the common case. This worked out well and it was pretty fast too: I was able to diff the System32 folder, with ~2k files, in under 2 seconds on my Core i7 laptop.
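Here is a minimal sketch of how that overflow array could be maintained, written against the DUPFILES_WITHIN and FILEINFO structs shown earlier. The function names, growth policy, and error handling are my assumptions for illustration; the tool's actual code may differ.

#include <windows.h>
#include <stdlib.h>
#include <wchar.h>

// Append a FILEINFO pointer to the overflow array of same-name files,
// doubling the array's capacity when it is full. Sketch only.
BOOL AddToDupWithin(PDUPFILES_WITHIN pDup, PFILEINFO pFile)
{
    if (pDup->nCurFiles >= pDup->nCurSize)
    {
        int nNewSize = (pDup->nCurSize == 0) ? 8 : pDup->nCurSize * 2;
        PFILEINFO *apNew = (PFILEINFO*)realloc(pDup->apFiles,
                               nNewSize * sizeof(PFILEINFO));
        if (apNew == NULL)
        {
            return FALSE;
        }
        pDup->apFiles = apNew;
        pDup->nCurSize = nNewSize;
    }
    pDup->apFiles[pDup->nCurFiles++] = pFile;
    return TRUE;
}

// Linear scan to check whether a filename is already recorded as a
// duplicate: O(n) worst case, acceptable when duplicate names are rare.
BOOL IsDupName(const DUPFILES_WITHIN *pDup, const WCHAR *pszFilename)
{
    int i;
    for (i = 0; i < pDup->nCurFiles; ++i)
    {
        if (_wcsicmp(pDup->apFiles[i]->szFilename, pszFilename) == 0)
        {
            return TRUE;
        }
    }
    return FALSE;
}

The intended flow is simple: when an insert finds that the filename key already exists in the hash table, the new FILEINFO is diverted into this array instead of replacing the existing entry.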

Can I Haz Moar Features?

The next feature to add was file comparison using a hash of the file contents (a file digest). I decided to use SHA1 as my hashing function, since MD5 is known to be considerably weaker against collisions. Then again, SHA1's collision resistance has also been seriously weakened by known attacks, so I should probably use SHA2? I wanted to balance speed and collision resistance, and SHA1 is slightly faster than SHA2, however small the difference. I know, I know, I didn't put too much effort into thinking through this part; I can change the hash function easily enough. Anyway, coming back to the feature. In order to find duplicate files using their hashes, file names no longer matter; only the file digest does. The file name cannot be used as the hash table key any more. Instead, the key has to be the 160-bit (20-byte) hash value of a file, represented as a string. This brings another issue: files with the same hash (duplicate files) within the same folder tree. I solved this by having a list of file-info structs as the value of the hash table entry. So, the hash table for the hash-comparison feature was (key, value) = (hash-value-as-string, list-of-file-info-structs). A sketch of the hashing step is shown below.
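The code listings above don't show the hashing step, so here is a minimal sketch of one way to do it on Windows: computing a file's SHA-1 digest with the Win32 CryptoAPI and formatting it as the hex-string key described above. The function name and details are my assumptions; FDiffDelete's actual implementation may differ. (Link with Advapi32.lib.)

#include <windows.h>
#include <wincrypt.h>
#include <stdio.h>

#define HASHLEN_SHA1 20 // SHA-1 digest length in bytes, as in FILEINFO

// Compute the SHA-1 digest of a file and format it as a 40-char hex
// string, suitable as a hash table key. Sketch using Win32 CryptoAPI.
BOOL ComputeFileSha1(const WCHAR *pszFilePath, WCHAR *pszHashStr, int cchHashStr)
{
    BOOL fOk = FALSE;
    HCRYPTPROV hProv = 0;
    HCRYPTHASH hHash = 0;
    HANDLE hFile = INVALID_HANDLE_VALUE;
    BYTE abHash[HASHLEN_SHA1];
    BYTE abBuffer[4096];
    DWORD cbHash = HASHLEN_SHA1;
    DWORD cbRead = 0;
    int i;

    if (cchHashStr < (HASHLEN_SHA1 * 2 + 1))
    {
        return FALSE; // caller's buffer is too small
    }

    hFile = CreateFileW(pszFilePath, GENERIC_READ, FILE_SHARE_READ,
                        NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
    {
        return FALSE;
    }

    if (!CryptAcquireContext(&hProv, NULL, NULL, PROV_RSA_FULL, CRYPT_VERIFYCONTEXT))
        goto cleanup;
    if (!CryptCreateHash(hProv, CALG_SHA1, 0, 0, &hHash))
        goto cleanup;

    // Stream the file through the hash object in 4KB chunks.
    while (ReadFile(hFile, abBuffer, sizeof(abBuffer), &cbRead, NULL) && cbRead > 0)
    {
        if (!CryptHashData(hHash, abBuffer, cbRead, 0))
            goto cleanup;
    }

    if (!CryptGetHashParam(hHash, HP_HASHVAL, abHash, &cbHash, 0))
        goto cleanup;

    // Format the 20-byte digest as a null-terminated hex string.
    for (i = 0; i < (int)cbHash; ++i)
    {
        swprintf_s(pszHashStr + (i * 2), 3, L"%02x", abHash[i]);
    }
    fOk = TRUE;

cleanup:
    if (hHash) CryptDestroyHash(hHash);
    if (hProv) CryptReleaseContext(hProv, 0);
    CloseHandle(hFile);
    return fOk;
}

With a key like this, an insert just looks up the hex string in the table and appends the FILEINFO to the linked list stored there, or creates a new single-entry list if the key is absent.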

By this time, the code had become quite complicated, mainly due to the two very different ways of using the hash table to store file information: one keyed on the file name and the other on the file's hash value. So I decided to separate the folder traversal code for the usual comparison and the hash comparison features. This made it much easier to maintain and read the code.

Lessons learned

  • The data structures used can make or break a program. They can make it fast or painfully slow, a memory hog or space efficient, easy to maintain or PITA spaghetti code. Choose wisely.
  • Start with an initial set of requirements and design accordingly. As you add more requirements, remember to go back to the design board, since new requirements may force you to rethink the whole design (worst case).
  • Object-oriented concepts are your friend in making any challenging problem easier to solve and implement in code. Encapsulation and abstraction are very powerful and quite sufficient; you do not necessarily have to use polymorphism and inheritance all the time.

Source code is on GitHub at: https://github.com/shishir993/fdiffdelete. Built with Microsoft VS2010 Professional and Windows SDK 8.100.26837. It depends on my other C library, CHelpLib.

Some screenshots of the tool at work: