
October 27, 2014

How Big Are Your Functions?

A couple of weeks back, I needed to find out the code size of the functions in a program I had written. I was just curious as to which functions were the biggest space hoggers. By 'code size of a function', I mean the number of bytes the function's machine code occupies in the .text section of the binary file (or in memory when loaded). This does not include the stack/heap space used by the function. This exercise comes in handy when you need to reduce your program's memory footprint; in other words, when you want to optimize for space. Of course, optimizing for space includes stack/heap optimization along with code-size reduction. Keep in mind that code size can noticeably affect a program's memory footprint if you are using a lot of macros and inline functions.

Disassembler to the rescue!

The first thing that came to my mind was to use dumpbin to disassemble the binary and find the code size of a function by subtracting the function's start address from its end address. This works fine but is tedious if you want to measure the size of multiple (or all) functions in a program.
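The snippet below came from a dumpbin disassembly, produced with something along these lines (the output file name here is just an example):

dumpbin /DISASM FDiffDelete.exe > FDiffDelete.disasm.txt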

?DestroyDirInfo_NoHash@@YAXPAU_DirectoryInfo@@@Z:
  00413560: 55                 push    ebp
  00413561: 8B EC              mov     ebp,esp
  00413563: 81 EC D8 00 00 00  sub     esp,0D8h
  ...
  ...
  00413662: 8B E5              mov     esp,ebp
  00413664: 5D                 pop     ebp
  00413665: C3                 ret

This is a snippet from the disassembly of the function DestroyDirInfo_NoHash, one of the functions in the FDiffDelete program. So we can calculate the code size as:
0x00413665 - 0x00413560 + 1 = 0x106 bytes (or 262 bytes)


DIA, I've got my eye on you

Some time back, a colleague mentioned that the Debug Interface Access (DIA) SDK can be used for the same purpose. I didn't get more details from him that day and the topic just slipped my mind until this weekend. I started digging around MSDN to find out how I could use the DIA SDK to determine the code size of functions.

Reading through the articles, I learned that DIA is an abstraction layer over PDB files and is implemented as an in-proc COM server. A PDB file is the Program DataBase file that holds debug information for a binary file - stuff like local variable addresses, function addresses, etc. So I thought - 'Yes, this could give me the code size of functions'. Digging further into the articles, I found that what you need is:
  • Path to the PDB file corresponding to the exe/dll file whose functions you want to analyze.
  • (OR) Path to the exe/dll file and optionally a search path where to look for the corresponding PDB file.
With the above information in hand, the process of enumerating all functions in the binary and finding out their size is quite straightforward. The only hurdle is the COM-specific plumbing if you aren't familiar with it. The basic outline of using DIA for our 'code sizing' purpose is shown below:

Control Flow
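In code, the control flow is: create the DIA data source, load the debug info (from the PDB directly, or via the exe/dll plus a search path), open a session, get the global scope symbol, enumerate its SymTagFunction children, and read each function's name and length. Below is a minimal sketch of that flow - error handling trimmed, variable names are mine, and it is not the exact CodeSizer code; the interfaces and methods come from the DIA SDK's dia2.h:

#include <stdio.h>
#include <atlbase.h>
#include <dia2.h>      // DIA SDK; link against diaguids.lib

// Assumes COM has already been initialized by the caller.
void PrintFunctionSizes(LPCOLESTR pszExePath, LPCOLESTR pszSearchPath)
{
    // DIA is implemented as an in-proc COM server (msdiaXXX.dll)
    CComPtr<IDiaDataSource> pSource;
    CoCreateInstance(CLSID_DiaSource, NULL, CLSCTX_INPROC_SERVER,
                     __uuidof(IDiaDataSource), (void**)&pSource);

    // Either loadDataFromPdb(pdbPath), or point it at the exe/dll and
    // let it locate the matching PDB using the search path.
    pSource->loadDataForExe(pszExePath, pszSearchPath, NULL);

    CComPtr<IDiaSession> pSession;
    pSource->openSession(&pSession);

    CComPtr<IDiaSymbol> pGlobalScope;
    pSession->get_globalScope(&pGlobalScope);

    // Enumerate all function symbols under the global scope
    CComPtr<IDiaEnumSymbols> pEnumFuncs;
    pGlobalScope->findChildren(SymTagFunction, NULL, nsNone, &pEnumFuncs);

    IDiaSymbol* pFunc = NULL;
    ULONG celt = 0;
    while (SUCCEEDED(pEnumFuncs->Next(1, &pFunc, &celt)) && celt == 1)
    {
        BSTR bstrName = NULL;
        ULONGLONG ullSize = 0;
        pFunc->get_name(&bstrName);    // function name
        pFunc->get_length(&ullSize);   // size of the function's code in bytes
        wprintf(L"%-40s %llu\n", bstrName, ullSize);
        SysFreeString(bstrName);
        pFunc->Release();
    }
}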

I've uploaded this code to GitHub here: CodeSizer. It shows the undecorated function names and their sizes in bytes, in descending order of size. It also has options to show all functions, specific functions, or functions that match a specified sub-string.

Going back to our earlier example of calculating the size of the function DestroyDirInfo_NoHash, the output from CodeSizer agrees - the size is 262 bytes:
> CodeSizer.exe  
> /f ..\..\FDiffDelete\Source\Debug\FDiffDelete.exe 
> /s DestroyDirInfo_NoHash
Function Name                  Size In Bytes
-------------                  -------------
DestroyDirInfo_NoHash          262
Enumerated #functions = 1

Sizes of all functions that have 'Build' as a sub-string:
> CodeSizer.exe 
> /f ..\..\FDiffDelete\Source\Debug\FDiffDelete.exe 
> /r Build
Function Name                  Size In Bytes
-------------                  -------------
BuildFilesInDir_NoHash         2620
BuildFilesInDir_Hash           2350
BuildDirTree_Hash              938
BuildDirTree_NoHash            938
BuildDirInfo                   889
BuildFilesInDir                107
BuildDirTree                   99
Enumerated #functions = 7

Thus, a manual task becomes an easy automation that can be used frequently in the future. This is what we programmers are good at, isn't it? We are lazy, and that's a good thing, in a way!

Peeking Into cryptbase.dll

I was curious to look at the sizes of some system DLLs and chose to look into the AES functions in cryptbase.dll. A strange thing I noticed while doing this was that I couldn't get function information from this DLL (or from ntdll.dll either) using the same technique as for my own DLLs. I had to look for public symbols in the PDB's global scope and then filter in only the functions. My guess is that the public PDBs for system DLLs do not carry debug information for non-exported functions. Here are the top 10 largest AES functions from cryptbase.dll:

Function Name                  Size In Bytes
-------------                  -------------
AesExpandKey                   864
AesCbcDecrypt                  832
AesCbcEncrypt                  752
AesDecrypt                     656
AesEncrypt                     640
AesCtrRng_Generate             448
AesCtrRng_Instantiate          292
AesCtrRng_Update               292
AesCtrRng_Reseed               216
AesCtr_safe_startup            172
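For completeness, here is a hedged sketch of that fallback, continuing from the earlier snippet's pGlobalScope: enumerate the public symbols in the global scope and keep only those flagged as functions. Again, this is illustrative rather than the exact CodeSizer code.

CComPtr<IDiaEnumSymbols> pEnumPublics;
pGlobalScope->findChildren(SymTagPublicSymbol, NULL, nsNone, &pEnumPublics);

IDiaSymbol* pSym = NULL;
ULONG celt = 0;
while (SUCCEEDED(pEnumPublics->Next(1, &pSym, &celt)) && celt == 1)
{
    BOOL fIsFunction = FALSE;
    // get_function() tells whether this public symbol refers to a function
    if (SUCCEEDED(pSym->get_function(&fIsFunction)) && fIsFunction)
    {
        BSTR bstrName = NULL;
        ULONGLONG ullSize = 0;
        pSym->get_name(&bstrName);
        pSym->get_length(&ullSize);    // code size reported for the public symbol
        wprintf(L"%-40s %llu\n", bstrName, ullSize);
        SysFreeString(bstrName);
    }
    pSym->Release();
}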

What's Next?

The DIA SDK, although very powerful, is quite tedious to use because it takes a lot of function calls to get anything out. So building an abstraction layer on top of it, such as in the CodeSizer project, makes it simpler. Such an abstraction could even serve as the DIA front-end in a debugger, since a debugger makes heavy use of PDB files to show debug information.

October 16, 2014

Interlocked Instructions

In this post, I will discuss the implementation of the InterlockedIncrement API (and its counterpart, InterlockedDecrement) found in the Win32 SDK. I will also talk about how it is used in COM to track object lifetimes. This API is declared in WinBase.h and exported by kernel32.dll, as specified in MSDN. It is one of the many Interlocked functions available in the Win32 SDK. These APIs help in implementing various synchronization mechanisms such as semaphores.

I first came across InterlockedIncrement when going through a COM application's source code. Every COM class implements the IUnknown interface, and this interface has the AddRef and Release functions. As specified in the MSDN article, these functions are used together to keep track of a COM object's lifetime through reference counting. A COM server is designed such that a single server object can be used by more than one client. Whenever a COM client obtains a pointer to a COM server object, a call to AddRef must be made, and when the object is no longer needed, a call to Release must be made. This way, the server can keep track of how many references to its object are present and can clean itself up when the reference count goes down to zero. The below snippet of pseudo-code should give you an idea of how this is implemented:

class ComServer : public IUnknown {
    volatile LONG nRefs;
    ULONG AddRef() { return InterlockedIncrement(&nRefs); }   // ++nRefs, atomically
    ULONG Release() { return InterlockedDecrement(&nRefs); }  // --nRefs, atomically
    HRESULT QueryInterface(clsid, **ppObject)
    {
        ...
        AddRef();
        return pointer to ComServer object via *ppObject
    }
};

Since COM objects are designed to be used by multiple clients in a multi-threaded (and multi-processor) environment, incrementing and decrementing the counter variable (nRefs in the above example) must be controlled using some synchronization mechanism. This is where the InterlockedIncrement and InterlockedDecrement functions come into the picture.

Implementation

According to MSDN, InterlockedIncrement "Increments (increases by one) the value of the specified 32-bit variable as an atomic operation", and the InterlockedDecrement function decreases the value of its argument by one. Let us see how these two functions ensure that operations on their arguments are atomic/synchronized, by looking at the assembly code of the function call and the implementation of InterlockedIncrement:
Function Call

Function Implementation
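The original screenshot isn't reproduced here, but the 32-bit implementation essentially boils down to the following five instructions (reconstructed from the description below; the exact listing may differ slightly):

  mov         ecx, dword ptr [esp+4]    ; ecx = address of the variable to increment
  mov         eax, 1                    ; value to add
  lock xadd   dword ptr [ecx], eax      ; atomically: eax = old value, *ecx = old value + 1
  inc         eax                       ; eax = incremented value (the return value)
  ret         4                         ; __stdcall: callee pops its one argument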

The function call is a regular call to a system API. The function implementation is just 5 instructions. The increment is done by a single instruction - the xadd instruction, which has a lock prefix. The Intel Developer's Manual Vol2 says about xadd: "This instruction exchanges the first operand (destination) with the second operand (source), then loads the sum of the two values into the destination operand." So ecx contains the address of the memory variable to be incremented and eax contains 1 before the xadd instruction. The xadd instruction does this:

Assert LOCK signal
        |
 temp = eax + *ecx
        |
    eax = *ecx
        |
   *ecx = temp

We can see that even though it is a single assembly instruction, it involves three steps. The LOCK signal assertion is the one that ensures the instruction executes as an atomic operation.

Looking at the same document for the lock prefix: "Asserts LOCK# signal for duration of the accompanying instruction. In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted".

What this means is that when the LOCK signal is asserted, no other processor or bus agent (such as a memory-mapped IO device) can read from or write to main memory. This ensures that the read-modify-write operation on the variable being incremented is atomic. All other processors and bus agents must wait until this instruction is complete before accessing main memory. The very simplified diagram of the system architecture should make it easy to understand:


What If The Variable is Cached?

There is a possibility that the memory variable to be incremented/decremented is cached in the processor's cache. In this case, the processor's cache coherency mechanism ensures that the operation is atomic. Intel's Developer's Manual Vol3 states that recent processors do not even assert the LOCK signal if the variable is in the cache; instead, they use just cache locking (a technique built on the cache coherency mechanism).

A Look At The 64bit Implementation

The assembly code above is from a 32bit build of the source code. I was curious if the 64bit implementation was any different. So I changed the target architecture in Visual Studio, compiled the code, and got this:
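(The compiler output screenshot isn't reproduced here; reconstructed from the description below, the inlined code is roughly the following two instructions:)

  lea         rax, [nRefs]              ; address of nRefs loaded into rax
  lock inc    dword ptr [rax]           ; atomic increment; the new value isn't captured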

No function call?? What's going on? I then looked at the WinBase.h header and found that this function is declared with the __forceinline compiler directive, which explains why there is no function call. Simply, the address of nRefs is loaded into rax and the inc instruction increments the variable. However, the InterlockedIncrement function is supposed to return the updated value as its return value, and those two instructions don't do that. So I modified the source code to save the return value to a temporary variable to see what the compiler does:

long tmp = InterlockedIncrement(&nRefs);

This time I saw an implementation similar to the 32bit version, but inlined:
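(Again this is a reconstruction rather than the original screenshot; the inlined sequence looks roughly like the 32-bit version:)

  mov         eax, 1
  lock xadd   dword ptr [rcx], eax      ; rcx holds the address of nRefs here (register choice is illustrative)
  inc         eax                       ; eax = incremented value
  mov         dword ptr [tmp], eax      ; the 'returned' value stored into the local tmp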





How Does Decrement Work?

There is no subtract equivalent of the xadd instruction, so what does the compiler do? As you see below, it is quite clever - it just uses xadd with -1, so essentially it achieves *ecx = *ecx + (-1).
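(The screenshot isn't reproduced here; the 32-bit InterlockedDecrement implementation is roughly the following, mirroring the increment version above:)

  mov         ecx, dword ptr [esp+4]    ; ecx = address of the variable to decrement
  mov         eax, 0FFFFFFFFh           ; eax = -1
  lock xadd   dword ptr [ecx], eax      ; atomically: eax = old value, *ecx = old value + (-1)
  dec         eax                       ; eax = decremented value (the return value)
  ret         4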



October 7, 2014

So You Want to Write An API?

The topic of this post is the design of an API. These are things I learnt in the course of writing the CHelpLib library. As most programmers know, API stands for Application Programming Interface. An API consists of a set of functions and data structures that provide some functionality to the user of the API. The most common example is the interface offered by an OS: all functions that are part of the OS, exposed to kernel-mode components like device drivers and user-mode applications such as MS Word, make up the API of the OS. We all know the general guidelines around writing functions and designing data structures - clear functionality, keeping it simple, following naming conventions and so on. Many good programmers do follow these in developing applications, of course, but they are probably much more important in the design and implementation of an API, because an API has the potential to be used much more widely since it is exposed to the public. Compare this to a module within a particular product, which is only used within the product itself and only needs to be understood by the developers working on that product.

Some points that I incorporated into the CHelpLib library's APIs:

  • Consistent experience in using all the APIs within the library. This has two aspects: consistency across all functions within the library, and consistency with industry- or platform-specific conventions.
  • Consistency with the industry/platform makes things easier for the programmer using the API. In my case, this meant the Windows platform programming conventions, since I was developing for that platform. Consistency within the API includes:
    • Object(data structures) creation, usage and disposal semantics must be similar across the library.
    • Function signatures and naming must be consistent and follow the Windows API model.
    • All functions must be consistent in the way they handle memory allocation/deallocation. This could be left entirely to the caller or, more usually, shared between the caller and the callee.
    • All functions within the API must follow the same method of error/failure reporting to the caller.
  • A public API must be very well documented and the documentation must be up to date. This means the documentation must include what exactly each function does, how it is supposed to be used, what are the failure scenarios and so on.
  • All known bugs must be documented and made visible to the users.
  • An API/library is meant to provide functionality with as little usage complexity as possible. To this end, the API must expose very few implementation details (whatever it does expose must be documented, however).
  • If the API consists of functions that provide functionality such as algorithms and data structures, then the algorithm design, memory requirements and expected performance should be specified so that users can make informed choices and use the API effectively.

The most important changes I made to the CHelpLib library in order to incorporate the above points:

  • Simplifying the naming of all functions and avoiding naming collisions with system functions. All function names follow this pattern: "CHL_{module_name}{function_name}". For example, CHL_DsCreateHT - 'Ds' indicates DataStructures, 'CreateHT' indicates that the function creates a hashtable.
  • Standardizing the error/failure reporting using HRESULT as the return type of most functions. This type is the recommended return type on the Windows platform.
  • Since the return type is used only to indicate function success or failure, programmatic return values are passed back to the caller via 'out' parameters, i.e., using pointer parameters. For example: CHL_DsCreateHT(**pHTableOut, nEstEntries, keyType, valType, fValInHeapMem). This function creates a hashtable and returns a pointer to the hashtable object in the first parameter pHTableOut (a sketch of this calling pattern appears after this list).
  • Each module has its own header file. API users only need to include the used module's header file. Earlier, all exported functions and data structures were specified within a single header file.
  • Usage of Microsoft's SAL2 (Source-code Annotation Language) to annotate the function parameters. This enables better static code checking and also conveys the meaning of the parameters better.
  • The data structures - hashtable, linkedlist and queue - all have identical support for storage and retrieval of values (and keys).
  • More test code was written to test the new changes and older tests provided the necessary regressions.
  • Writing documentation for the full API. This is still a work in progress.
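To make the conventions above concrete, here is a hedged sketch of the calling pattern. The parameter names come from the example in this post; the exact types and the key/value selectors in the real CHelpLib header may differ.

// HRESULT for success/failure, SAL2 annotations, result via an 'out' parameter
HRESULT CHL_DsCreateHT(
    _Out_ CHL_HTABLE **ppHTableOut,   // receives the created hashtable
    _In_  int nEstEntries,            // estimated number of entries
    _In_  int keyType,                // key type selector (placeholder type here)
    _In_  int valType,                // value type selector (placeholder type here)
    _In_  BOOL fValInHeapMem);        // whether values live in heap memory

// Typical call site
int keyType = 0, valType = 0;         // placeholder selectors for illustration
CHL_HTABLE *pHTable = NULL;
HRESULT hr = CHL_DsCreateHT(&pHTable, 256, keyType, valType, TRUE);
if (SUCCEEDED(hr))
{
    // ... use the hashtable, then destroy it via the corresponding CHL_Ds* function ...
}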
The code for CHelpLib, along with the test code and documentation, can be found on GitHub: https://github.com/shishir993/CHelpLib. Future work for this particular library will see additions in the areas of data structures, algorithms and some stuff around the Windows process access APIs.

September 7, 2014

Learnings From A Diffing Tool

TL;DR - Source code and screenshots.

Motivation

This post is about the things I learned in the course of developing a folder diffing tool I wrote recently as a pastime project. A couple of weeks back I was wondering if there was any tool that I could use to diff two folders to find duplicate files and delete those from either folder. I wanted to see if I had any duplicates in the files that I had copied from my friend. So, I started thinking about a diffing tool that offers these capabilities. I could think of BeyondCompare and a few other diffing tools, but off the top of my head I couldn't think of any tool that could diff folders and help me delete the duplicates from either of the folders.

I said to myself that this was the perfect opportunity for a project! So I went home that day and immediately wrote down my requirements for the diffing tool. I decided to call it 'FDiffDelete' - Folder Diff and Delete. The initial set of requirements that I came up with was:

  1. Compare two folders (a left folder and a right folder), highlight the common/duplicate files (name and size match - later, use hash).
  2. Give choice to user to delete the common/duplicate files from any folder.
  3. Ver1.0: Only single level files in a folder. Later, recursive compare.
  4. UI stuff: Has a dialog box. The box is split into two. Each half has a text area to specify the folder path and a list view to display the file list in the specified folder(s). There is also a delete button in each half of the dialog box that enables the user to delete selected files (or all duplicate files) from either of the two folders.
I later realized that I should compare files based on their Last Modified Date as well, so I added that to the requirements list.
  • Duplicate files determined based on name, size, last-modified-date (later, or by using the file hash (SHA1)).

Nitty gritty design details

With the requirements set, I started designing the tool. I needed a data structure to actually hold information about the files in a folder and to also allow comparisons between files from the two folders. The very first thing that came to mind was a hash table to store file information of all files in the folder.

Pros and cons of hash table:

  • + Fast inserts and lookups.
  • + Pretty straight-forward implementation in C. There is no need for complicated stuff like rebalancing in case of a binary tree.
  • + I already had a hash table implementation in my CHelpLib project :-) .

  • - Not very space efficient when the size of the hash table is much larger than the actual number of items in the table, i.e., when it is sparse.
  • - Iterating through the hash table wastes CPU cycles when it is sparse. You have to traverse all slots in the table to find the ones that have valid entries.
  • - The efficiency of a hash table is only as good as its hash function. A bad one will introduce too many key collisions and will lead to bad performance.

The other data structure to consider was a tree. A balanced binary tree offers O(log n) insert and lookup times and is more space efficient compared to a hash table. However, there is the implementation complexity of re-balancing a binary tree on inserts, and I did not have a binary-tree implementation in C. So I finally decided that the hash table was the way to go. The hash table would be (key, value) = (filename, fileinfo-struct), and it would be associated with a folder object (DIRINFO) to hold all files within the folder.

The folder info(DIRINFO) and file info(FILEINFO) structures used in the code:

// Structure to hold the files that have same name within
// the same directory tree. This is required because the hashtable
// in DIRINFO uses filename as key, which means there will be name
// conflicts.
typedef struct _dupWithin
{
    int nCurFiles;
    int nCurSize;
    PFILEINFO *apFiles;
}DUPFILES_WITHIN, *PDUPFILES_WITHIN;

typedef struct _DirectoryInfo
{
    WCHAR pszPath[MAX_PATH];
    int nDirs;
    int nFiles;

    // Was hash compare used when this DIRINFO was built?
    BOOL fHashCompare;

    // Use a hashtable to store file list.
    // Key is the filename, value is a FILEINFO structure - if hash compare is turned OFF
    // Key is hash string, value is a PCHL_LLIST that is the
    // linked list of files with the same hash string - if hash compare is turned ON
    CHL_HTABLE *phtFiles;

    // List of FILEINFO of files that have the same name in 
    // the same dir tree. - if hash compare is turned OFF
    DUPFILES_WITHIN stDupFilesInTree;

}DIRINFO, *PDIRINFO;

// Structure to hold information about a file
typedef struct _FileInfo
{
    BOOL fIsDirectory;
    BOOL fAccessDenied;

    // This structure must know whether the file is a duplicate
    // because otherwise the directory must hold an additional
    // list of duplicate files.
    BYTE bDupInfo;
    
    LARGE_INTEGER llFilesize;
    BYTE abHash[HASHLEN_SHA1];
    SYSTEMTIME stModifiedTime;
    FILETIME ftModifiedTime;

    WCHAR szPath[MAX_PATH];
    WCHAR szFilename[MAX_PATH];

}FILEINFO, *PFILEINFO;

And so I started to write the code and reached a good first version that was able to find duplicates based on a file's name, size and last-modified-date. I was even able to use it to find duplicates and delete them. Great! The tool served its initial purpose. Now I wanted to implement recursive diff, i.e., diff a full folder tree and not just one level. Traversing the folder tree is easy: a BFS using a queue is quite trivial once you have a queue data structure (a sketch of this traversal appears a little further below). I did not have a queue implementation, but I did have a linked-list implementation, so I just added a Queue implementation to my CHelpLib project using the linked list as the underlying data structure. Now I had to tackle the problem of how to store the files' information when recursively traversing the folder tree. Two options came to my mind:

  1. Use a tree of folder-info structs to hold information about all files in each folder in the tree, starting from the specified folder as the root. This brings the complication of finding duplicate files between two trees. I vaguely remembered a sub-graph comparison problem from my algorithms class. But wait, this wasn't even a sub-graph comparison problem. It was simpler than that - how the heck do I search for a file by name across the two folder trees with an algorithm better than O(n*n)? Remember that a duplicate file could be in any folder in the tree rooted at the specified folder.
  2. Use only one hash table, as described earlier, to store the files' information from all folders in the tree. This offers the same fast inserts and is trivial to find duplicate files since the file name is a key.
I chose the second option of using a hash table to store file info. So files from all folders in the tree would go into a single hash table. I started writing the code and soon realized another issue to tackle - there could be files with the same name within one folder tree. This is a common occurrence; for example, music files named "Track 1", "Track 2". As we all know, a hash table cannot handle duplicate keys; it will just overwrite or fail to insert when a key already exists, depending upon the implementation. So I had to come up with a way to store duplicate file names within a folder tree. I chose to put the duplicate files in a separate array of file-information structs. This worked out well. Yes, that array is slow for lookups - O(1) insert but O(n) worst-case lookup - but I do not expect too many duplicate file names in a folder tree in the common scenario. This worked well and it was pretty fast too. I was able to diff the System32 folder with ~2k files in under 2 seconds on my Core i7 laptop.
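Here is a hedged sketch of that traversal - a plain BFS over the folder tree using a fixed-size array as the queue. The tool itself uses the CHelpLib queue and inserts a FILEINFO into the DIRINFO hashtable instead of printing, so treat this only as an outline.

#include <windows.h>
#include <stdio.h>

#define QUEUE_CAP 256   // fixed-size queue, enough for this illustration

void TraverseTree(const WCHAR* pszRoot)
{
    WCHAR aszQueue[QUEUE_CAP][MAX_PATH];
    int head = 0, tail = 0;

    wcscpy_s(aszQueue[tail++], MAX_PATH, pszRoot);          // enqueue the root folder

    while (head < tail)                                     // BFS: one folder per iteration
    {
        WCHAR szSearch[MAX_PATH];
        swprintf_s(szSearch, MAX_PATH, L"%s\\*", aszQueue[head]);

        WIN32_FIND_DATAW fd;
        HANDLE hFind = FindFirstFileW(szSearch, &fd);
        if (hFind != INVALID_HANDLE_VALUE)
        {
            do
            {
                if (wcscmp(fd.cFileName, L".") == 0 || wcscmp(fd.cFileName, L"..") == 0)
                    continue;

                if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
                {
                    if (tail < QUEUE_CAP)                   // enqueue sub-folder for a later level
                        swprintf_s(aszQueue[tail++], MAX_PATH, L"%s\\%s",
                                   aszQueue[head], fd.cFileName);
                }
                else
                {
                    // In FDiffDelete: build a FILEINFO and insert it into the folder's
                    // hashtable keyed by file name; name clashes go into DUPFILES_WITHIN.
                    wprintf(L"%s\\%s\n", aszQueue[head], fd.cFileName);
                }
            } while (FindNextFileW(hFind, &fd));
            FindClose(hFind);
        }
        ++head;                                             // dequeue
    }
}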

Can I Haz Moar Features?

The next feature to add was file comparison using the hash of the file content (the file digest). I decided to use SHA1 as my hashing function since MD5 has a higher probability of hash collisions. Although SHA1 has known weaknesses as well, so I should probably use SHA2? I wanted to balance speed and collision resistance; SHA1 is slightly faster than SHA2, however small the difference. I know, I know, I didn't put too much effort into thinking out this part. I can change the hash function easily enough. Anyway, coming back to the feature. In order to find duplicate files using their hashes, the file names no longer matter; only the file digest matters. The file name cannot be used as a key to the hash table in order to find duplicate files. The key to the hash table has to be the 160-bit (20-byte) hash value of a file, represented as a string. This brings another issue - files with the same hash (duplicate files) within the same folder tree. I solved this by having a list of file-info structs as the value of the hash table entry. So, the hash table for the hash-comparison feature was (key, value) = (hash-value-as-string, list-of-file-info-structs).
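For illustration, here is a hedged sketch of computing such a key with the Win32 CryptoAPI - read the file, hash it with SHA1, and hex-encode the 20-byte digest into the string used as the hashtable key. This is not the exact FDiffDelete code; the function name and buffer sizes are mine.

#include <windows.h>
#include <wincrypt.h>    // CryptoAPI; link against advapi32.lib
#include <stdio.h>

#define HASHLEN_SHA1 20

BOOL ComputeFileSha1Key(const WCHAR* pszFile, WCHAR* pszKeyOut, DWORD cchKeyOut)
{
    BOOL fOk = FALSE;
    HCRYPTPROV hProv = 0;
    HCRYPTHASH hHash = 0;

    HANDLE hFile = CreateFileW(pszFile, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        return FALSE;

    if (CryptAcquireContextW(&hProv, NULL, NULL, PROV_RSA_FULL, CRYPT_VERIFYCONTEXT) &&
        CryptCreateHash(hProv, CALG_SHA1, 0, 0, &hHash))
    {
        BYTE abBuf[4096];
        DWORD cbRead = 0;
        // Feed the file contents into the hash object in chunks
        while (ReadFile(hFile, abBuf, sizeof(abBuf), &cbRead, NULL) && cbRead > 0)
            CryptHashData(hHash, abBuf, cbRead, 0);

        BYTE abHash[HASHLEN_SHA1];
        DWORD cbHash = sizeof(abHash);
        if (CryptGetHashParam(hHash, HP_HASHVAL, abHash, &cbHash, 0) &&
            cchKeyOut >= (2 * HASHLEN_SHA1 + 1))
        {
            // Hex-encode the digest into a 40-character string key
            for (DWORD i = 0; i < cbHash; ++i)
                swprintf_s(pszKeyOut + (2 * i), 3, L"%02x", abHash[i]);
            fOk = TRUE;
        }
    }

    if (hHash) CryptDestroyHash(hHash);
    if (hProv) CryptReleaseContext(hProv, 0);
    CloseHandle(hFile);
    return fOk;
}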

By this time, the code had gotten quite complicated, mainly due to the two very different ways of using the hash table to store files' information. One used the file name as the key and the other used the file's hash value. So I decided to separate the folder traversal for the usual comparison and the hash comparison features. This made it much easier to maintain and read the code.

Lessons learned

  • Data structures used can make or break a program. They can either make a program fast or painfully slow, a memory hogger or space efficient, easy to maintain vs. PITA spaghetti code. Choose wisely.
  • Start with an initial set of requirements and design accordingly. As you add more requirements, remember to go back to the design board since you may have to rethink the whole design (worst case) when new requirements are added.
  • Object oriented concepts are your friend in making any challenging problem easy to solve and implement in code. Encapsulation and abstraction are very powerful and quite sufficient. You do not necessarily have to use polymorphism and inheritance all the time.

Source code is on GitHub at: https://github.com/shishir993/fdiffdelete. Built using Microsoft VS2010 Professional and Windows SDK 8.100.26837. It depends on my other C library, CHelpLib.

Some screenshots of the tool at work:

January 21, 2014

Determining Integer Overflow

This topic was running through my mind and I was wondering what the easiest way would be to detect overflow and underflow of integers. I searched the internet and found some answers, but none of them were quite simple. Then it dawned on me that the CPU has an Overflow flag as part of the EFLAGS register! I consulted the Intel SW Developer's manual and found that I was right. There are instructions that test these flags - JO (jump if overflow) for signed integers and JC (jump if carry) for unsigned integers. So, we perform an addition and then immediately check the status of the flags using the JO/JC instructions. The branch instruction targets can then return the correct value if this logic is enclosed within a function that tests the overflow condition. I have added the two small functions that test for overflow of INT and UINT to my CHelpLib git repository, and have included one of them below. Given two integers, this function returns TRUE if adding them causes overflow. It uses assembly instructions to perform the addition and then uses the jo instruction to jump to the location that returns TRUE.

BOOL Chl_fIsOverflowINT(int a, int b)
{
    __asm
    {
        mov eax, dword ptr [a]  ; eax = a
        mov ecx, dword ptr [b]  ; ecx = b
        add eax, ecx            ; perform the addition; EFLAGS.OF is updated
        jo ret_overflow         ; signed overflow => jump to the TRUE path
    }

    return FALSE;

ret_overflow:
    return TRUE;
}
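The unsigned counterpart works the same way, except that unsigned overflow shows up in the carry flag, so jc is used instead of jo. A hedged sketch follows; the actual function name and signature in CHelpLib may differ.

BOOL Chl_fIsOverflowUINT(unsigned int a, unsigned int b)
{
    __asm
    {
        mov eax, dword ptr [a]  ; eax = a
        mov ecx, dword ptr [b]  ; ecx = b
        add eax, ecx            ; perform the addition; EFLAGS.CF is updated
        jc ret_overflow         ; carry set => unsigned overflow
    }

    return FALSE;

ret_overflow:
    return TRUE;
}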