May 15, 2022

The Psychology of Money: Timeless Lessons on Wealth, Greed, and Happiness by Morgan Housel

  • Wealth is what you don't see.
  • Richness is what you see, and what you see may be funded by high leverage (debt).
  • A high savings rate will leave you quite wealthy over time.
  • The amount you can save is the difference between your income and ego.
  • Invest the way YOU are comfortable and for meeting YOUR goals. Do not copy/envy others blindly.
  • Trying to Keep Up With The Joneses? Think again.
  • The highest return that money provides: the freedom to do what you want, when you want, with whom you want and for however long you want. In other words, TIME is the highest return that money can provide.

October 8, 2016

x86 vs. x64 assembly language

I've been planning to add 64bit support to my disassembler (github) for some time now. So I started reading about 64bit assembly in the Intel manuals to see how it differs from 32bit code. To see the differences in live code, I wrote a simple program and looked through the disassembly. Here's what I found out so far. Refer to the source code and disassembly at the bottom while reading -


[Main differences]
1. Absolute code addresses are 32bits vs. 64bits
2. Instructions use the REX prefix byte in 64bit mode to specify 64bit operands. The REX byte is placed immediately before the first opcode byte and after all legacy prefixes.
3. 64bit mode has more registers available so more arguments are passed via registers than on the stack. The call to printf is smaller in 64bit.
4. The REX prefix occupies the range 40h - 4Fh, so the single-byte opcode forms of the INC/DEC instructions, which occupy the same range, are not available in 64bit mode. The ModR/M forms of these instructions (FFh) are still available.

[REX byte]
This is a byte-sized prefix used in 64bit mode to specify 64bit operands and registers of sizes other than 32bits. The format of the REX byte is -
Bit position - 7  6  5  4  3  2  1  0
Bit value    - 0  1  0  0  W  R  X  B

The high nibble is always 0100 which is 4h. So the REX prefix range is 40h – 4Fh.
W bit : 1 – 64bit operand, 0 – operand size determined by CS.D bit (CodeSegment descriptor, refer Intel system programming manual).
R bit : This adds a 4th bit to the reg field in ModR/M byte. Enables addressing the new registers (R8-R15).
X bit : This adds a 4th bit to the Index field in SIB byte.
B bit : This adds a 4th bit to the r/m field in ModR/M or the Base field in SIB or Opcode reg field.

[Dissecting two instructions]

00007FF780AE17F5 48 C7 45 28 00 00 00 00 mov  qword ptr [f0], 0
REX = 48 : which means only the W bit is set, indicating 64bit operand (the qword ptr [f0] operand).

00007FF780AE1876 4C 8B 45 68    mov  r8, qword ptr [fn]
REX   = 4C : which means both W and R bits are set. W indicates 64bit operand and the R bit adds a 4th bit to the reg field in ModRM byte.

MOV   = 8B : this is 64bit register/memory to 64bit register MOV instruction

ModRM = 45 : Mod = 01, R/M = 101 – this is [RBP]+disp so this is the effective address of 'fn'. Register R8 is identified by the 4bit reg field 1000.

Mod     REX.R + Reg     R/M
7 6     .  5 4 3        2 1 0
0 1     1  0 0 0        1 0 1    (the R bit from REX adds a 4th bit to the reg field)

Disp  = 68 : This is the 8bit displacement added to RBP to get effective address of 'fn'.

#include <Windows.h>
#include <stdio.h>

int main()
{
    UINT n = 5;
    size_t f0 = size_t(0);
    size_t f1 = size_t(1);
    size_t fn = f0;
    if (n == 1)
    {
        fn = f0 + f1;
    }
    else if (n > 1)
    {
        for (UINT itr = 2; itr <= n; ++itr)
        {
            fn = f0 + f1;
            f0 = f1;
            f1 = fn;
        }
    }
 
    printf("Fibonacci(%u) = %Iu\n", n, fn);
    return 0;
}


Compiled as 32bit (snippets)
int main()
{
00B217B0 55                   push ebp
00B217B1 8B EC                mov ebp,esp
00B217B3 81 EC FC 00 00 00    sub esp,0FCh

    UINT n = 5;
00B217CE C7 45 F8 05 00 00 00 mov dword ptr [n],5

    size_t f0 = size_t(0);
00B217D5 C7 45 EC 00 00 00 00 mov dword ptr [f0],0

    size_t fn = f0;
00B217E3 8B 45 EC             mov eax,dword ptr [f0]
00B217E6 89 45 D4             mov dword ptr [fn],eax

    fn = f0 + f1;
00B2181A 8B 45 EC             mov eax,dword ptr [f0]
00B2181D 03 45 E0             add eax,dword ptr [f1]
00B21820 89 45 D4             mov dword ptr [fn],eax

    printf("Fibonacci(%u) = %Iu\n", n, fn);
00B21831 8B 45 D4             mov eax,dword ptr [fn]
00B21834 50                   push eax
00B21835 8B 4D F8             mov ecx,dword ptr [n]
00B21838 51                   push ecx
00B21839 68 30 6B B2 00       push offset string "Fibonacci(%u) = %Iu\n" (0B26B30h)
00B2183E E8 DD FA FF FF       call _printf (0B21320h)
00B21843 83 C4 0C             add esp,0Ch

Compiled as 64bit (snippets)
int main()
{
00007FF780AE17D0 40 55                   push rbp
00007FF780AE17D2 57                      push rdi
00007FF780AE17D3 48 81 EC 88 01 00 00    sub rsp,188h

    UINT n = 5;
00007FF780AE17EE C7 45 04 05 00 00 00    mov dword ptr [n],5

    size_t f0 = size_t(0);
00007FF780AE17F5 48 C7 45 28 00 00 00 00 mov qword ptr [f0],0

    size_t fn = f0;
00007FF780AE1805 48 8B 45 28             mov rax,qword ptr [f0]
00007FF780AE1809 48 89 45 68             mov qword ptr [fn],rax

    fn = f0 + f1;
00007FF780AE1852 48 8B 45 48             mov rax,qword ptr [f1]
00007FF780AE1856 48 8B 4D 28             mov rcx,qword ptr [f0]
00007FF780AE185A 48 03 C8                add rcx,rax
00007FF780AE185D 48 8B C1                mov rax,rcx
00007FF780AE1860 48 89 45 68             mov qword ptr [fn],rax

    printf("Fibonacci(%u) = %Iu\n", n, fn);
00007FF780AE1876 4C 8B 45 68             mov r8,qword ptr [fn]
00007FF780AE187A 8B 55 04                mov edx,dword ptr [n]
00007FF780AE187D 48 8D 0D A4 83 00 00    lea rcx,[string "Fibonacci(%u) = %Iu\n" (07FF780AE9C28h)]   
00007FF780AE1884 E8 43 F9 FF FF          call printf (07FF780AE11CCh)

October 23, 2015

Using CRITICAL_SECTION internals to debug

I was recently faced with debugging a not-very-frequently-occurring bug that caused a crash in a user-mode application whenever it was hit. I had a couple of memory dumps from the crashes. Looking at the source code didn't get me anywhere because I couldn't trace through any code path that could cause this. This was C++ code, and I noticed that the crash was an access violation in a class member function: the 'this' object had already been deleted while the member function was still on the stack.

For the sake of this post, I'm going to call the class BadClass and the access-violating member function funcAV(). BadClass has a critical section data member that I'll call m_csBadClass. funcAV() entered the critical section as its very first line of code, then checked whether the m_fShutdown boolean data member was set, and left the critical section just before returning from the function. The function called several other functions while inside the critical section. The class also had a shutdown() function that was always invoked before deletion; it entered the same critical section, set the m_fShutdown flag and returned.

void BadClass::funcAV() {
    EnterCriticalSection(&m_csBadClass);
    if (!m_fShutdown) {
        // do some work
        // call other functions
        // use a data member, say, m_intA <-- access violation reading this
    }
    LeaveCriticalSection(&m_csBadClass);
}

void BadClass::shutdown() {
    EnterCriticalSection(&m_csBadClass);
    m_fShutdown = true;
    LeaveCriticalSection(&m_csBadClass);
}

Since the shutdown() function was always called before object deletion, and knowing that shutdown() entered the m_csBadClass critical section, I concluded that the object was deleted by the same thread on which funcAV() was executing, because the object was deleted while funcAV() was inside the critical section. That is, some callee-of-callee-of-funcAV was deleting the object. Remember that critical sections can be entered recursively by the same thread. You must be wondering now, "Dude, why can't you just place a breakpoint/assertion in the class destructor to see who was calling the destructor?". I could, but not in this case, because there were legitimate cases where the destructor was called and everything was fine. I had to break only when the object was being deleted while funcAV() was still on the call stack and inside the critical section.

Knowing that the only way for the object of BadClass to get deleted is if the critical section was being entered recursively, I placed an assertion in BadClass::shutdown() like this:

void BadClass::shutdown() {
    EnterCriticalSection(&m_csBadClass)
    NT_ASSERT(m_csBadClass.RecursionCount < 2)
    m_fShutdown = true
    LeaveCriticalSection(&m_csBadClass)    
}

Then I let the code run in our test environment overnight, and I had a couple of crash dumps the next morning. These new crashes, to my relief, were from the assertion! Now I knew the exact code path that was causing the object to get deleted, and the fix was easy once the root cause was known. The code path was something like this:

ntdll!breakpoint <-- for the assertion
myDll!BadClass::shutdown
... couple of other functions
myDll!ClassC::func2
myDll!ClassB::func1
myDll!BadClass::funcAV
ntdll!thread-start-function

So knowing the internals of the CRITICAL_SECTION structure certainly made figuring out this bug easy. Other fields in the structure also offer a lot of insight that may one day be useful -

struct _RTL_CRITICAL_SECTION {
    PRTL_CRITICAL_SECTION_DEBUG DebugInfo;
    LONG LockCount;
    LONG RecursionCount;
    HANDLE OwningThread;
    HANDLE LockSemaphore;
    ULONG_PTR SpinCount;
};

DebugInfo - Pointer to a RTL_CRITICAL_SECTION_DEBUG structure which gives more information about this critical section. This structure is actually part of a linked list whose elements are the debug structures of all critical sections used in the current process. This will help in detecting deadlocks during runtime and for other debugging purposes.

LockCount - This field name is a little misleading because it actually encodes more than one datum.
  • Bit0 - Whether the critical section is unlocked (1) or locked (0)
  • Bit1 - Whether the critical section has woken up a waiting thread (0) or not (1)
  • Bit2 thru 31 - #threads waiting on this critical section
RecursionCount - The number of times the owning thread has entered this critical section recursively.

OwningThread - ThreadID of the thread that currently owns this critical section. Zero if the critical section is unlocked, i.e., not owned by any thread.

LockSemaphore - A misleading field name. Usually zero, but when it isn't, it is a HANDLE to an event used by the kernel to signal a waiting thread when another thread leaves this critical section. That is, the kernel internally uses this event to wake up one of the waiting threads when the contended critical section becomes free. I've observed that this is always -1 (see example below) after introducing a contention situation, so I will dig up some more info on how exactly this can be an event handle.

SpinCount - Putting a thread into the waiting state during contention for the critical section is quite an expensive operation because it involves transitioning into kernel mode. If we know that a particular critical section will be held for very short periods at a time but is used by multiple threads, the SpinCount can be given a positive number, say 1000000. This indicates that when there is contention, the thread must loop until the critical section is free or the number of loop iterations reaches the SpinCount value. This is essentially a busy-wait in user mode to see if the contention gets resolved, and it avoids transitioning into kernel mode for a critical section that could be released the next moment.

A sample program to see some of these fields in action:


#include <Windows.h>
#include <stdio.h>

#define NUM_CONTENDING_THREADS      3

#define CS_LOCKCOUNT_UNLOCKED(lc)   ((lc) & 0x00000001)             // bit0
#define CS_LOCKCOUNT_TH_WOKEN(lc)   (((lc) & 0x00000002) >> 1)      // bit1
#define CS_LOCKCOUNT_CONTENTION(lc) (((~(lc)) & 0xfffffffc) >> 2)   // bit2 to 31


void DumpCSInfo(PCWSTR pszMsg, PCRITICAL_SECTION pcs)
{
    wprintf(L"[%5u] %-16s contention = %u, locked = %-3s, woken = %-3s, lockSem = 0x%p\n",
        GetCurrentThreadId(),
        pszMsg,
        CS_LOCKCOUNT_CONTENTION(pcs->LockCount),
        CS_LOCKCOUNT_UNLOCKED(pcs->LockCount) ? L"no" : L"yes",
        CS_LOCKCOUNT_TH_WOKEN(pcs->LockCount) ? L"no" : L"yes",
        pcs->LockSemaphore);
}

DWORD WINAPI GrabCriticalSection(LPVOID lpParameter)
{
    PCRITICAL_SECTION pcs = (PCRITICAL_SECTION)lpParameter;
    
    DumpCSInfo(L"Waiting", pcs);

    EnterCriticalSection(pcs);
    
    DumpCSInfo(L"Inside", pcs);
    Sleep(5000);

    LeaveCriticalSection(pcs);
    DumpCSInfo(L"Left", pcs);

    return ERROR_SUCCESS;
}

int main()
{
    DWORD tid;
    HANDLE hThreads[NUM_CONTENDING_THREADS];
    
    CRITICAL_SECTION cs;

    WCHAR szMsg[16];

    InitializeCriticalSection(&cs);
    DumpCSInfo(L"Init", &cs);

    EnterCriticalSection(&cs);
    DumpCSInfo(L"Inside", &cs);

    for (int i = 0; i < ARRAYSIZE(hThreads); ++i)
    {
        hThreads[i] = CreateThread(NULL, 0, GrabCriticalSection, &cs, CREATE_SUSPENDED, &tid);
    }

    for (int i = 0; i < ARRAYSIZE(hThreads); ++i)
    {
        ResumeThread(hThreads[i]);
        Sleep(1000);
        swprintf_s(szMsg, ARRAYSIZE(szMsg), L"R th %d", i);
        DumpCSInfo(szMsg, &cs);
    }

    Sleep(2000);
    LeaveCriticalSection(&cs);
    DumpCSInfo(L"Left", &cs);

    WaitForMultipleObjects(ARRAYSIZE(hThreads), hThreads, TRUE, INFINITE);

    DumpCSInfo(L"Exiting", &cs);
    return 0;
}

Output:

[10192] Init     contention = 0, locked = no , woken = no , lockSem = 0x0000000000000000
[10192] Inside   contention = 0, locked = yes, woken = no , lockSem = 0x0000000000000000
[ 5676] Waiting  contention = 0, locked = yes, woken = no , lockSem = 0x0000000000000000
[10192] R th 0   contention = 1, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[ 9744] Waiting  contention = 1, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10192] R th 1   contention = 2, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10188] Waiting  contention = 2, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10192] R th 2   contention = 3, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10192] Left     contention = 2, locked = no , woken = yes, lockSem = 0xFFFFFFFFFFFFFFFF
[ 5676] Inside   contention = 2, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[ 5676] Left     contention = 1, locked = no , woken = yes, lockSem = 0xFFFFFFFFFFFFFFFF
[ 9744] Inside   contention = 1, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[ 9744] Left     contention = 0, locked = no , woken = yes, lockSem = 0xFFFFFFFFFFFFFFFF
[10188] Inside   contention = 0, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10188] Left     contention = 0, locked = no , woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10192] Exiting  contention = 0, locked = no , woken = no , lockSem = 0xFFFFFFFFFFFFFFFF

October 27, 2014

How Big Are Your Functions?

A couple of weeks back, there was a need to find out the code size of functions in a program I had written. I was just curious as to which functions were the biggest space hoggers. By 'code size of functions', I mean the number of bytes a function's machine code occupies in the .text section of the binary file (or in memory when loaded). This does not include the stack/heap space used by the function. This exercise comes in handy when there is a need to reduce your program's memory footprint; in other words, when you want to optimize for space. Of course, optimizing for space includes stack/heap optimization along with code size reductions. Keep in mind that code size does affect a program's memory footprint if you are using a lot of macros and inline functions.

Disassembler to the rescue!

The first thing that came to my mind was to use dumpbin to disassemble the binary code and find the code size of a function by subtracting the function's start address from its end address. This works fine but is tedious if you want to measure the size of multiple (or all) functions in a program.

?DestroyDirInfo_NoHash@@YAXPAU_DirectoryInfo@@@Z:
  00413560: 55                 push    ebp
  00413561: 8B EC              mov     ebp,esp
  00413563: 81 EC D8 00 00 00  sub     esp,0D8h
  ...
  ...
  00413662: 8B E5              mov     esp,ebp
  00413664: 5D                 pop     ebp
  00413665: C3                 ret

This is a snippet from the disassembly of the function DestroyDirInfo_NoHash, one of the functions in FDiffDelete program. So we can calculate the code size as:
0x00413665 - 0x00413560 + 1 = 0x106 bytes (or 262 bytes)


DIA, I've got my eye on you

Sometime back, a colleague mentioned that the Debug Interface Access (DIA) SDK can be used for the same purpose. I didn't get more details from him that day and the topic just slipped my mind until this weekend. I started digging around MSDN to find out how I can use the DIA SDK to find out the functions' code size.

Reading through the articles, I learned that the DIA is an abstraction layer over PDB files and is implemented as an in-proc COM server. A PDB file is the Program DataBase file that holds debug information for a binary file - stuff like local variable address, function address, etc. So I thought - 'Yes, this could give me the code size of functions'. Digging further into the articles, I found that what you need is:
  • Path to the PDB file corresponding to the exe/dll file whose functions you want to analyze.
  • (OR) Path to the exe/dll file and optionally a search path where to look for the corresponding PDB file.
With the above information in hand, the process of enumerating all functions in the binary and finding their sizes is quite straightforward. The only hurdle is the COM-specific machinery if you aren't familiar with it. The basic outline of using DIA for our 'code sizing' purpose is shown below:

[Control flow diagram: load the PDB via IDiaDataSource, open an IDiaSession, get the global scope, enumerate the SymTagFunction symbols, and read each function's name and length]

I've uploaded this code to github here: CodeSizer. It shows the undecorated function names and their size in bytes in descending order of size. It also has options to show all functions or specific functions or functions that match the specified sub-string.

Going back to our earlier example of calculating the size of the function DestroyDirInfo_NoHash, the output from CodeSizer agrees with the manual calculation - 262 bytes:
> CodeSizer.exe  
> /f ..\..\FDiffDelete\Source\Debug\FDiffDelete.exe 
> /s DestroyDirInfo_NoHash
Function Name                  Size In Bytes
-------------                  -------------
DestroyDirInfo_NoHash          262
Enumerated #functions = 1

Sizes of all functions that have 'Build' as a sub-string:
> CodeSizer.exe 
> /f ..\..\FDiffDelete\Source\Debug\FDiffDelete.exe 
> /r Build
Function Name                  Size In Bytes
-------------                  -------------
BuildFilesInDir_NoHash         2620
BuildFilesInDir_Hash           2350
BuildDirTree_Hash              938
BuildDirTree_NoHash            938
BuildDirInfo                   889
BuildFilesInDir                107
BuildDirTree                   99
Enumerated #functions = 7

Thus, a manual task becomes an easy automation that can be reused in the future. This is what we programmers are good at, isn't it? We are lazy, and that's a good thing, in a way!

Peeking Into cryptbase.dll

I was curious to look at the sizes of some system DLLs and chose to look into the AES functions in cryptbase. A strange thing I noticed while doing this is that I couldn't get function information from this DLL (or from ntdll.dll) using the same technique as for my own DLLs. I had to look for public symbols in the PDB's global scope and then filter in only the functions. My guess is that the PDBs for system DLLs do not carry debug information for non-exported functions. Here are the top 10 largest AES functions in cryptbase.dll:

Function Name                  Size In Bytes
-------------                  -------------
AesExpandKey                   864
AesCbcDecrypt                  832
AesCbcEncrypt                  752
AesDecrypt                     656
AesEncrypt                     640
AesCtrRng_Generate             448
AesCtrRng_Instantiate          292
AesCtrRng_Update               292
AesCtrRng_Reseed               216
AesCtr_safe_startup            172

What's Next?

The DIA SDK, although very powerful, is quite tedious to use because it involves a lot of function calls to get anything out. Building an abstraction layer on top of it, as in the CodeSizer project, makes it simpler. Such an abstraction could be used as the DIA front-end in a debugger, since debuggers make heavy use of PDB files to show debug information.

October 16, 2014

Interlocked Instructions

In this post, I will discuss the implementation of the InterlockedIncrement API (and its counterpart, InterlockedDecrement) found in the Win32 SDK. I will also talk about how it is used in COM to track object lifetimes. This API is declared in WinBase.h and exported by kernel32.dll, as specified on MSDN. It is one of the many Interlocked functions available in the Win32 SDK. These APIs help in implementing various synchronization mechanisms such as semaphores.

I first came across InterlockedIncrement when going through a COM application's source code. Every COM class is derived from the IUnknown interface and this interface has AddRef and Release functions. As specified in the MSDN article, these functions are together used to keep track of a COM object's lifetime by the reference counting mechanism. A COM server is designed such that a single server object can be used by more than one client. Whenever a COM client obtains a pointer to a COM server object, a call to AddRef must be made and when the object is no longer needed, a call to Release must be made. This way, the server can keep track of how many references to its object are present and can clean itself up when the reference count goes down to zero. The below snippet of pseudo-code should give you an idea of how this is implemented:

class ComServer : public IUnknown {
    LONG nRefs;
    ULONG AddRef()  { return InterlockedIncrement(&nRefs); } // ++nRefs
    ULONG Release() {
        ULONG cRefs = InterlockedDecrement(&nRefs);          // --nRefs
        if (cRefs == 0) delete this;                         // last reference gone
        return cRefs;
    }
    HRESULT QueryInterface(clsid, **ppObject)
    {
        ...
        AddRef();
        return pointer to ComServer object
    }
}

Since COM objects are designed to be used by multiple clients in a multi-threaded (and multi-processor) environment, the incrementing and decrementing of the counter variable (nRefs in the above example) must be synchronized. This is where the InterlockedIncrement and InterlockedDecrement functions come into the picture.

Implementation

According to MSDN, InterlockedIncrement "Increments (increases by one) the value of the specified 32-bit variable as an atomic operation." and the InterlockedDecrement function decreases the value of its argument by one. Let us see how these two functions ensure that operations on their arguments are atomic/synchronized. Looking at the assembly code of the function call and implementation of InterlockedIncrement:
[Disassembly: the call to InterlockedIncrement]

[Disassembly: the implementation of InterlockedIncrement]

The function call is a regular call to a system API. The function implementation is just 5 instructions. The increment is done by a single instruction - the xadd instruction, which has a lock prefix. The Intel Developer's Manual Vol. 2 says about xadd: "This instruction exchanges the first operand (destination) with the second operand (source), then loads the sum of the two values into the destination operand." So ecx contains the address of the memory variable to be incremented and eax contains 1 before the xadd instruction. The xadd instruction does this:

Assert LOCK signal
        |
 temp = eax + *ecx
        |
    eax = *ecx
        |
   *ecx = temp

We can see that even though it is a single assembly instruction, it involves three steps. The LOCK signal assertion is the one that ensures the instruction executes as an atomic operation.

Looking at the same document for the lock prefix: "Asserts LOCK# signal for duration of the accompanying instruction. In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted".

What this means is that when the LOCK signal is asserted, no other processor or bus agent (which could be a memory-mapped IO device) can read/write main memory. This ensures that the read-modify-write of the variable being incremented is atomic. All other processors and bus agents must wait until this instruction is complete before accessing main memory. This very simplified diagram of the system architecture should make it easy to understand:


What If The Variable is Cached?

There is a possibility that the memory variable to be incremented/decremented is cached in the processor's cache. In this case, the processor's cache coherency mechanism ensures that the operation is atomic. Intel's developer's manual Vol. 3 states that recent processors do not even assert the LOCK signal if the variable is in the cache, instead using just the cache-locking mechanism (a technique to achieve cache coherency).

A Look At The 64bit Implementation

The assembly code above is from a 32bit build of the source code. I was curious whether the 64bit implementation was any different. So I changed the architecture in Visual Studio, compiled the code, and got this:

No function call?? What's going on? I then looked at the WinBase.h header and found that this function is declared with the __forceinline compiler directive, which explains why there is no function call. Simply, the address of nRefs is loaded into rax and the inc instruction increments the variable. However, InterlockedIncrement is supposed to return the updated value as the function return value, and those two instructions don't do that. So I modified the source code to save the return value to a temporary variable to see what the compiler does:

long tmp = InterlockedIncrement(&nRefs);

This time I saw an implementation similar to the 32bit version, but inlined:





How Does Decrement Work?

There is no subtract equivalent of the xadd instruction, so what does the compiler do? As you can see below, it is quite clever - it just uses xadd with -1, so it essentially achieves *ecx = *ecx + (-1).



October 7, 2014

So You Want to Write An API?

The topic of this post is the design of an API. These are things I learnt in the course of writing the CHelpLib library. As most programmers know, API stands for Application Programming Interface. An API consists of a set of functions and data structures that provide some functionality to the user of the API. The most common example is the interface offered by an OS: all the functions that are part of the OS, exposed to kernel-mode components like device drivers and user-mode applications such as MS Word, make up the API of the OS. We all know the general guidelines around writing functions and designing data structures - clear functionality, keeping it simple, following naming conventions and so on. Many good programmers do follow these in developing applications, of course, but they are probably much more important in the design and implementation of an API, because an API has the potential to be used much more widely since it is exposed to the public. Compare this to a module within a particular product, which is used only within the product itself and needs to be understood only by the developers working on that product.

Some points that I incorporated into the CHelpLib library's APIs:

  • Consistent experience in the usage of all the APIs within the library. This has two aspects: Consistent API usage experience across all functions within the library and consistency with the industry or platform specific conventions.
  • Consistency with the industry/platform makes it easy for the programmer using the API. In my case, this was the Windows platform programming conventions since I was developing for this platform. Consistency of the API includes:
    • Object(data structures) creation, usage and disposal semantics must be similar across the library.
    • Function signatures and naming must be consistent and follow the Windows API model.
    • All functions must be consistent in the way they handle memory allocation/deallocation. This responsibility can be left entirely to the caller or, as is more common, shared between the caller and the callee.
    • All functions within the API must follow the same method of error/failure reporting to the caller.
  • A public API must be very well documented and the documentation must be up to date. This means the documentation must include what exactly each function does, how it is supposed to be used, what are the failure scenarios and so on.
  • All known bugs must be documented and made visible to the users.
  • An API/library is meant to provide functionality with as little usage complexity as possible. To this end, the API must expose very few implementation details (these must be documented, however).
  • If the API consists of functions that provide functionality such as algorithms and data structures: the algorithm design, memory requirements and expected performance should be specified so that the users can make better choices and use the API in a better way.

The most important changes I made to the CHelpLib library in order to incorporate the above points:

  • Simplifying the naming of all functions and avoiding naming collisions with system functions. All function names follow this pattern: "CHL_{module_name}{function_name}". For example, CHL_DsCreateHT - 'Ds' indicates DataStructures, 'CreateHT' indicates that the function creates a hashtable.
  • Standardizing the error/failure reporting using HRESULT as the return type of most functions. This type is the recommended return type on the Windows platform.
  • Since the return type is used only to indicate function success or failure, programmatic return values are passed back to the caller via 'out' parameters, i.e., using pointer parameters. For example: CHL_DsCreateHT(**pHTableOut, nEstEntries, keyType, valType, fValInHeapMem). This function creates a hashtable and returns a pointer to the hashtable object in the first parameter pHTableOut.
  • Each module has its own header file. API users only need to include the used module's header file. Earlier, all exported functions and data structures were specified within a single header file.
  • Usage of Microsoft's SAL2(Source Annotation Language) to annotate the function parameters. This enables better static code checking and also conveys the meaning of the parameters better.
  • The data structures - hashtable, linkedlist and queue - all have identical support for storage and retrieval of values (and keys).
  • More test code was written to test the new changes and older tests provided the necessary regressions.
  • Writing documentation for the full API. This is still a work in progress.
The code for CHelpLib along with the Test code and documentation can be found at Github: https://github.com/shishir993/CHelpLib. Future work for this particular library will see additions in the area of data structures, algorithms and some stuff around Windows process access APIs.

September 7, 2014

Learnings From A Diffing Tool

TL;DR - Source code and screenshots.

Motivation

This post is about the things I learned in the course of developing a folder-diffing tool I wrote recently as a pastime project. A couple of weeks back, I was wondering whether there was any tool I could use to diff two folders to find duplicate files and delete them from either folder. I wanted to see if I had any duplicates among the files that I had copied from my friend. So I started thinking about a diffing tool that offers these capabilities. I could think of BeyondCompare and a few other diffing tools, but off the top of my head I couldn't think of any tool that could diff folders and help me delete the duplicates from either of the folders.

I said to myself that this was the perfect opportunity for a project! So I went home that day and immediately listed my requirements for the diffing tool. I decided to call it 'FDiffDelete' - Folder Diff and Delete. The initial set of requirements I came up with was:

  1. Compare two folders (a left folder and a right folder) and highlight the common/duplicate files (name and size match - later, use hash).
  2. Give the user the choice to delete the common/duplicate files from either folder.
  3. Ver1.0: Only single-level files in a folder. Later, recursive compare.
  4. UI stuff: A dialog box split into two halves. Each half has a text area to specify a folder path and a list view to display the files in that folder. Each half also has a delete button that lets the user delete selected files (or all duplicate files) from that folder.
I later realized that I should compare files based on their last-modified date as well, so I added that to the requirements list:
  • Duplicate files are determined based on name, size, and last-modified date (later, optionally by using the file hash (SHA1)).

Nitty gritty design details

With the requirements set, I started designing the tool. I needed a data structure to actually hold information about the files in a folder and to also allow comparisons between files from the two folders. The very first thing that came to mind was a hash table to store file information of all files in the folder.

Pros and cons of hash table:

  • + Fast inserts and lookups.
  • + Pretty straightforward implementation in C. There is no need for complicated stuff like the rebalancing a binary tree requires.
  • + I already had a hash table implementation in my CHelpLib project :-) .

  • - Not very space efficient when the size of the hash table is much larger than the actual number of items in it, i.e., when it is sparse.
  • - Iterating through a sparse hash table wastes CPU cycles: you have to traverse all slots in the table to find the ones with valid entries.
  • - A hash table is only as efficient as its hash function. A bad one introduces too many key conflicts and leads to bad performance.
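To illustrate the last point, here is a classic example of a simple but reasonably well-distributing string hash: djb2, attributed to Dan Bernstein. This is only an illustration of what a decent hash function looks like; it is not necessarily the function CHelpLib uses.

```c
#include <stddef.h>

/* djb2: hash = hash * 33 + c for each character, seeded with 5381. */
unsigned long djb2_hash(const char *psz)
{
    unsigned long hash = 5381;
    int c;

    while ((c = (unsigned char)*psz++) != 0) {
        hash = hash * 33 + (unsigned long)c;
    }
    return hash;
}

/* Map the hash onto a table slot. */
size_t slot_for(const char *psz, size_t nSlots)
{
    return (size_t)(djb2_hash(psz) % nSlots);
}
```

Keys that differ by a single character land in different slots with high probability, which is exactly what keeps the conflict rate (and lookup time) low.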

One other data structure to consider was a binary search tree. It offers O(log n) average insert and lookup times and is more space efficient than a hash table. However, there is also the implementation complexity of re-balancing a binary tree on inserts, and I did not have a binary tree implementation in C. So I finally decided the hash table was the way to go: (key, value) = (filename, fileinfo-struct). This hash table would be associated with a folder object (DIRINFO) holding all files within the folder.

The folder info(DIRINFO) and file info(FILEINFO) structures used in the code:

// Structure to hold the files that have same name within
// the same directory tree. This is required because the hashtable
// in DIRINFO uses filename as key, which means there will be name
// conflicts.
typedef struct _dupWithin
{
    int nCurFiles;
    int nCurSize;
    PFILEINFO *apFiles;
}DUPFILES_WITHIN, *PDUPFILES_WITHIN;

typedef struct _DirectoryInfo
{
    WCHAR pszPath[MAX_PATH];
    int nDirs;
    int nFiles;

    // Was hash compare used when this DIRINFO was built?
    BOOL fHashCompare;

    // Use a hashtable to store file list.
    // Key is the filename, value is a FILEINFO structure - if hash compare is turned OFF
    // Key is hash string, value is a PCHL_LLIST that is the
    // linked list of files with the same hash string - if hash compare is turned ON
    CHL_HTABLE *phtFiles;

    // List of FILEINFO of files that have the same name in 
    // the same dir tree. - if hash compare is turned OFF
    DUPFILES_WITHIN stDupFilesInTree;

}DIRINFO, *PDIRINFO;

// Structure to hold information about a file
typedef struct _FileInfo
{
    BOOL fIsDirectory;
    BOOL fAccessDenied;

    // This structure must know about the duplicacy of a file
    // because otherwise the directory must hold an additional
    // list of duplicate files.
    BYTE bDupInfo;
    
    LARGE_INTEGER llFilesize;
    BYTE abHash[HASHLEN_SHA1];
    SYSTEMTIME stModifiedTime;
    FILETIME ftModifiedTime;

    WCHAR szPath[MAX_PATH];
    WCHAR szFilename[MAX_PATH];

}FILEINFO, *PFILEINFO;
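The DUPFILES_WITHIN structure above is a growable array tracked by a count (nCurFiles) and a capacity (nCurSize). A minimal sketch of how appending to such an array might work is shown below; the struct fields are trimmed stand-ins and the AddDupFile helper is hypothetical, not the tool's actual code.

```c
#include <stdlib.h>

/* Trimmed stand-ins for the structures above. */
typedef struct _FileInfoMin {
    char szFilename[64];
} FILEINFO_MIN;

typedef struct _dupWithinMin {
    int nCurFiles;          /* entries in use */
    int nCurSize;           /* allocated capacity of apFiles */
    FILEINFO_MIN **apFiles; /* growable array of file-info pointers */
} DUPFILES_WITHIN_MIN;

/* Hypothetical append helper: double the array when full.
 * Returns 0 on success, non-zero on failure. */
int AddDupFile(DUPFILES_WITHIN_MIN *pDup, FILEINFO_MIN *pFile)
{
    if (pDup->nCurFiles >= pDup->nCurSize) {
        int nNewSize = (pDup->nCurSize == 0) ? 4 : pDup->nCurSize * 2;
        FILEINFO_MIN **apNew =
            realloc(pDup->apFiles, (size_t)nNewSize * sizeof(*apNew));
        if (apNew == NULL) {
            return 1; /* out of memory; original array is untouched */
        }
        pDup->apFiles = apNew;
        pDup->nCurSize = nNewSize;
    }
    pDup->apFiles[pDup->nCurFiles++] = pFile;
    return 0;
}
```

Doubling the capacity keeps appends O(1) amortized, which matters little here given how few same-name duplicates are expected, but costs nothing to do right.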

And so I started writing the code and reached a good first version that was able to find duplicates based on a file's name, size, and last-modified date. I was even able to use it to find duplicates and delete them. Great! The tool served its initial purpose. Next, I wanted to implement recursive diff, i.e., diff a full folder tree and not just one level. Traversing the folder tree is easy: a BFS is quite trivial once you have a queue data structure. I did not have a queue implementation, but I did have a linked-list implementation, so I added a queue to my CHelpLib project using the linked list as the underlying data structure. Now I had to tackle the problem of how to store the file information while recursively traversing the folder tree. Two options came to mind:

  1. Use a tree of folder-info structs to hold information about the files in each folder, with the specified folder as the root. This brings the complication of finding duplicate files across two trees. I vaguely remembered a sub-graph comparison problem from my algorithms class. But wait, this wasn't even a sub-graph comparison problem. It was more trivial than that: how do I search for a file by name across the two folder trees with an algorithm better than O(n*n)? Remember that a duplicate file could be in any folder of the tree rooted at the specified folder.
  2. Use only one hash table, as described earlier, to store the file information from all folders in the tree. This offers the same fast inserts, and finding duplicate files is trivial since the file name is the key.
I chose the second option of using a hash table to store file info, so files from all folders in the tree would live in a single hash table. I started writing the code and soon hit another issue: there could be files with the same name within one folder tree. This is a common occurrence - for example, music files named "Track 1", "Track 2", and so on. As we all know, a hash table cannot handle duplicate keys; depending on the implementation, it will either overwrite the existing value or fail to insert. So I had to come up with a way to store duplicate file names within a folder tree. I chose to put the duplicate files in a separate array of file-information structs. Lookups in this array are O(n) worst case (inserts are O(1)), but I do not expect many duplicate file names in the most common scenario. This worked well and was pretty fast too: I was able to diff the System32 folder, with ~2k files, in under 2 seconds on my Core i7 laptop.
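The BFS mentioned above needs a FIFO queue. A minimal linked-list-backed queue in the spirit of the one added to CHelpLib might look like this (the names QueueInsert/QueueDelete are illustrative; the real CHelpLib API differs). During traversal, each discovered subfolder path is enqueued and folders are processed in the order they were found.

```c
#include <stdlib.h>

typedef struct _qnode {
    void *pVal;
    struct _qnode *pNext;
} QNODE;

typedef struct _queue {
    QNODE *pHead; /* dequeue from the head */
    QNODE *pTail; /* enqueue at the tail */
    int nItems;
} QUEUE;

/* Enqueue at the tail. Returns 0 on success, non-zero on failure. */
int QueueInsert(QUEUE *pq, void *pVal)
{
    QNODE *pNode = calloc(1, sizeof(*pNode));
    if (pNode == NULL) {
        return 1;
    }
    pNode->pVal = pVal;
    if (pq->pTail == NULL) {
        pq->pHead = pq->pTail = pNode;
    } else {
        pq->pTail->pNext = pNode;
        pq->pTail = pNode;
    }
    ++pq->nItems;
    return 0;
}

/* Dequeue from the head. Returns the oldest item, or NULL if empty. */
void *QueueDelete(QUEUE *pq)
{
    QNODE *pNode = pq->pHead;
    void *pVal;

    if (pNode == NULL) {
        return NULL;
    }
    pq->pHead = pNode->pNext;
    if (pq->pHead == NULL) {
        pq->pTail = NULL;
    }
    pVal = pNode->pVal;
    free(pNode);
    --pq->nItems;
    return pVal;
}
```

Keeping both head and tail pointers makes enqueue and dequeue O(1), so the traversal cost is dominated by the filesystem calls, not the queue.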

Can I Haz Moar Features?

The next feature to add was file comparison using a hash of the file content (a file digest). I decided to use SHA1 as my hashing function since MD5 has a higher probability of hash collisions. Although a SHA1 hash collision has also been found, and I should probably use SHA2? I wanted to balance speed and collision resistance, and SHA1 is slightly faster than SHA2, however small the difference. I know, I know, I didn't put too much effort into thinking through this part; I can change the hash function easily enough. Anyway, coming back to the feature. To find duplicate files using their hashes, the file names no longer matter; only the file digest does. So the file name cannot serve as the key of the hash table; instead, the key has to be the 160-bit (20-byte) hash value of a file, represented as a string. This brings another issue: files with the same hash (duplicate files) within the same folder tree. I solved this by making the value of each hash table entry a list of file-info structs. So the hash table for the hash-comparison feature was (key, value) = (hash-value-as-string, list-of-file-info-structs).
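Turning the 20-byte digest into a string key is a small but necessary step. A sketch of that conversion is below; computing the SHA1 digest itself (e.g., via the Windows CryptoAPI) is omitted, and DigestToString is an illustrative helper name, not the tool's actual function.

```c
#include <stdio.h>

#define HASHLEN_SHA1 20

/* Convert a raw digest into a lowercase hex string so it can be used
 * as a hash table key. The caller provides a buffer of at least
 * (2 * nBytes + 1) characters. */
void DigestToString(const unsigned char *abHash, int nBytes, char *pszOut)
{
    int i;
    for (i = 0; i < nBytes; ++i) {
        sprintf(pszOut + 2 * i, "%02x", abHash[i]);
    }
    pszOut[2 * nBytes] = '\0';
}
```

For a full SHA1 digest the key buffer needs 2 * HASHLEN_SHA1 + 1 = 41 characters.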

By this time, the code had gotten quite complicated, mainly due to the two very different ways of using the hash table to store file information: one using the file name as the key and the other using the file's hash value. So I decided to separate the folder traversal for the usual comparison from that of the hash comparison feature. This made it much easier to maintain and read the code.

Lessons learned

  • Data structures used can make or break a program. They can either make a program fast or painfully slow, a memory hogger or space efficient, easy to maintain vs. PITA spaghetti code. Choose wisely.
  • Start with an initial set of requirements and design accordingly. As you add more requirements, remember to go back to the design board, since you may have to rethink the whole design (worst case) when new requirements are added.
  • Object oriented concepts are your friend in making any challenging problem easy to solve and implement in code. Encapsulation and abstraction are very powerful and quite sufficient. You do not necessarily have to use polymorphism and inheritance all the time.

Source code is on github at: https://github.com/shishir993/fdiffdelete. Built using Microsoft VS2010 Professional and Windows SDK 8.100.26837. Dependent on my other C library CHelpLib.

Some screenshots of the tool at work: