October 16, 2014

Interlocked Instructions

In this post, I will discuss the implementation of the InterlockedIncrement API (and its counterpart, InterlockedDecrement) found in the Win32 SDK, and how it is used in COM to track object lifetimes. The API is declared in WinBase.h and exported by kernel32.dll, as documented on MSDN. It is one of the many Interlocked functions available in the Win32 SDK; these APIs help in implementing various synchronization mechanisms such as semaphores.
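For reference, the function has a very small surface area. Here is a minimal sketch of a call (the variable name is arbitrary and chosen just for this example):

#include <windows.h>

volatile LONG value = 5;   // arbitrary example variable

void Example()
{
    // Atomically increments 'value' and returns the incremented result.
    LONG newValue = InterlockedIncrement(&value);   // value becomes 6, newValue is 6
}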

I first came across InterlockedIncrement while going through a COM application's source code. Every COM object implements the IUnknown interface, and this interface has the AddRef and Release functions. As described in the MSDN documentation, these two functions are used together to keep track of a COM object's lifetime through reference counting. A COM server is designed so that a single server object can be used by more than one client. Whenever a COM client obtains a pointer to a COM server object, a call to AddRef must be made, and when the object is no longer needed, a call to Release must be made. This way, the server can keep track of how many references to its object are outstanding and can clean itself up when the reference count drops to zero. The snippet of pseudo-code below should give you an idea of how this is implemented:

class ComServer : public IUnknown {
    LONG nRefs;   // reference count; a LONG, as required by the Interlocked APIs
public:
    ULONG AddRef()  { return InterlockedIncrement(&nRefs); }   // atomic ++nRefs
    ULONG Release()
    {
        ULONG nLeft = InterlockedDecrement(&nRefs);            // atomic --nRefs
        if (nLeft == 0)
            delete this;   // reference count hit zero: no clients left, clean up
        return nLeft;
    }
    HRESULT QueryInterface(REFIID riid, void **ppObject)
    {
        ...
        AddRef();
        // hand back a pointer to this ComServer object through *ppObject
    }
};
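On the client side, the contract looks roughly like this. This is only a hypothetical sketch; GetServerObject is a made-up helper that is assumed to hand back an already AddRef'ed interface pointer:

#include <unknwn.h>

IUnknown *GetServerObject();   // made-up helper, for illustration only

void UseServer()
{
    IUnknown *pObj = GetServerObject();   // assume the callee already did an AddRef for us

    pObj->AddRef();                       // storing a second copy of the pointer, so add a reference
    IUnknown *pCopy = pObj;

    // ... use the object through pObj / pCopy ...

    pCopy->Release();                     // every reference is released exactly once
    pObj->Release();                      // count drops to zero here; the server object cleans itself up
}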

Since COM objects are designed to be used by multiple clients in a multi-threaded (and multi-processor) environment, incrementing and decrementing the counter variable (nRefs in the example above) must be synchronized. This is where the InterlockedIncrement and InterlockedDecrement functions come into the picture.
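To see why a plain ++ is not enough, here is a minimal sketch of two threads hammering a shared counter. The variable name, thread count and iteration count are arbitrary; only the use of InterlockedIncrement matters:

#include <windows.h>

volatile LONG g_counter = 0;   // shared counter, like nRefs above

DWORD WINAPI Worker(LPVOID)
{
    for (int i = 0; i < 100000; ++i)
    {
        // A plain g_counter++ compiles to a separate read, add and write, so two
        // threads can read the same old value and one of the increments gets lost.
        InterlockedIncrement(&g_counter);   // the whole read-modify-write is one atomic step
    }
    return 0;
}

int main()
{
    HANDLE hThreads[2];
    hThreads[0] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    hThreads[1] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    WaitForMultipleObjects(2, hThreads, TRUE, INFINITE);
    CloseHandle(hThreads[0]);
    CloseHandle(hThreads[1]);
    // With InterlockedIncrement the final value is always 200000; with a plain
    // g_counter++ it would usually come up short on a multi-core machine.
    return 0;
}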

Implementation

According to MSDN, InterlockedIncrement "Increments (increases by one) the value of the specified 32-bit variable as an atomic operation.", and InterlockedDecrement similarly decreases the value of its argument by one. Let us see how these two functions ensure that the operations on their arguments are atomic. Here is the assembly code of the function call and of the implementation of InterlockedIncrement:
Function Call

Function Implementation

The function call is a regular call to a system API. The function implementation is just 5 instructions. The increment itself is done by a single instruction: xadd, with a lock prefix. The Intel Developer's Manual Vol2 says this about xadd: "This instruction exchanges the first operand (destination) with the second operand (source), then loads the sum of the two values into the destination operand." Before the xadd instruction executes, ecx contains the address of the memory variable to be incremented and eax contains 1. The xadd instruction then does this:

Assert LOCK signal
        |
 temp = eax + *ecx
        |
    eax = *ecx
        |
   *ecx = temp

We can see that even though it is a single assembly instruction, it involves three steps. The LOCK signal assertion is the one that ensures the instruction executes as an atomic operation.
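In C terms, my understanding is that this boils down to something like the sketch below. It uses InterlockedExchangeAdd (the Win32 wrapper around lock xadd, which returns the value the variable had before the addition) purely for illustration; it is not the actual library source:

#include <windows.h>

LONG MyInterlockedIncrement(volatile LONG *pValue)
{
    // Atomically do *pValue += 1 and get back the old value (what xadd leaves in eax).
    LONG oldValue = InterlockedExchangeAdd(pValue, 1);
    // InterlockedIncrement is documented to return the new value, hence the +1
    // (presumably the final adjustment the real implementation makes before returning).
    return oldValue + 1;
}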

Looking at the same document for the lock prefix: "Asserts LOCK# signal for duration of the accompanying instruction. In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted".

What this means is that when the LOCK signal is asserted, no other processor or bus agent (which could be, say, a memory-mapped I/O device) can read from or write to main memory. This ensures that the read-modify-write operation on the variable being incremented is atomic. All other processors and bus agents must wait until this instruction is complete before accessing main memory. The very simplified diagram of the system architecture should make this easy to understand:


What If The Variable is Cached?

There is a possibility that the memory variable to be incremented/decremented is cached in the processor's cache. In this case, the processor's cache coherency mechanism ensures that the operation is atomic. Intel's Developer's Manual Vol3 states that recent processors do not even assert the LOCK signal if the variable is in the cache, and instead use just the cache locking mechanism (a technique built on cache coherency).

A Look At The 64bit Implementation

The assembly code above is from a 32bit build of the source code. I was curious whether the 64bit implementation was any different. So, after changing the target architecture in Visual Studio and recompiling the code, I get this:

No function call?? What's going on? I then looked at the WinBase.h header and found that on 64bit builds this function is declared with the __forceinline compiler directive, which explains why there is no function call. The address of nRefs is simply loaded into rax and the inc instruction increments the variable. However, InterlockedIncrement is supposed to return the updated value as its return value, and those two instructions don't do that. So I modified the source code to save the return value into a temporary variable to see what the compiler does:

long tmp = InterlockedIncrement(&nRefs);

This time I saw an implementation similar to the 32bit version, but inlined:





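For what it's worth, I believe the same inlined code can be reproduced by calling the compiler intrinsic directly; on 64bit builds the __forceinline wrapper ultimately maps to it (this is my reading of the headers, not something shown above):

#include <intrin.h>

volatile long g_refs = 0;   // example variable

long BumpRef()
{
    // _InterlockedIncrement is the MSVC intrinsic behind the __forceinline wrapper;
    // the compiler emits a lock-prefixed instruction inline instead of a call into
    // kernel32.dll, and the return value is the incremented count.
    return _InterlockedIncrement(&g_refs);
}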
How Does Decrement Work?

There is no subtract equivalent of the xadd instruction, so what does the compiler do? As you can see below, it is quite clever: it just uses xadd with -1, so it essentially achieves *ecx = *ecx + (-1).



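In the same spirit as the earlier increment sketch, the decrement case reduces to an atomic add of -1. Again, this is only an illustration of the idea, not the actual library source:

#include <windows.h>

LONG MyInterlockedDecrement(volatile LONG *pValue)
{
    // Atomically do *pValue += -1 and get back the old value.
    LONG oldValue = InterlockedExchangeAdd(pValue, -1);
    return oldValue - 1;   // return the new (decremented) value
}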