In this post, I will discuss the implementation of the InterlockedIncrement API (and its counterpart, InterlockedDecrement) found in the Win32 SDK. I will also talk about how it is used in COM to track object lifetimes. This API is declared in WinBase.h and exported by kernel32.dll, as specified in MSDN. It is one of the many Interlocked functions available in the Win32 SDK. These APIs help in implementing various synchronization mechanisms such as semaphores.
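As a quick illustration of the basic usage (this is my own minimal sketch, not code from the SDK docs), several threads can bump a shared counter without taking any explicit lock:

    #include <windows.h>
    #include <stdio.h>

    LONG g_counter = 0;                          // the 32-bit Interlocked functions operate on a LONG

    DWORD WINAPI Worker(LPVOID)
    {
        for (int i = 0; i < 100000; ++i)
            InterlockedIncrement(&g_counter);    // atomic ++g_counter
        return 0;
    }

    int main()
    {
        HANDLE threads[4];
        for (int i = 0; i < 4; ++i)
            threads[i] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
        WaitForMultipleObjects(4, threads, TRUE, INFINITE);
        printf("%ld\n", g_counter);              // always prints 400000 thanks to the atomic increments
        return 0;
    }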
I first came across InterlockedIncrement when going through a COM application's source code. Every COM class derives from the IUnknown interface, and this interface has the AddRef and Release functions. As specified in the MSDN article, these functions are used together to keep track of a COM object's lifetime through reference counting. A COM server is designed such that a single server object can be used by more than one client. Whenever a COM client obtains a pointer to a COM server object, a call to AddRef must be made, and when the object is no longer needed, a call to Release must be made. This way, the server can keep track of how many references to its object exist and can clean itself up when the reference count drops to zero. The snippet of pseudo-code below should give you an idea of how this is implemented:
class ComServer : public IUnknown {
    LONG nRefs;   // reference count, shared by all clients

    ULONG AddRef()  { return InterlockedIncrement(&nRefs); }   // atomic ++nRefs
    ULONG Release()
    {
        ULONG refs = InterlockedDecrement(&nRefs);              // atomic --nRefs
        if (refs == 0)
            delete this;   // no clients left, clean up
        return refs;
    }

    HRESULT QueryInterface(REFIID riid, void **ppObject)
    {
        ...
        AddRef();
        // return a pointer to this ComServer object through *ppObject
    }
};
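On the client side, the same contract looks roughly like this (a hedged sketch of my own; GetComServer is a hypothetical factory function, not part of the post):

    IUnknown *pServer = NULL;
    if (SUCCEEDED(GetComServer(&pServer)))   // hypothetical helper; it calls AddRef() before returning
    {
        // ... use the object through pServer ...
        pServer->Release();                  // done with it; the server may now clean itself up
        pServer = NULL;
    }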
Since COM objects are designed to be used by multiple clients in a multi-threaded (and multi-processor) environment, the incrementing and decrementing of the counter variable (nRefs in the above example) must be controlled using some synchronization mechanism. This is where the InterlockedIncrement and InterlockedDecrement functions come into the picture.
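To see why a plain ++nRefs would not be enough (an illustrative sketch of my own, not from the post): the increment is really a load/add/store sequence, so two clients calling AddRef() at the same time can interleave and lose an update.

    LONG nRefs = 5;
    nRefs = nRefs + 1;              // load, add, store: another thread can sneak in between the load and the store
    InterlockedIncrement(&nRefs);   // the whole read-modify-write happens as one atomic step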
Implementation
According to MSDN, InterlockedIncrement "Increments (increases by one) the value of the specified 32-bit variable as an atomic operation.", and the InterlockedDecrement function decreases the value of its argument by one. Let us see how these two functions ensure that operations on their arguments are atomic. Looking at the assembly code of the function call and the implementation of InterlockedIncrement:
[Screenshot: Function Call]
[Screenshot: Function Implementation]
The function call is a regular call to a system API. The function implementation is just 5 instructions. The increment is done by a single instruction: xadd, which has a lock prefix. The Intel Developer's Manual Vol2 says this about xadd: "This instruction exchanges the first operand (destination) with the second operand (source), then loads the sum of the two values into the destination operand." So ecx contains the address of the memory variable to be incremented and eax contains 1 before the xadd instruction. The xadd instruction does this:
1. Assert the LOCK signal
2. temp = eax + *ecx
3. eax = *ecx
4. *ecx = temp
We can see that even though it is a single assembly instruction, it involves three data-movement steps. The assertion of the LOCK signal is what ensures the instruction executes as an atomic operation.
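Expressed as C-like pseudocode (my own sketch; lock_xadd is just an illustrative name), the locked xadd behaves as if this whole function ran atomically:

    // Semantics of "lock xadd [ecx], eax", assuming the body runs while the LOCK signal is asserted
    long lock_xadd(volatile long *dest, long value)
    {
        long old = *dest;       // eax = *ecx  (source receives the old destination value)
        *dest = old + value;    // *ecx = temp (destination receives the sum)
        return old;             // InterlockedIncrement then adds 1 to this to form its return value
    }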
Looking at the same document for the lock prefix: "Asserts LOCK# signal for duration of the accompanying instruction. In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted".
What this means is that when the LOCK signal is asserted, no other processor or bus agent (which could be a memory-mapped IO device) can read from or write to main memory. This ensures that the read-modify-write operation on the variable being incremented is atomic. All other processors and bus agents must wait until this instruction completes before accessing main memory. The very simplified diagram of the system architecture below should make this easy to understand:
What If The Variable is Cached?
There is a possibility that the memory variable to be incremented/decremented is already in the processor's cache. In this case, the processor's cache coherency mechanism ensures that the operation is atomic. Intel's developer's manual Vol3 states that recent processors do not even assert the LOCK signal if the variable is in the cache; instead, they just use the cache locking mechanism, which relies on the cache coherency protocol.
A Look At The 64bit Implementation
The assembly code above is from a 32bit build of the source code. I was curious whether the 64bit implementation was any different. So I changed the target architecture in Visual Studio, recompiled the code, and got this:
No function call?? What's going on? I then looked at the WinBase.h header and found that this function is declared with the __forceinline compiler directive, which explains why there is no function call. Simply, the address of nRefs is loaded into rax and the inc instruction increments the variable. However, InterlockedIncrement is supposed to return the updated value as its return value, and those two instructions don't do that. So I modified the source code to save the return value to a temporary variable to see what the compiler does:
long tmp = InterlockedIncrement(&nRefs);
This time I saw an implementation similar to the 32bit version, but inlined:
How Does Decrement Work?
There is no subtract equivalent of the xadd instruction, so what does the compiler do? As you can see below, it is quite clever: it simply uses xadd with -1, so it effectively computes *ecx = *ecx + (-1).
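The same trick can be expressed at the C level (a sketch of my own, using the documented InterlockedExchangeAdd API, which returns the value the variable held before the addition):

    LONG oldValue = InterlockedExchangeAdd(&nRefs, -1);   // atomically performs nRefs = nRefs + (-1)
    LONG newValue = oldValue - 1;                         // what InterlockedDecrement would have returned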