In this post, I will discuss the implementation of the InterlockedIncrement API (and its counterpart, InterlockedDecrement) found in the Win32 SDK. I will also talk about how it is used in COM to track object lifetimes. This API is declared in WinBase.h and exported by kernel32.dll, as specified in MSDN. It is one of the many Interlocked functions available in the Win32 SDK. These APIs help in implementing various synchronization mechanisms such as semaphores.
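As a quick illustration of the basic usage (this is my own minimal sketch, not code from the SDK docs), several threads can bump a shared counter without taking any explicit lock:

    #include <windows.h>
    #include <stdio.h>

    LONG g_counter = 0;                          // the 32-bit Interlocked functions operate on a LONG

    DWORD WINAPI Worker(LPVOID)
    {
        for (int i = 0; i < 100000; ++i)
            InterlockedIncrement(&g_counter);    // atomic ++g_counter
        return 0;
    }

    int main()
    {
        HANDLE threads[4];
        for (int i = 0; i < 4; ++i)
            threads[i] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
        WaitForMultipleObjects(4, threads, TRUE, INFINITE);
        printf("%ld\n", g_counter);              // always prints 400000 thanks to the atomic increments
        return 0;
    }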
I first came across InterlockedIncrement when going through a COM application's source code. Every COM class derives from the IUnknown interface, and this interface has the AddRef and Release functions. As specified in the MSDN article, these functions are used together to keep track of a COM object's lifetime through reference counting. A COM server is designed such that a single server object can be used by more than one client. Whenever a COM client obtains a pointer to a COM server object, a call to AddRef must be made, and when the object is no longer needed, a call to Release must be made. This way, the server can keep track of how many references to its object exist and can clean itself up when the reference count drops to zero. The snippet of pseudo-code below should give you an idea of how this is implemented:
class ComServer : public IUnknown {
    LONG nRefs;   // reference count, shared by all clients

    ULONG AddRef()  { return InterlockedIncrement(&nRefs); }   // atomic ++nRefs
    ULONG Release()
    {
        ULONG refs = InterlockedDecrement(&nRefs);              // atomic --nRefs
        if (refs == 0)
            delete this;   // no clients left, clean up
        return refs;
    }

    HRESULT QueryInterface(REFIID riid, void **ppObject)
    {
        ...
        AddRef();
        // return a pointer to this ComServer object through *ppObject
    }
};
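On the client side, the same contract looks roughly like this (a hedged sketch of my own; GetComServer is a hypothetical factory function, not part of the post):

    IUnknown *pServer = NULL;
    if (SUCCEEDED(GetComServer(&pServer)))   // hypothetical helper; it calls AddRef() before returning
    {
        // ... use the object through pServer ...
        pServer->Release();                  // done with it; the server may now clean itself up
        pServer = NULL;
    }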
Since COM objects are designed to be used by multiple clients in a multi-threaded (and multi-processor) environment, the incrementing and decrementing of the counter variable (nRefs in the above example) must be controlled using some synchronization mechanism. This is where the InterlockedIncrement and InterlockedDecrement functions come into the picture.
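To see why a plain ++nRefs would not be enough (an illustrative sketch of my own, not from the post): the increment is really a load/add/store sequence, so two clients calling AddRef() at the same time can interleave and lose an update.

    LONG nRefs = 5;
    nRefs = nRefs + 1;              // load, add, store: another thread can sneak in between the load and the store
    InterlockedIncrement(&nRefs);   // the whole read-modify-write happens as one atomic step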
Implementation
According to MSDN, InterlockedIncrement "Increments (increases by one) the value of the specified 32-bit variable as an atomic operation.", and the InterlockedDecrement function decreases the value of its argument by one. Let us see how these two functions ensure that operations on their arguments are atomic. Looking at the assembly code of the function call and the implementation of InterlockedIncrement:
[Screenshot: Function Call]
[Screenshot: Function Implementation]
The function call is a regular call to a system API. The function implementation is just 5 instructions. The increment is done by a single instruction: xadd, which has a lock prefix. The Intel Developer's Manual Vol2 says this about xadd: "This instruction exchanges the first operand (destination) with the second operand (source), then loads the sum of the two values into the destination operand." So ecx contains the address of the memory variable to be incremented and eax contains 1 before the xadd instruction. The xadd instruction does this:
1. Assert the LOCK signal
2. temp = eax + *ecx
3. eax = *ecx
4. *ecx = temp
We can see that even though it is a single assembly instruction, it involves three data-movement steps. The assertion of the LOCK signal is what ensures the instruction executes as an atomic operation.
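Expressed as C-like pseudocode (my own sketch; lock_xadd is just an illustrative name), the locked xadd behaves as if this whole function ran atomically:

    // Semantics of "lock xadd [ecx], eax", assuming the body runs while the LOCK signal is asserted
    long lock_xadd(volatile long *dest, long value)
    {
        long old = *dest;       // eax = *ecx  (source receives the old destination value)
        *dest = old + value;    // *ecx = temp (destination receives the sum)
        return old;             // InterlockedIncrement then adds 1 to this to form its return value
    }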
Looking at the same document for the lock prefix: "Asserts LOCK# signal for duration of the accompanying instruction. In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted".
What this means is that when the LOCK signal is asserted, no other processor or bus agent (which could be a memory-mapped IO device) can read from or write to main memory. This ensures that the read-modify-write operation on the variable being incremented is atomic. All other processors and bus agents must wait until this instruction completes before accessing main memory. The very simplified diagram of the system architecture below should make this easy to understand:
What If The Variable is Cached?
There is a possibility that the memory variable to be incremented/decremented is already in the processor's cache. In this case, the processor's cache coherency mechanism ensures that the operation is atomic. Intel's developer's manual Vol3 states that recent processors do not even assert the LOCK signal if the variable is in the cache; instead, they just use the cache locking mechanism, which relies on the cache coherency protocol.
A Look At The 64bit Implementation
The assembly code above is from a 32bit build of the source code. I was curious whether the 64bit implementation was any different. So I changed the target architecture in Visual Studio, recompiled the code, and got this:
No function call?? What's going on? I then looked at the WinBase.h header and found that this function is declared with the __forceinline compiler directive, which explains why there is no function call. Simply, the address of nRefs is loaded into rax and the inc instruction increments the variable. However, InterlockedIncrement is supposed to return the updated value as its return value, and those two instructions don't do that. So I modified the source code to save the return value to a temporary variable to see what the compiler does:
long tmp = InterlockedIncrement(&nRefs);
This time I saw an implementation similar to the 32bit version, but inlined:
How Does Decrement Work?
There is no subtract equivalent of the xadd instruction, so what does the compiler do? As you can see below, it is quite clever: it simply uses xadd with -1, so it effectively computes *ecx = *ecx + (-1).
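The same trick can be expressed at the C level (a sketch of my own, using the documented InterlockedExchangeAdd API, which returns the value the variable held before the addition):

    LONG oldValue = InterlockedExchangeAdd(&nRefs, -1);   // atomically performs nRefs = nRefs + (-1)
    LONG newValue = oldValue - 1;                         // what InterlockedDecrement would have returned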