Jon Bentley talks about timing code in his book Programming Pearls. I was reading it recently and thought I'd time my hashtable implementation (source). Windows has two APIs that can help with this: QueryPerformanceFrequency and QueryPerformanceCounter. There are other ways to measure time, but I chose these for their high resolution. These APIs read values from the system's high-performance counter, if one is present. This counter is essentially a variable that increments many times per second; the frequency of increments depends on the specific system. MSDN notes that it can sometimes be the cycle rate of the processor's clock.
QueryPerformanceFrequency returns this frequency, and the value does not change while the system is up. My system, for example, has a counter that ticks 2143623 times per second, which translates to a resolution of 466.499940 nanoseconds (1/2143623 seconds per tick).
QueryPerformanceCounter returns the value of the counter at the moment it is called. By calling it immediately before and after a block of code, we can calculate the number of increments that took place while the code was running and convert that to a time value. Both APIs take a pointer to a LARGE_INTEGER variable. If we look in WinNT.h, we see that this is an 8-byte union: a QuadPart member holds the full 64-bit value, and LowPart/HighPart members give access to each 4-byte half. 64 bits gives plenty of room to hold the counter value.
Coming to the timing of the hashtable implementation: I timed only the insertion operation, using a DWORD for both key and value. I ran the tests with a different number of entries each time: 50000, 100000, 700000, 8388608 (1&lt;&lt;23), and 16777216 (1&lt;&lt;24). I initially included a run of 33554432 (1&lt;&lt;25), but it failed with an out-of-memory condition! By default, each 32-bit process is limited to 2GB of user-mode address space. Part of the scaffolding code I used for this is here:
LARGE_INTEGER freq = {0};
LARGE_INTEGER counterStart = {0};
LARGE_INTEGER counterEnd = {0};
LONGLONG elapsedCounts = 0;
double elapsedTime[10] = {0};
int numRuns = 0;
int sizes[MAX_RUNS] = { 50000, 100000, 700000, 1<<23, 1<<24 };
DWORD dwError = 0;
BOOL testSuccess[MAX_RUNS] = {TRUE};

// This retrieves the counts per second
if(!QueryPerformanceFrequency(&freq))
{
    dwError = GetLastError();
    wprintf(L"testHashtable(): QueryPerformanceFrequency() failed %d\n", dwError);
    return;
}

while(numRuns < MAX_RUNS)
{
    wprintf(L"Beginning run %d\n", numRuns);

    // Begin counter
    if(!QueryPerformanceCounter(&counterStart))
    {
        dwError = GetLastError();
        wprintf(L"testHashtable(): QueryPerformanceCounter() failed %d\n", dwError);
        return;
    }

    //
    // Code to test
    //
    testSuccess[numRuns] = testHashtable(HT_KEY_DWORD, HT_VAL_DWORD, sizes[numRuns], FALSE);

    // End counter
    if(!QueryPerformanceCounter(&counterEnd))
    {
        dwError = GetLastError();
        wprintf(L"testHashtable(): QueryPerformanceCounter() failed %d\n", dwError);
        return;
    }

    // Get the elapsed time
    elapsedCounts = counterEnd.QuadPart - counterStart.QuadPart;
    elapsedTime[numRuns] = (double)(elapsedCounts / (double)freq.QuadPart);

    ++numRuns;
}

wprintf(L"Performance counter ticks %I64u times per second\n", freq.QuadPart);
wprintf(L"Resolution is %lf nanoseconds\n", (1.0/(double)freq.QuadPart)*1e9);
wprintf(L"%16s %13s %19s %s\n-----------------------------------------------\n",
        L"RunSize", L"TimeSeconds", L"TimeMicro", L"Result");
for(numRuns = 0; numRuns < MAX_RUNS; ++numRuns)
{
    wprintf(L"%16d %5.8lf %16.3lf %s\n",
            sizes[numRuns], elapsedTime[numRuns], elapsedTime[numRuns] * 1e6,
            testSuccess[numRuns] ? L"PASSED" : L"FAILED");
}
This code gave the following output:
--[ Hashtable insertion operation ]--

** Release mode, running inside Visual Studio **
         RunSize   TimeSeconds           TimeMicro Result
---------------------------------------------------------------
           50000    3.25054825         3250548.254 PASSED
          100000    6.17252521         6172525.206 PASSED
          700000   45.90315601        45903156.012 PASSED
         8388608  648.03144909       648031449.093 PASSED
        16777216 1480.01800177      1480018001.766 PASSED
---------------------------------------------------------------

** Release mode, running outside in a command window **
         RunSize   TimeSeconds           TimeMicro Result
---------------------------------------------------------------
           50000    0.07000345           70003.447 PASSED
          100000    0.12546702          125467.025 PASSED
          700000    0.98070789          980707.895 PASSED
         8388608   11.25297219        11252972.188 PASSED
        16777216   19.48300751        19483007.506 PASSED
As you can see from the results above, the interesting lesson is how much processing time increases when running under a debugger. In these tests, running under the debugger was roughly 45x to 75x slower than running standalone. To see whether this holds even for simple code, I timed a system API, GetCurrentThread(), using the same technique. To get consistent times I executed the API multiple times in a for loop. Here are the results; we see a similar (though smaller, roughly 4x) increase in execution time. RunSize is the number of times GetCurrentThread() was executed in TimeSeconds seconds. Note that this API took only ~336 microseconds to execute 100,000 times on my Core i7 powered laptop. That is blazing fast!
--[ Multiple Calls to GetCurrentThread() ]--

** Release mode, inside Visual Studio **
         RunSize   TimeSeconds       TimeMicro
---------------------------------------------------------
             500    0.00000746           7.464
           50000    0.00062698         626.976
          100000    0.00125349        1253.485

** Release mode, running outside in a command window **
         RunSize   TimeSeconds       TimeMicro
---------------------------------------------------------
             500    0.00000187           1.866
           50000    0.00016747         167.473
          100000    0.00033635         336.346
This brings us to another interesting idea, one that is probably already in use: a program can apply this technique to check whether it is running under a debugger. I recall reading somewhere that IsDebuggerPresent() is not a fool-proof way of determining this, so timing checks may well be better. However, we must account for the fact that processing times vary widely depending on the system hardware and the system load at the time. What can be done is to draw up a table of expected processing times for a particular operation across various hardware configurations and system loads. Then, when our program is running, we time the same operation and compare the result against that table. The program must be able to determine the hardware configuration (the CPUID instruction comes to mind) and the system load (by querying Windows for the number of processes, memory used, and so on) to make a good comparison.