Pages

October 23, 2015

Using CRITICAL_SECTION internals to debug

I was recently faced with debugging a not-very-frequently-occurring bug which was causing a crash in a user-mode application whenever it was hit. I had couple of memory dumps from the crashes. Looking at the source code didn't get me anywhere because I couldn't trace through any code path that could cause this. This was C++ code and I noticed that the crash was an access violation in a class member function, because 'this' object was already deleted while the member function was still on the stack.

For the sake of this post, I'm going to call the class BadClass and the access violating member function funcAV(). BadClass has a critical section data member that I'll call m_csBadClass. funcAV was entering the critical section as its very first line of code, then checking if the m_fShtudown boolean data member was set and leaving the critical section just before returning from the function. The function called several other functions while inside the critical section. The class also had a shutdown() function that was always invoked before deletion and that entered the same critical section, set the m_fShutdown flag and returned.

void BadClass::funcAV() {
    EnterCriticalSection(&m_csBadClass)
    if (!m_fShutdown) {
        // do some work
        // call other functions
        // use data member, say, m_intA <-- access violation reading this
    }
    LeaveCriticalSection(&m_csBadClass)
}

void BadClass::shutdown() {
    EnterCriticalSection(&m_csBadClass)
    m_fShutdown = true
    LeaveCriticalSection(&m_csBadClass)    
}

Since the shutdown() function was always called before object deletion and knowing that shutdown() entered the m_csBadClass critical section, I concluded that the object was deleted by the same thread where funcAV() is executing because the object was deleted while funcAV() was inside the critical section. That is, some callee-of-callee-of-funcAV was deleting the object. Remember that critical sections can be entered recursively by the same thread. You must be wondering now, "Dude, why can't you just place a breakpoint/assertion in the class destructor to see who was calling the destructor?". I could but not in this case because there were legitimate cases where the destructor is called and everything is fine in this scenario. I had to break only when the object was being deleted while funcAV() was still on the callstack and was inside the critical section.

Knowing that the only way for the object of BadClass to get deleted is if the critical section was being entered recursively, I placed an assertion in BadClass::shutdown() like this:

void BadClass::shutdown() {
    EnterCriticalSection(&m_csBadClass)
    NT_ASSERT(m_csBadClass.RecursionCount < 2)
    m_fShutdown = true
    LeaveCriticalSection(&m_csBadClass)    
}

Then I let the code run on our test environment overnight and I had couple of crash dumps the next morning. These new crashes, to my relief, were from the assertion! Now I got to know the exact code path that was causing the object to get deleted and the fix was easy knowing the root cause. The code path was something like this:

ntdll!breakpoint <-- for the assertion
myDll!BadClass::shutdown
... couple of other functions
myDll!ClassC::func2
myDll!ClassB::func1
myDll!BadClass::funcAV
ntdll!thread-start-function

So knowing the internals of the CRITICAL_SECTION structure certainly made figuring out this bug easy. Other fields in the structure also offer a lot of insight that may one day be useful -

struct _RTL_CRITICAL_SECTION {
    PRTL_CRITICAL_SECTION_DEBUG DebugInfo;
    LONG LockCount;
    LONG RecursionCount;
    HANDLE OwningThread;
    HANDLE LockSemaphore;
    ULONG_PTR SpinCount;
}

DebugInfo - Pointer to a RTL_CRITICAL_SECTION_DEBUG structure which gives more information about this critical section. This structure is actually part of a linked list whose elements are the debug structures of all critical sections used in the current process. This will help in detecting deadlocks during runtime and for other debugging purposes.

LockCount - This field name is a little misleading because it actually more than one datum.
  • Bit0 - Whether the critical section is unlocked (1) or locked (0)
  • Bit1 - Whether the critical section has woken up a waiting thread (0) or not (1)
  • Bit2 thru 31 - #threads waiting on this critical section
RecursionCount - How many times has this thread entered this critical section recursively.

OwningThread - ThreadID of the thread that currently owns this critical section. Zero if the critical section is unlocked, i.e., not owned by any thread.

LockSemaphore - Misleading field name. Usually zero but in cases where it isn't zero, it is a HANDLE to an event used by the kernel to signal a waiting thread when another thread leaves this critical section. That is, the kernel internally uses this event to wake up one of the 'waiting' threads when the contented critical section becomes free. I've observed that this always -1 (see example below) after introducing a contention situation so I will dig up some more info on how exactly this can be an event handle.

SpinCount - Putting a thread into waiting state during contention for the critical section is quite an expensive operation because it involves transitioning into kernel mode. If we know that a particular critical section will be held for very very short periods at a time but multiple threads use this critical section, the SpinCount can be given a valid positive number, say 1000000. This indicates that when there is contention, the thread must enter a loop until the critical section is free or the number of loop iterations reaches the SpinCount value. This is essentially a busy-wait in user-mode to see if the contention gets resolved. This avoids transitioning into kernel-mode for a critical section that could get released in the next moment.

Sample program to see these some of these fields in action:


#include <Windows.h>
#include <stdio.h>

#define NUM_CONTENDING_THREADS      3

#define CS_LOCKCOUNT_UNLOCKED(lc)   ((lc) & 0x00000001)             // bit0
#define CS_LOCKCOUNT_TH_WOKEN(lc)   (((lc) & 0x00000002) >> 1)      // bit1
#define CS_LOCKCOUNT_CONTENTION(lc) (((~(lc)) & 0xfffffffc) >> 2)   // bit2 to 31


void DumpCSInfo(PCWSTR pszMsg, PCRITICAL_SECTION pcs)
{
    wprintf(L"[%5u] %-16s contention = %u, locked = %-3s, woken = %-3s, lockSem = 0x%p\n",
        GetCurrentThreadId(),
        pszMsg,
        CS_LOCKCOUNT_CONTENTION(pcs->LockCount),
        CS_LOCKCOUNT_UNLOCKED(pcs->LockCount) ? L"no" : L"yes",
        CS_LOCKCOUNT_TH_WOKEN(pcs->LockCount) ? L"no" : L"yes",
        pcs->LockSemaphore);
}

DWORD WINAPI GrabCriticalSection(LPVOID lpParameter)
{
    PCRITICAL_SECTION pcs = (PCRITICAL_SECTION)lpParameter;
    
    DumpCSInfo(L"Waiting", pcs);

    EnterCriticalSection(pcs);
    
    DumpCSInfo(L"Inside", pcs);
    Sleep(5000);

    LeaveCriticalSection(pcs);
    DumpCSInfo(L"Left", pcs);

    return ERROR_SUCCESS;
}

int main()
{
    DWORD tid;
    HANDLE hThreads[NUM_CONTENDING_THREADS];
    
    CRITICAL_SECTION cs;

    WCHAR szMsg[16];

    InitializeCriticalSection(&cs);
    DumpCSInfo(L"Init", &cs);

    EnterCriticalSection(&cs);
    DumpCSInfo(L"Inside", &cs);

    for (int i = 0; i < ARRAYSIZE(hThreads); ++i)
    {
        hThreads[i] = CreateThread(NULL, 0, GrabCriticalSection, &cs, CREATE_SUSPENDED, &tid);
    }

    for (int i = 0; i < ARRAYSIZE(hThreads); ++i)
    {
        ResumeThread(hThreads[i]);
        Sleep(1000);
        swprintf_s(szMsg, ARRAYSIZE(szMsg), L"R th %d", i);
        DumpCSInfo(szMsg, &cs);
    }

    Sleep(2000);
    LeaveCriticalSection(&cs);
    DumpCSInfo(L"Left", &cs);

    WaitForMultipleObjects(ARRAYSIZE(hThreads), hThreads, TRUE, INFINITE);

    DumpCSInfo(L"Exiting", &cs);
    return 0;
}

Output:

[10192] Init     contention = 0, locked = no , woken = no , lockSem = 0x0000000000000000
[10192] Inside   contention = 0, locked = yes, woken = no , lockSem = 0x0000000000000000
[ 5676] Waiting  contention = 0, locked = yes, woken = no , lockSem = 0x0000000000000000
[10192] R th 0   contention = 1, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[ 9744] Waiting  contention = 1, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10192] R th 1   contention = 2, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10188] Waiting  contention = 2, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10192] R th 2   contention = 3, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10192] Left     contention = 2, locked = no , woken = yes, lockSem = 0xFFFFFFFFFFFFFFFF
[ 5676] Inside   contention = 2, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[ 5676] Left     contention = 1, locked = no , woken = yes, lockSem = 0xFFFFFFFFFFFFFFFF
[ 9744] Inside   contention = 1, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[ 9744] Left     contention = 0, locked = no , woken = yes, lockSem = 0xFFFFFFFFFFFFFFFF
[10188] Inside   contention = 0, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10188] Left     contention = 0, locked = no , woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10192] Exiting  contention = 0, locked = no , woken = no , lockSem = 0xFFFFFFFFFFFFFFFF