Reputation: 3727

Debugging Stack Corruption issue

I'm debugging an "Access violation" exception in a large C++ application (Visual Studio 2015). The application is built from several libraries and the problem occurs in one of them (SystemC), although I suspect the source of the problem is elsewhere.

What I see is a function-call that corrupts the address of a member function of the caller.

m_update_phase = true;
m_prim_channel_registry->perform_update();
m_update_phase = false;

inline
void
sc_prim_channel_registry::perform_update()
{
    for( int i = m_update_last; i >= 0; -- i ) {
        m_update_array[i]->perform_update();
    }
    m_update_last = -1;
}

(These are excerpts from systemc\kernel\sc_simcontext.cpp and systemc\communication\sc_prim_channel.h, part of the SystemC library)

The error happens after several iterations through the code above. The call to m_prim_channel_registry->perform_update() raises the exception 0xC0000005: Access violation writing location 0x0F4CD9E9.
This happens only when the application is built in the Release configuration.

Looking at the assembly code, I see that sc_prim_channel_registry::perform_update() was inlined, and the inner call m_update_array[i]->perform_update() seems to corrupt the stack frame of the calling function.
When m_update_last = -1; is executed, &m_update_last no longer points to a valid memory location and the exception is thrown.
(m_update_last is a plain int member of class sc_prim_channel_registry.)

    m_update_phase = true;
    m_prim_channel_registry->perform_update();
1034D99E  mov         eax,dword ptr [esi+10h]  
1034D9A1  mov         byte ptr [esi+0A3h],1  
1034D9A8  mov         dword ptr [ebp-18h],eax  
1034D9AB  mov         ebx,dword ptr [eax+28h]  
1034D9AE  test        ebx,ebx  
1034D9B0  js          $LN163+0FEh (1034D9D0h)  
1034D9B2  mov         esi,eax  
1034D9B4  mov         eax,dword ptr [esi+20h]  
1034D9B7  mov         edi,dword ptr [eax+ebx*4]  
1034D9BA  mov         ecx,edi  
1034D9BC  mov         eax,dword ptr [edi]  
1034D9BE  call        dword ptr [eax+14h]  
1034D9C1  sub         ebx,1  
1034D9C4  mov         byte ptr [edi+1Ch],0  
1034D9C8  jns         $LN163+0E2h (1034D9B4h)  
1034D9CA  mov         esi,dword ptr [this]  
1034D9CD  mov         eax,dword ptr [ebp-18h]  
1034D9D0  mov         dword ptr [eax+28h],0FFFFFFFFh  
    m_update_phase = false;

The exception is thrown at address 1034D9D0, so the last instructions executed are:

0F97D9CD  mov         eax,dword ptr [ebp-18h]  
0F97D9D0  mov         dword ptr [eax+28h],0FFFFFFFFh

The address of m_prim_channel_registry is held in [ebp-18h] and in eax, and [eax+28h] is m_update_last.

Looking in the watch window at esp and ebp before the inner call perform_update(), I see that:

    ebp-18h 0x0022fd5c  unsigned int
    esp 0x0022fd60  unsigned int

This is strange. The two differ by only 4 bytes, so the next push onto the stack will land exactly on ebp-18h and overwrite [ebp-18h]!
[ebp-18h] holds a copy of this->m_prim_channel_registry. The call at 1034D9BE (call dword ptr [eax+14h]) pushes its return address onto the stack and thereby corrupts the contents of [ebp-18h]. It looks like the stack has grown (downwards) too far and is corrupting the previous frame.

My questions are:

Am I analyzing the issue correctly? Did I miss something here?
What could cause such a corruption? I assume the issue is not related to either the compiler or the SystemC library, probably something that happened earlier someplace else.
What are the techniques for debugging such a corruption?

Update

I believe I found the problem, but I can't say I understand this completely.
In the same function (sc_simcontext::crunch) where the outer perform_update() is invoked, SystemC methods are invoked as well:

    // execute method processes

    sc_method_handle method_h = pop_runnable_method();
    while( method_h != 0 ) {
        try {
            method_h->execute();
        }
        catch( const sc_exception& ex ) {
            cout << "\n" << ex.what() << endl;
            m_error = true;
            return;
        }
        method_h = pop_runnable_method();
    }

These methods are deferred function calls that were registered earlier.
One of these methods was returning with ret 4, which pops 4 bytes more than its caller had pushed. Every call therefore moved esp up by 4 bytes, shrinking the caller's stack frame a little each time, to the point where the corruption described above happened.
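To make the arithmetic concrete, here is a toy model of that drift. It is only a sketch: ToyStack, simulate_call and the starting value 0x0022fd50 are made up for illustration, and the model assumes the caller pushes no arguments. Only the addresses 0x0022fd5c (ebp-18h) and 0x0022fd60 (esp) come from the watch window above.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the x86 stack pointer; this is not real machine state.
struct ToyStack {
    std::uint32_t esp;
};

// Models one call from a caller that pushes no arguments to a callee
// that returns with `ret callee_pop_bytes`.
void simulate_call(ToyStack& s, std::uint32_t callee_pop_bytes) {
    s.esp -= 4;                 // call: push the return address
    s.esp += 4;                 // ret: pop the return address
    s.esp += callee_pop_bytes;  // ret N: also discard N bytes of "arguments"
}

// Value of esp after `calls` consecutive calls of the same callee.
std::uint32_t esp_after_calls(std::uint32_t esp, unsigned calls,
                              std::uint32_t callee_pop_bytes) {
    ToyStack s{esp};
    for (unsigned i = 0; i < calls; ++i) {
        simulate_call(s, callee_pop_bytes);
    }
    return s.esp;
}

// Number of mismatched calls until esp has moved above a local stored at
// `slot`, i.e. until the slot lies outside the live part of the stack.
unsigned calls_until_exposed(std::uint32_t esp, std::uint32_t slot,
                             std::uint32_t callee_pop_bytes) {
    assert(callee_pop_bytes > 0);  // a matching `ret` never exposes the slot
    ToyStack s{esp};
    unsigned calls = 0;
    while (s.esp <= slot) {
        simulate_call(s, callee_pop_bytes);
        ++calls;
    }
    return calls;
}
```

With the hypothetical start esp = 0x0022fd50, four calls through a ret 4 callee leave esp at 0x0022fd60, and the next pushed return address is written to esp - 4 = 0x0022fd5c, which is exactly the slot at ebp-18h. With a matching ret (0 extra bytes) esp never moves.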

And how did I manage to register a corrupted SystemC method?
Apparently it's a bad idea to use SC_METHOD(f) when f is a virtual function of the module. Doing that caused a different, unrelated "random" function to be called.
I'm not exactly sure why it happens this way or why this limitation exists. I also don't remember seeing any warning about using virtual member functions as SystemC methods, but it was definitely the problem. When debugging the method registration inside the SC_METHOD call itself, I could see that the stored function pointer pointed to a different function than the one passed to the SC_METHOD macro.

To fix the problem I called SC_METHOD(wrapper_f), where wrapper_f is a simple non-virtual member function of the module that does nothing but call f, the original virtual function. That's it.
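The shape of that workaround, sketched without SystemC. All names here (module_base, derived_module, registered_method) are hypothetical stand-ins, and a plain pointer to member function takes the place of whatever SC_METHOD stores internally.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a module; this is not a SystemC type.
struct module_base {
    std::vector<std::string> log;

    virtual ~module_base() = default;

    // The virtual function that should run as the process body.
    virtual void f() { log.push_back("module_base::f"); }

    // Non-virtual forwarder: this is the function whose address is registered.
    void wrapper_f() { f(); }
};

struct derived_module : module_base {
    void f() override { log.push_back("derived_module::f"); }
};

// Hypothetical stand-in for a registered process: an object plus a
// pointer to one of its non-virtual member functions.
struct registered_method {
    module_base* host;
    void (module_base::*fn)();

    void execute() const { (host->*fn)(); }
};

// Registers wrapper_f on a derived module, runs it once, and reports
// which implementation of f() was actually reached.
std::string call_through_wrapper() {
    derived_module m;
    registered_method r{&m, &module_base::wrapper_f};
    r.execute();
    return m.log.back();
}
```

Only the non-virtual wrapper_f ever has its address taken; the virtual dispatch happens inside it, so the override in the derived class is still the one that runs.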

Upvotes: 5

Answers (2)

user1143634

You are probably having issues with member function pointers on MSVC.

Consider the following code, file main.cpp:

#include <cstdio>

struct base;
typedef void (base::*baseptr_t)();

struct base {
    void foo() { }
};

void callfoo(base *obj, baseptr_t ptr);

int main()
{
    base obj;
    std::printf("sizeof(baseptr_t)=%zu\n", sizeof(baseptr_t));
    callfoo(&obj, &base::foo);
}

and file callfoo.cpp:

#include <cstdio>

struct base;
typedef void (base::*baseptr_t)();

void callfoo(base *obj, baseptr_t ptr)
{
    std::printf("sizeof(baseptr_t)=%zu\n", sizeof(baseptr_t));
    (obj->*ptr)();
}

On x86_64 this prints:

sizeof(baseptr_t)=8
sizeof(baseptr_t)=24

before crashing with an access violation.

This is because MSVC generates 8-byte pointers for classes whose definition is known, but has to generate 24-byte pointers if the class definition is not available.

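The compiler has ways to control this behavior: the /vmg option, the pointers_to_members pragma, and the __single_inheritance, __multiple_inheritance and __virtual_inheritance keywords.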

PS: I wasn't able to reproduce this, but you can also check the sc_process.h header from SystemC, which has the following lines:

#if defined(_MSC_VER)
#if ( _MSC_VER > 1200 )
#   define SC_USE_MEMBER_FUNC_PTR
#endif
#else
#   define SC_USE_MEMBER_FUNC_PTR
#endif

You can try undefining this macro for your build; in that case SystemC will use a different strategy when calling the process function.

PS2: A member function pointer can be 8, 16 or 24 bytes in size depending on the class hierarchy, so there have to be three ways to dereference one, and each way has to handle both virtual and non-virtual calls.
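If you want to see what your own compiler does, a small probe like the one below prints the sizes for a few hierarchies. It is a sketch: the struct names and the mfp_size helper are made up, and all classes are fully defined here, so MSVC would pick its smallest representation for each. I am deliberately not listing expected output, because the numbers depend on the compiler and ABI (GCC and Clang, for instance, typically report the same size for all three).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical classes covering the three hierarchy kinds.
struct A { void f() {} };
struct B { void g() {} };
struct Multi : A, B { void h() {} };     // multiple inheritance
struct Virt : virtual A { void k() {} }; // virtual inheritance

// Size of a pointer to a `void()` member function of class C.
template <class C>
constexpr std::size_t mfp_size() {
    return sizeof(void (C::*)());
}

// Prints the sizes so they can be compared across compilers and flags.
void print_member_pointer_sizes() {
    std::printf("single inheritance:   %zu\n", mfp_size<A>());
    std::printf("multiple inheritance: %zu\n", mfp_size<Multi>());
    std::printf("virtual inheritance:  %zu\n", mfp_size<Virt>());
    std::printf("plain data pointer:   %zu\n", sizeof(void*));
}
```

Calling print_member_pointer_sizes() in every translation unit that passes such pointers around is a cheap way to spot a disagreement like the 8 versus 24 above.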

Upvotes: 3

Israel Unterman

Reputation: 13510

It seems you know what you are doing.

I can offer advice rather than a solution, but it is something I have run into more than a few times as a cause of stack corruption.

Check the function causing the corruption, perform_update(). Does it define a big array as a local variable? If so, it probably exceeds the stack and overwrites the return data and other important data there. This is the most common cause of stack corruption I have encountered.

It is a sneaky problem because it depends on the size of the local array and on how much stack you have, which varies from system to system.
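If you do find such an array, the usual remedy is to move the buffer off the stack. A minimal sketch, assuming a hypothetical buffer of 4M doubles (kSamples and sum_on_heap are made-up names, not anything from SystemC):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical size, chosen only because 32 MB is far more than a
// thread's stack usually provides.
constexpr std::size_t kSamples = 4 * 1024 * 1024;

// Risky version, shown only as a comment:
//   double samples[kSamples];   // 32 MB of automatic (stack) storage

// Same work with the buffer on the heap; only the small vector object
// itself lives in the stack frame.
double sum_on_heap() {
    std::vector<double> samples(kSamples, 1.0);
    double total = 0.0;
    for (double s : samples) {
        total += s;
    }
    return total;
}
```

The function behaves the same regardless of how much stack the thread was given, which removes the system-to-system variation.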

Upvotes: 0
