How can I find details of the Windows C++ memory allocator that I am using?
When I debug my C++ application, the call stack shows the following:
ntdll.dll!RtlEnterCriticalSection() - 0x4b75 bytes
ntdll.dll!RtlpAllocateHeap() - 0x2f860 bytes
ntdll.dll!RtlAllocateHeap() + 0x178 bytes
ntdll.dll!RtlpAllocateUserBlock() + 0x56c2 bytes
ntdll.dll!RtlpLowFragHeapAllocFromContext() - 0x2ec64 bytes
ntdll.dll!RtlAllocateHeap() + 0xe8 bytes
msvcr100.dll!malloc() + 0x5b bytes
msvcr100.dll!operator new() + 0x1f bytes
My multithreaded code scales very poorly, and sampling profiling indicates that malloc is currently a bottleneck. The stack suggests that memory allocation takes a lock. How can I find details of this particular malloc implementation?
I've read that the performance of the Windows 7 system allocator is now competitive with allocators such as tcmalloc and jemalloc. I am running Windows 7 and building with Visual Studio 2010. Is msvcr100.dll the fast, scalable "Windows 7 system allocator" that is often described as "state of the art"?
On Linux, I've seen dramatic performance gains in multithreaded code by changing the allocator, but I've never experimented with this on Windows. Thanks.
"I am simply asking what malloc implementation I am using, with maybe a link to some details about my particular version of this implementation."
The call stack you are seeing shows that the MSVCRT (more exactly, its default operator new => malloc) calls into the Win32 Heap functions. (I do not know whether malloc routes all requests directly to the CRT's Win32 heap or does some additional caching, but if you have VS you should have the CRT source code too, so you should be able to check that.) (The Windows Internals book also talks about the Heap.)
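The `RtlpLowFragHeapAllocFromContext` frame in the stack already suggests that the heap behind `malloc` has the low-fragmentation heap (LFH) front end enabled. As a sketch of how one could confirm this at runtime, assuming an MSVC build: `_get_heap_handle()` (from `<malloc.h>`) returns the CRT's heap handle, and `HeapQueryInformation` with `HeapCompatibilityInformation` reports 0 for a standard heap, 1 for look-aside lists and 2 for the LFH. The helper names `heap_mode_name` and `crt_heap_mode` are hypothetical, and non-MSVC builds simply return a placeholder string.

```cpp
#include <cassert>
#include <string>
#if defined(_MSC_VER)
#include <windows.h>
#include <malloc.h>  // _get_heap_handle (MSVC CRT)
#endif

// Hypothetical helper: maps a HeapCompatibilityInformation value to a name.
std::string heap_mode_name(unsigned long mode) {
    switch (mode) {
    case 0: return "standard heap";
    case 1: return "look-aside lists";
    case 2: return "low-fragmentation heap";
    default:
        return "unknown mode " + std::to_string(static_cast<unsigned long long>(mode));
    }
}

// Hypothetical helper: reports which front end the CRT's malloc heap uses.
std::string crt_heap_mode() {
#if defined(_MSC_VER)
    HANDLE crt_heap = reinterpret_cast<HANDLE>(_get_heap_handle());
    assert(crt_heap != nullptr);
    ULONG mode = 0;
    if (!HeapQueryInformation(crt_heap, HeapCompatibilityInformation,
                              &mode, sizeof(mode), nullptr)) {
        return "HeapQueryInformation failed, error " +
               std::to_string(static_cast<unsigned long long>(GetLastError()));
    }
    return heap_mode_name(mode);
#else
    return "not an MSVC CRT build";
#endif
}
```

Logging `crt_heap_mode()` once at startup would show which mode the CRT heap reports on the machine in question.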
As general advice: in my experience (VS 2005, though judging from Hans' answer on the other question, VS 2010 may be similar), the multithreaded performance of the CRT heap can cause noticeable problems even when you are not doing an extreme number of allocations.
That RtlEnterCriticalSection is exactly what it looks like, a Win32 critical section: cheap to lock under low contention, but under higher contention you will see suboptimal runtime behaviour. (Bah! Ever tried to profile or optimize code that chokes on synchronization? It's a mess.)
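A rough way to check whether allocation is what keeps the code from scaling is to measure allocation throughput at increasing thread counts. The sketch below is a hypothetical portable micro-benchmark (`allocations_per_second` is not from any library) using C++11 `std::thread`; if the rate barely grows, or even drops, as threads are added, lock contention in the heap is a likely culprit. It measures only the allocator, not the real workload.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical scaling check: `threads` workers each run `rounds` rounds of
// allocating and then freeing a batch of small blocks with new[]/delete[]
// (in the asker's stack, operator new forwards to malloc).
// Returns the total number of allocations per second across all threads.
double allocations_per_second(unsigned threads, unsigned rounds) {
    assert(threads > 0);
    const std::size_t batch = 64;       // blocks alive at once per thread
    const std::size_t block_size = 32;  // small blocks, typical of node-based code

    auto worker = [&] {
        std::vector<char*> blocks(batch);
        for (unsigned r = 0; r < rounds; ++r) {
            for (auto& b : blocks) {
                b = new char[block_size];
                // A volatile write keeps the compiler from optimizing the allocation away.
                static_cast<volatile char*>(b)[0] = static_cast<char>(r);
            }
            for (char* b : blocks) delete[] b;
        }
    };

    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;

    const double total = static_cast<double>(threads) * rounds * batch;
    return elapsed.count() > 0.0 ? total / elapsed.count() : total;
}
```

For example, one could compare `allocations_per_second(1, 100000)` with `allocations_per_second(4, 100000)` on a machine with at least four cores.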
One solution is to split the heaps: using separate heaps has given us significant improvements, even though each heap is still thread-safe (we do not pass HEAP_NO_SERIALIZE).
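A minimal sketch of such a private heap, assuming the Win32 `HeapCreate`/`HeapAlloc`/`HeapFree`/`HeapDestroy` functions; the `PrivateHeap` class is hypothetical. Passing 0 as the flags keeps the heap serialized, as described above. Off Windows the class falls back to `std::malloc`/`std::free` (which does not give a separate heap) only so the sketch compiles anywhere. It uses C++11 `= delete` and `noexcept`, which Visual Studio 2010 does not support, so on that compiler those would have to be dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical private heap owned by one subsystem, so its allocations no
// longer contend on the lock of the heap used by malloc/operator new.
class PrivateHeap {
public:
    PrivateHeap() {
#ifdef _WIN32
        // Flags 0: still serialized (no HEAP_NO_SERIALIZE).
        // Initial size 0, maximum size 0: the heap is growable.
        heap_ = HeapCreate(0, 0, 0);
        if (heap_ == nullptr) throw std::bad_alloc();
#endif
    }
    ~PrivateHeap() {
#ifdef _WIN32
        HeapDestroy(heap_);  // releases all memory still allocated from this heap
#endif
    }
    PrivateHeap(const PrivateHeap&) = delete;
    PrivateHeap& operator=(const PrivateHeap&) = delete;

    void* allocate(std::size_t size) {
#ifdef _WIN32
        void* p = HeapAlloc(heap_, 0, size);
#else
        void* p = std::malloc(size ? size : 1);  // fallback only, not a separate heap
#endif
        if (p == nullptr) throw std::bad_alloc();
        return p;
    }

    void deallocate(void* p) noexcept {
        if (p == nullptr) return;
#ifdef _WIN32
        BOOL ok = HeapFree(heap_, 0, p);
        assert(ok && "block was not allocated from this heap");
        (void)ok;
#else
        std::free(p);
#endif
    }

private:
#ifdef _WIN32
    HANDLE heap_ = nullptr;
#endif
};
```

Each subsystem (say, a parser or a cache) would own one `PrivateHeap` and allocate from it, so its allocations only contend with each other.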
Since your allocations come in via operator new, you might be able to give some of the frequently allocated classes their own allocators (class-specific operator new/delete). Or some of your containers could benefit from custom allocators that use a separate heap.
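Both ideas can be sketched on top of one separate heap. In the hypothetical sketch below, `Node` stands in for a frequently allocated class and gets class-specific `operator new`/`operator delete`, while `HeapAllocator<T>` is a minimal C++11 standard-library allocator for containers. The `separate_heap` namespace is illustrative: it creates a Win32 heap with `HeapCreate` on Windows and falls back to `std::malloc` elsewhere only so the sketch compiles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Illustrative process-wide separate heap shared by the allocators below.
namespace separate_heap {
#ifdef _WIN32
inline HANDLE handle() {
    // Created on first use; C++11 makes this static's initialization thread-safe.
    // Flags 0 keep the heap serialized; maximum size 0 makes it growable.
    static HANDLE heap = HeapCreate(0, 0, 0);
    return heap;
}
#endif

inline void* allocate(std::size_t size) {
#ifdef _WIN32
    void* p = handle() ? HeapAlloc(handle(), 0, size) : nullptr;
#else
    void* p = std::malloc(size ? size : 1);  // fallback only, not a separate heap
#endif
    if (p == nullptr) throw std::bad_alloc();
    return p;
}

inline void deallocate(void* p) noexcept {
    if (p == nullptr) return;
#ifdef _WIN32
    BOOL ok = HeapFree(handle(), 0, p);
    assert(ok && "block was not allocated from the separate heap");
    (void)ok;
#else
    std::free(p);
#endif
}
}  // namespace separate_heap

// A frequently allocated class: class-specific operator new/delete route
// every `new Node` / `delete node` to the separate heap instead of malloc.
struct Node {
    int value = 0;
    Node* next = nullptr;

    static void* operator new(std::size_t size) { return separate_heap::allocate(size); }
    static void operator delete(void* p) noexcept { separate_heap::deallocate(p); }
};

// Minimal C++11 allocator so standard containers use the separate heap.
template <class T>
struct HeapAllocator {
    using value_type = T;

    HeapAllocator() noexcept = default;
    template <class U>
    HeapAllocator(const HeapAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();  // overflow guard
        return static_cast<T*>(separate_heap::allocate(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { separate_heap::deallocate(p); }
};

// All HeapAllocators share one heap, so any of them can free another's memory.
template <class T, class U>
bool operator==(const HeapAllocator<T>&, const HeapAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const HeapAllocator<T>&, const HeapAllocator<U>&) noexcept { return false; }

using SeparateHeapIntVector = std::vector<int, HeapAllocator<int>>;
```

With this, `new Node` no longer goes through `malloc`, and a `SeparateHeapIntVector` keeps its buffer in the separate heap.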
In one case, we were using libxml2 for XML parsing, and while building up the DOM tree it simply swamped the system with malloc calls. Luckily, libxml2 uses its own set of memory allocation routines, which can easily be replaced by a thin wrapper over the Win32 Heap functions. This gave us huge improvements, because XML parsing no longer interfered with the rest of the system's allocations.
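A sketch of such a thin wrapper, assuming libxml2's `xmlMemSetup(freeFunc, mallocFunc, reallocFunc, strdupFunc)` entry point from `<libxml/xmlmemory.h>`; the `xml_heap_*` names are hypothetical. The wrappers match the `xmlMallocFunc`, `xmlReallocFunc`, `xmlFreeFunc` and `xmlStrdupFunc` callback signatures and return NULL on failure instead of throwing, since they are called from C. The non-Windows branch falls back to the C library only so the code compiles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#ifdef _WIN32
#include <windows.h>
#endif

#ifdef _WIN32
// Hypothetical dedicated heap for libxml2, created on first use
// (C++11 makes this static's initialization thread-safe).
static HANDLE xml_heap() {
    static HANDLE heap = HeapCreate(0, 0, 0);  // serialized, growable
    return heap;
}
#endif

extern "C" void* xml_heap_malloc(std::size_t size) {
#ifdef _WIN32
    return xml_heap() ? HeapAlloc(xml_heap(), 0, size) : nullptr;
#else
    return std::malloc(size);  // fallback: no separate heap off Windows
#endif
}

extern "C" void* xml_heap_realloc(void* p, std::size_t size) {
#ifdef _WIN32
    if (p == nullptr) return xml_heap_malloc(size);  // realloc(NULL, n) acts like malloc(n)
    return HeapReAlloc(xml_heap(), 0, p, size);
#else
    return std::realloc(p, size);
#endif
}

extern "C" void xml_heap_free(void* p) {
    if (p == nullptr) return;
#ifdef _WIN32
    BOOL ok = HeapFree(xml_heap(), 0, p);
    assert(ok && "block was not allocated from the libxml2 heap");
    (void)ok;
#else
    std::free(p);
#endif
}

extern "C" char* xml_heap_strdup(const char* s) {
    const std::size_t len = std::strlen(s) + 1;  // include the terminator
    char* copy = static_cast<char*>(xml_heap_malloc(len));
    if (copy != nullptr) std::memcpy(copy, s, len);
    return copy;
}
```

The wrappers would be registered once, before any other libxml2 call, with `xmlMemSetup(xml_heap_free, xml_heap_malloc, xml_heap_realloc, xml_heap_strdup);`.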