Reputation: 635
This question is very similar to: windbg memory leak investigation - missing heap memory
Except that in my case everything is x86, whereas the answer offered on that post says that Windbg x64 is broken.
In my case, when I do "!heap -s" I get:
************************************************************************************************************************
NT HEAP STATS BELOW
************************************************************************************************************************
LFH Key : 0x653c3365
Termination on corruption : DISABLED
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-----------------------------------------------------------------------------
00e70000 00000002 761224 757296 761012 6306 1149 51 0 572b1 LFH
00d60000 00001002 1292 128 1080 26 7 2 0 0 LFH
01050000 00001002 1292 1048 1080 271 18 2 0 31 LFH
..snip..
Where I am interested in the heap at 00e70000.
Next when I execute the command: !heap -stat -h 00e70000 -grp s 0n999
I get 509 lines of output for every block in that heap listing its groupsize, number of blocks that match that size, and the total size of memory used by all of the blocks of that size. Partial output is:
0:000> !heap -stat -h 00e70000 -grp s 0n999
heap @ 00e70000
group-by: TOTSIZE max-display: 999
size #blocks total ( %) (percent of total busy bytes)
1000 14a - 14a000 (20.24)
600c 16 - 84108 (8.10)
168 408 - 5ab40 (5.56)
154 404 - 55550 (5.23)
10d8 2a - 2c370 (2.71)
24 113f - 26cdc (2.38)
22750 1 - 22750 (2.11)
I then paste all that into excel, and convert the 3rd column to decimal and sum it up, and only get a total of 6.5 meg, or thereabout.
Both !address -summary, as well as !heap -s are indicating that I should be getting a sum of something in the neighborhood of 808 meg total. This is leading me to think that either I don't understand the units of the -stat command, or perhaps both x64 and x86 (entire Windbg) is broken, or that I have a more fundamental misunderstanding.
Could someone help me to understand which is the case?
Thanks!
Edit: Additional Information Using DebugDiag, I see the main (default) heap has 46/54 segments that all share a common feature, they are all 15.81 meg in size, and all of them are nearly completely allocated. That represents the total difference I am missing.
After seeing this, I recall that our native code is using FASTMM4, which probably accounts for both those segments as well as why I am not getting those objects inside them listed by Windbg.
Therefore, I am planning to remove FASTMM4 from the native code, and run the perf test again to see if this changes things. Please feel free to add anything helpful regarding this.
Second Edit, Additional Information: After removing FASTMM from our code base and re-running the tests, I see that the 15.81 MByte segments are still there and still leaking. These can be seen in a DebugDiag analysis as:
Segment Information
Base Address Reserved Size Committed Size Uncommitted Size Number of uncommitted ranges Largest uncommitted block Calculated heap fragmentation
0x00e70000 1020 KBytes 1020 KBytes 0 Bytes 1 0 Bytes 0% 0
0x03be0000 1020 KBytes 1020 KBytes 0 Bytes 1 0 Bytes 0% 0
0x04a20000 2 MBytes 2 MBytes 0 Bytes 1 0 Bytes 0% 0
0x051e0000 4 MBytes 4 MBytes 0 Bytes 1 0 Bytes 0% 0
0x0c4b0000 8 MBytes 8 MBytes 0 Bytes 1 0 Bytes 0% 0
0x19dc0000 15.81 MBytes 15.78 MBytes 28 KBytes 1 28 KBytes -11928.57% Unavailable
0x1c3b0000 15.81 MBytes 15.81 MBytes 0 Bytes 1 0 Bytes 0% 0
0x2c900000 15.81 MBytes 15.81 MBytes 0 Bytes 1 0 Bytes 0% 0
..snip..
where the new sections shown at the bottom marked as 15.81 MBytes extend for an additional 46 new segments and represent 727.26 mb of leaked memory on the unmanaged heaps.
Searching on the value of 15.81 MBytes leads me to a few various citations relating to the Microsoft VC Runtime:
https://social.msdn.microsoft.com/Forums/vstudio/en-US/e7534d01-57ed-455c-bc0d-edb1b87d0f52/microsoft-vc-runtime-heap-fragmentation?forum=vclanguage https://learn.microsoft.com/en-us/visualstudio/debugger/crt-debug-heap-details?view=vs-2019 Debugdiag shows "Microsoft VC Runtime Heap" using over 1gb
Using Windbg, I can display allocation information regarding the allocations, which appear as follows:
61f130c8: 08008 . 10008 [101] - busy (10000) Internal
61f230d0: 10008 . 10008 [101] - busy (10000) Internal
61f330d8: 10008 . 10008 [101] - busy (10000) Internal
61f430e0: 10008 . 08008 [101] - busy (8000) Internal
61f4b0e8: 08008 . 10008 [101] - busy (10000) Internal
61f5b0f0: 10008 . 10008 [101] - busy (10000) Internal
However, since they are marked as "Internal" they do not participate in the "stack back traces" (gflags option -ust) to determine the actual code that was executed to allocate them.
Can anyone direct me toward any additional information about this leak? It does eventually cause our application to crash. I need anything that can guide me to determining a method of how we can affect it to reduce or eliminate this leak.
Upvotes: 2
Views: 2445
Reputation: 635
I am posting this as an answer, because eventually several methods did lead us to the source of an unmanaged leak in our code. Some of what I will say here is mere hypothesis, because I have not found anything in Microsoft documentation to verify it.
In the original post I showed a section of a DebugDiag analysis that depicted an ever growing number of 15.81 segments in the default heap of the process. I have come to believe that this is just Windows's way of allowing a heap (in a system of possibly many heaps) to grow without making any assumptions about which heap will need to grow in a heavy load situation. They seem to get created with a 1mb segment, then another, then 2mb segment, then 4mb, 8mb, and 16mb. Thereafter, they only grow by 16mb (i.e. 15.81) as needed.
When the native heap is leaking, segments get added over and over like that.
Before the questions here had even started, we did a managed memory analysis using dump files made at various points of an extended 92 hour load test. We used both Visual Studio as well as Windbg's SOS commands, and found no "managed" growth. The only problem was that the unmanaged code was leaking as shown in the original post.
At that point we used "Gflags +ust" on the process to get stack-back-traces. This gave us entirely valid information, but insufficient data. It showed a large number of leaked blocks and claimed that they were allocated by the SecureString.ctor. Not seeing any SecureStrings (either alive or dead) on the managed heap, we chose to disregard what it was telling us at that time.
We then adopted a more low-tech process of locating the leaking code. We would test each API call in separate extended load tests and take dump analysis from DebugDiag until we either saw that it was leaking or accepted that it was not.
Once we found the API that was leaking we modified the server to essentially "gut-out" large tracks of the code it was executing, and repeated the debugdiag analysis until our "gutted-code" no longer showed a leak.
At that point we started returning sections of it, repeatedly taking measurements, and using either one of two methods on the server to 1) eliminate that code path by commenting it out, or 2) exacerbating it by putting a loop 1=1 to 1000 (or something appropriate to magnify the leak)
Once a code path was proven to leak, we drilled down into its scope and repeated that process as a kind of binary search. Eventually this lead us to 3 lines the first of which was (not coincidentally) allocating a "SecureString".
This was being passed to a method to decode it, which contained approximately this code:
//Convert to IntPtr using marshal
IntPtr tmp = Marshal.SecureStringToBSTR(SecureString_Param1);
//convert to string using marshal
string plain = Marshal.PtrToStringAuto(tmp);
//Return the now plain string
return plain;
The leaked memory was the unmanaged BSTR, which had, in fact, originally been allocated by the SecureString.ctor. This code was tested in a separate test application to verify.
Please feel free to add any comments to this post.
Upvotes: 3