Reputation: 61727
We have a problem on our website, seemingly at random (every day or so, up to once every 7-10 days) the website will become unresponsive.
We have two web servers on Azure, and we use Redis.
I've managed to run DotNetMemory and caught it when it crashes, and what I observe is under Event handlers leak
two items seem to increase in count into the thousands before the website stops working. Those two items are CaliEventHandlerDelegateProxy
and ArglessEventHandlerProxy
. Once the site crashes, we get lots of Redis exceptions that it can't connect to the Redis server. According to Azure Portal, our Redis server load never goes above 10% in peak times and we're following all best practises.
I've spent a long time going through our website ensuring that there are no obvious memory leaks, and have patched a few cases that went under the radar. Anecdotally, these seem to of improved the website stability a little. Things we've checked:
Our latest issue is that when the web server runs fine, but then we run DotNetMemory and attach it to the w3wp.exe process it causes the CaliEventHandlerDelegateProxy
and ArglessEventHandlerProxy
event leaks to increase rapidly until the site crashes! So the crash is reproducible just by running DotNetMemory. Here is a screenshot of what we saw:
I'm at a loss now, I believe I've exhausted all possibilities of memory leaks in our code base, and our "solution" is to have the app pools recycle every several hours to be on the safe side.
We've even tried upgrading Redis to the Premium tier, and even upgraded all drives on the webservers to SSDs to see if it helps things which it doesn't appear to.
Can anyone shed any light on what might be causing these issues?
Upvotes: 3
Views: 146
Reputation: 2508
All iDisposable objects are now wrapped in using blocks (we did this strictly before but we did find a few not disposed properly)
We can't say a lot about crash without any information about it, but I have some speculations about it. I see 10 000 (!) not disposed objects handled by finalization queue. Let start with them, find all of them and add Dispose call in your app. Also I would recommend to check how many system handles utilized by your application. There is an OS limit on number of handles and if they are exceeded no more file handles, network sockets, etc can be created. I recommend it especially since the number of not disposed objects.
Also if you have a timeout on accessing Redis get performance profiler and look why so. I recommend to get JetBrains dotTrace and use TIMELINE mode to get a profile of your app, it will show thread sleeping, threads contention and many many more information what will help you to find a problem root. You can use command line tool to obtain profile data, in order not to install GUI application on the server side.
it causes the CaliEventHandlerDelegateProxy and ArglessEventHandlerProxy event leaks to increase rapidly
dotMemory doesn't change your application code and doesn't allocate any managed objects in profiled process. Microsoft Profiling API injects a dll (written in c++) into the profiling process, it's a part of dotMemory, named Profilng Core, playing the role of the "server" (where standalone dotMemory written in C# is a client). Profiling Core doing some work with gathered data before sending it to the client side, it requires some memory, which allocated, of course, in the address space of the profiling process but it doesn't affect managed memory.
Memory profiling may affect performance of your application. For example, profiling API disables concurrent GC when application is under profiling or memory allocation data collecting can significantly slow down your application. Why do you thing that CaliEventHandlerDelegateProxy and ArglessEventHandlerProxy are allocated only under dotMemory profiling? Could you please describe how you explored this?
Event handlers are unsubscribed - there are very few in our code base
dotMemory reports an event handler as a leak means there is only one reference to it - from event source at there is no possibility to unsubscribe from this event. Check all these leaks, find yours at look at the code how it is happened. Anyway, there are only 110.3 KB retained by these objects, why do you decide your site crashed because of them?
I'm at a loss now, I believe I've exhausted all possibilities of memory leaks in our code base
Take several snapshots in a period of time when memory consumption is growing, open full comparison of some of these snapshots and look at all survived objects which should not survive and find why they survived. This is the only way to prove that your app doesn't have memory leak, looking the code doesn't prove it, sorry.
Hope if you perform all the activities I recommend you to do (performance profiling, full snapshots and snapshots comparison investigation, not only inspections view, checking why there are huge amount of not disposed objects) you will find and fix the root problem.
Upvotes: 4