Reputation: 736
The environment in question is multi-tenant. It is made up of two App Service plans in the same resource group: one contains a .NET Framework Web API, the other contains the front-end client (an ASP.NET Web Forms app). Both plans run on the P1V3 tier (195 minimum ACU/vCPU, 8 GB memory, 2 vCPU). The reason we separated them into different plans was to be able to apply more tailored scale-out settings (but scale-out settings are not the reason I am posting this question).
Lately we've been working on several performance improvements. We pinpointed how certain dependencies were creating a lot of waiting time and at points destabilising the system, then re-designed / re-structured a lot of aspects to mitigate these limitations. We think we've achieved quite a decent improvement. However, we're still experiencing a gradual memory increase in the App Service plan (Web API) that never recovers.
I believe that over time this is leading to a degradation of service. For example, loading the landing page (which requires loading some user settings) without any heavy usage takes around 2 seconds, but ends up taking between 6 and 10 seconds once the system has been used for quite a long time (and has consumed considerable memory).
I tried monitoring the "Committed Memory" metric from Application Insights / Live Metrics:
After start-up, the "Committed Memory" metric stabilised at around 258 MB. I then sent continuous requests for about 40 minutes (270 requests to the API) and the memory increased to 529 MB.
I stopped the requests and kept monitoring, and the memory continued increasing little by little, without any actual usage. The only requests still arriving were the 5-minute refresh pings from Azure (as the app is set not to idle out).
What I've tried:
I started disposing of large objects that are initialised for common requests and triggered GC.Collect() in the Dispose override at API controller level.
I focused on 3 specific requests for the time being (the ones used in my test above) and implemented clearing of the main objects / DataTables within the try/catch/finally block. A rough sketch of both changes follows below.
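For illustration, here is a simplified sketch of both changes (the controller, the data access call, and the DTO mapping are placeholders, not our actual code):

```csharp
using System;
using System.Data;
using System.Web.Http;

// Placeholder controller illustrating the two changes described above.
public class UserSettingsController : ApiController
{
    [HttpGet]
    public IHttpActionResult GetSettings()
    {
        DataTable settingsTable = null;
        try
        {
            settingsTable = LoadSettingsTable();                 // placeholder for the real data access call
            return Ok(new { rows = settingsTable.Rows.Count });  // placeholder for the real DTO mapping
        }
        catch (Exception ex)
        {
            return InternalServerError(ex);
        }
        finally
        {
            // Clearing / disposing of the main objects and DataTables once the response is built.
            if (settingsTable != null)
            {
                settingsTable.Clear();
                settingsTable.Dispose();
            }
        }
    }

    // Explicit collection added at controller level (see my questions below).
    protected override void Dispose(bool disposing)
    {
        base.Dispose(disposing);
        if (disposing)
        {
            GC.Collect();
        }
    }

    private DataTable LoadSettingsTable()
    {
        return new DataTable(); // placeholder
    }
}
```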
My questions:
Is the "Committed Memory" in the Application Insights / Live Metrics section the "currently in use" memory? If so, does this constitute a memory leak? (From the requests I am purposely focusing on, it is in my opinion very unlikely a memory leak, as they are quite simple and we are not utilising unmanaged code or so)
What could be the reason for such behaviour? Is this normal behaviour in Azure App Service environments?
Is the fact that the app is forced not to idle out perhaps preventing garbage collection from ever happening? And why doesn't GC.Collect() just clean up the memory when explicitly called?
Is it perhaps something to do with the plan - in our case P1V3?
Is Application Insights contributing to added memory consumption? Or even worse, restricting it from releasing memory?
Is there anything else to do with regards to cleaning up memory / resources that we are perhaps overlooking?
Am I right to assume that after a few hours / days of usage (without a restart), if memory / resources have not been released properly, performance degradation is highly likely? What else contributes to performance degradation?
UPDATE -
After a further hour or two, I visited the app-specific metrics and compared them to the Application Insights figures (note that the App Service plan contains the Web API only!).
Why the discrepancy? Which one is right!?
Upvotes: 1
Views: 869
Reputation: 17010
Let me see if I can shed some light on your questions. First, .NET is designed to reclaim memory when it is needed. There are other factors, but the "server" GC (the original design) was found to be more efficient if it did not try to collect too often. Leaving unused objects sitting in memory was less insidious than clearing them out every time (read Richter's book on .NET to understand some of the internals; quite a bit still applies to the cloud).
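If you want to confirm which GC flavour the worker process is actually running, a quick check from managed code looks something like this (these are standard .NET Framework APIs, nothing Azure-specific; where you log the result is up to you):

```csharp
using System;
using System.Runtime;

// Diagnostic you could log at startup: which GC mode is active and how big the managed heap is.
public static class GcInfo
{
    public static string Describe()
    {
        return string.Format(
            "ServerGC={0}, LatencyMode={1}, ManagedHeapBytes={2:N0}",
            GCSettings.IsServerGC,      // true when the per-core "server" GC is in use
            GCSettings.LatencyMode,
            GC.GetTotalMemory(false));  // approximate managed heap size, without forcing a collection
    }
}
```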
As for whether this is normal: some of it may be, but you are growing much more than I would normally see. And your increase from 2 seconds (which is far too long IMO) to 6 seconds indicates a problem. More than likely there is something in the code. Have I seen this before? Yes, in a horribly architected application; on the good side, that one was far worse and got to the point where it would go from flying to a complete standstill requiring manual intervention. The reason there, which is probably different from yours, was a decision to make the data access on the cloud service side (not a real API) a central hub that received objects and command types and then parsed them to send the correct command (i.e. to make app building automagic from POCOs, the designer of the app builder reinvented how LINQ to SQL works under the hood). Yes, this is a digression, but database clogs can easily cascade up to the web API, so it is a good place to check (data access or database code).
App not idling? Possibly, but more likely you have something else holding objects in memory, like deadlocks or race conditions on the database, or some architectural issue in the web API.
As for GC, even explicitly calling it is not a magic bullet. In fact, it can make things worse than letting the runtime handle collection for you. NOTE: Many sites recommend you don't explicitly call Collect (and, yes, GC.Collect() is not guaranteed to force an immediate collection if the runtime sees it as problematic under the hood). In short, a lack of GC is not likely the issue.
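If you do keep an explicit call around while diagnosing, a gentler option (a sketch, not a recommendation) is to let the runtime decide whether collecting is worthwhile at that moment:

```csharp
// GCCollectionMode.Optimized asks rather than demands: the runtime may skip the
// collection entirely if it judges it not to be productive right now.
GC.Collect(2, GCCollectionMode.Optimized, blocking: false);
```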
The plans have some effect on performance, but I have had an app that ran sweetly on a lower-level plan. The big reason for moving to higher plans is being able to scale and be elastic, as well as some uptime features, etc.
As for memory, Application Insights adds a slight overhead, but it is not a big issue. Early on in Azure it was a bit more of a concern, but unless you are running at Google scale, I would not consider turning it off, and I would never consider it as my first option. I don't see how it would stop memory from being released.
Cleaning up resources? I really don't think this is the first place to look. More likely, it is something in the database, the data access code, or the API code.
Degradation? If you have an app with a memory leak, temporary or long term, or some other blocking condition, it will almost always cause performance degradation over time. This is especially true if you are creating multiple deadlocks in the database and forcing timeouts. And, no, moving to a bunch of microservices built serverless (Functions) will not magically solve the problem. I say this because I hear people suggesting it more often than I would like.
TL;DR version
More than likely it is something in the code, the data access, or the database. APIs should be paper thin overall, except perhaps for some business rules to ensure you don't choke the consumer (the web site here). I would focus on looking for issues in the Azure SQL database (or other database?) and/or the API's data access code, as your screencap looks a lot like the issue we had with a vendor's code. A first thing to check in the data access layer is shown below.
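As a concrete first check (purely illustrative; the table and parameter names are placeholders), make sure every connection, command, and reader in the data access layer is wrapped in a using block rather than left for the finalizer:

```csharp
using System.Data.SqlClient;

// Illustrative only: the disposal pattern to verify throughout the data access code.
public static int CountUserSettings(string connectionString, int userId)
{
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "SELECT COUNT(*) FROM dbo.UserSettings WHERE UserId = @UserId", connection)) // placeholder query
    {
        command.Parameters.AddWithValue("@UserId", userId);
        connection.Open();
        return (int)command.ExecuteScalar();
    }
}
```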
Upvotes: 1