Reputation: 481
There are two parts to this question. First, what falls under the purview of the Diagnostics / MaxDiskQuotaInMB setting? Is it everything under SvcFab/Log, or just SvcFab/Log/AppInstanceData/? More detail on this would be helpful.
Second, what is the proper course of action if FabricDCA.exe is running but the SvcFab/Log and SvcFab/Log/AppInstanceData/ folders exceed the limit we've set? My team set the quota to 10,000 MB, but SvcFab/Log regularly takes up 12-16 GB.
The cluster configuration in Azure recognizes the change to MaxDiskQuotaInMB, but there seems to be no impact on the node itself. I've also tried restarting FabricDCA.exe, and so far that hasn't helped either (after several hours).
One node in our cluster had so much space taken up by logs (over our limit) that remaining storage space was reduced to 1 MB.
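For reference, here is a rough sketch of one way to see which subfolders are taking the space on a node (the D:\SvcFab\Log path is an assumption; adjust it to your node's data root):

```python
# Sketch: summarize which subfolders of SvcFab/Log are using the most disk space.
# The path below is an assumption (a common data root on an Azure node); adjust as needed.
from pathlib import Path

LOG_ROOT = Path(r"D:\SvcFab\Log")  # assumed location; may differ per cluster config

def folder_size_mb(folder: Path) -> float:
    """Total size of all files under a folder, in MB (skips files that vanish mid-scan)."""
    total = 0
    for f in folder.rglob("*"):
        try:
            if f.is_file():
                total += f.stat().st_size
        except OSError:
            pass  # log file rotated or deleted while scanning
    return total / (1024 * 1024)

if __name__ == "__main__":
    for sub in sorted(LOG_ROOT.iterdir()):
        if sub.is_dir():
            print(f"{sub.name:30s} {folder_size_mb(sub):10.1f} MB")
    print(f"{'TOTAL':30s} {folder_size_mb(LOG_ROOT):10.1f} MB")
```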
Upvotes: 2
Views: 846
Reputation: 46
Posting a more complete answer since it may be helpful to other people.
Most of what lives under the SvcFab/Log folder falls under the quota set by MaxDiskQuotaInMB. There are a few things that may not, but the items that usually take up the most disk space are included. Keep in mind also that the cleanup task typically runs every 5 minutes, so usage may exceed the quota within that window.
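If you want to confirm that the cleanup cycle is actually running on a node, one option is to sample the folder size for a while and see whether it ever drops back under the quota. This is only a rough sketch; the path and quota below are assumptions to adjust for your environment:

```python
# Sketch: sample SvcFab/Log size periodically to see whether the cleanup task
# ever brings usage back under the quota. Path and quota value are assumptions.
import time
from pathlib import Path

LOG_ROOT = Path(r"D:\SvcFab\Log")   # assumed data root; adjust for your node
QUOTA_MB = 10_000                   # the MaxDiskQuotaInMB value you configured
SAMPLES = 6                         # ~30 minutes at one sample every 5 minutes

def size_mb(folder: Path) -> float:
    total = 0
    for f in folder.rglob("*"):
        try:
            if f.is_file():
                total += f.stat().st_size
        except OSError:
            pass  # log files rotate while scanning
    return total / (1024 * 1024)

for _ in range(SAMPLES):
    current = size_mb(LOG_ROOT)
    status = "over quota" if current > QUOTA_MB else "under quota"
    print(f"{time.strftime('%H:%M:%S')}  {current:,.0f} MB  ({status})")
    time.sleep(300)  # cleanup task runs roughly every 5 minutes
```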
If FabricDCA.exe is not properly cleaning files from this folder, you may be hitting a bug in the .NET runtime where all System.Threading.Timer instances stop firing; FabricDCA relies on these timers to schedule cleanup, so the disk never gets cleaned. The issue is tracked on the .NET Core side here: https://github.com/dotnet/coreclr/issues/26771. It seems to happen when the machine intermittently runs out of memory.
An auto-mitigation was added to FabricDCA in Service Fabric 7.0. The manual mitigation is usually to kill the FabricDCA.exe process; it should start again on its own, and after a few minutes it will resume cleaning.
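If you need to apply that manual mitigation on several nodes, here is a minimal sketch of scripting it (assuming the psutil package is available and the script runs on the node with enough privileges to kill the process):

```python
# Sketch: manually kill FabricDCA.exe on a node so it restarts and resumes cleanup.
# Assumes psutil is installed and the script runs on the node with admin rights.
import psutil

def kill_fabric_dca() -> bool:
    killed = False
    for proc in psutil.process_iter(["name"]):
        if (proc.info["name"] or "").lower() == "fabricdca.exe":
            print(f"Killing FabricDCA.exe (pid {proc.pid})")
            proc.kill()
            killed = True
    return killed

if __name__ == "__main__":
    if kill_fabric_dca():
        print("Done. The process should be restarted automatically and resume "
              "cleaning after a few minutes.")
    else:
        print("FabricDCA.exe not found; is this a Service Fabric node?")
```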
You mentioned that you already tried killing FabricDCA.exe, so the mitigation above may not work for you. In that case, take a look at the Service Fabric cluster manifest directly: it may be that your new configuration appears to be accepted by the ARM template deployment but never reaches the cluster manifest, which is the source of truth here. One way to check that from a node is sketched below.
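This is a minimal sketch that pulls the cluster manifest through the HTTP gateway and looks for the value. It assumes an unsecured gateway on localhost:19080 and the requests package; a secured cluster will also need client certificate or AAD authentication on top of this:

```python
# Sketch: fetch the cluster manifest from the HTTP gateway and check which
# MaxDiskQuotaInMB value actually made it there. The gateway address and the
# unsecured access are assumptions; adjust for your cluster.
import re
import requests

GATEWAY = "http://localhost:19080"  # assumed HTTP gateway endpoint

resp = requests.get(f"{GATEWAY}/$/GetClusterManifest", params={"api-version": "6.0"})
resp.raise_for_status()
manifest_xml = resp.json()["Manifest"]  # the manifest is returned as an XML string

match = re.search(r'Name="MaxDiskQuotaInMB"\s+Value="(\d+)"', manifest_xml)
if match:
    print(f"Cluster manifest says MaxDiskQuotaInMB = {match.group(1)}")
else:
    print("MaxDiskQuotaInMB not found in the manifest (the default would apply).")
```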
Update: A regression introduced as part of the auto-mitigation above caused the AppInstanceData folder to fill up the disk. This is fixed in Service Fabric version 7.0.466.
Upvotes: 3