Reputation: 191
First of all, I've tried all recommendations from C# DNS-related SO threads and other internet articles - messing with ServicePointManager/ServicePoint settings, setting automatic request connection close via HTTP headers, changing connection lease times - nothing helped. It seems like all those settings are intended for fixing DNS issues in long-running processes (like web services). It even makes sense if a process would have it's own DNS cache to minimize DNS queries or OS DNS cache reading. But it's not my case.
Our production infrastructure uses HA (high availability) DNS for swapping server nodes during maintenance or functional problems. And it's built in a way that in some places we have multiple CNAME-records which in fact point to the same HA A-record like that:
The TTL of all these records is 60 seconds. So when the European node is in trouble or maintenance, the A-record switches to the IP address of some other node.
Then we have a monitoring utility which is executed once in 5 minutes and uses both site1 and site2. For it to work properly both names must point to the same DC, because data sync between DCs doesn't happen that fast. Since both CNAMEs are in fact linked to the same A-record with short TTL at a first glance it seems like nothing can go wrong. But it turns out it can.
The utility is written in C# for .NET Framework 4.7.2 and uses HttpClient class for performing requests to both sites. Yeah, it's him again.
We have noticed that when a server node switch occurs the utility often starts acting as if site1 and site2 were in different DCs. The pattern of its behavior in such moments is strictly determined, so it's not like it gets confused somewhere in the middle of the process - it incorrecly resolves one or both of these addresses from the very start.
I've made another much simpler utility which just sends one GET-request to site1 and then started intentionally switching nodes on and off and running this utility to see which DC would serve its request. And the results were very frustrating.
Despite the Windows DNS cache already being updated (checked via ipconfig
and Get-DnsClientCache
cmdlet) and despite the overall records' TTL of 60 seconds the HttpClient keeps sending requests to the old IP address sometimes for another 15-20 minutes. Even when I've completely shut down the "outdated" application server - the utility kept trying to connect to it, so even connection failures don't wake it up.
It becomes even more frustrating if you start running ipconfig /flushdns
in between utility runs. Sometimes after flushdns the utility realizes that the IP has changed. But as soon as you make another flushdns (or this is even not needed - I haven't 100% clearly figured this out) and run the utility again - it goes back to the old address! Unbelievable!
And add even more frustration. If you resolve the IP address from within the same utility using Dns.GetHostEntry method (which uses cache as per this comment) right before calling HttpClient, the resolve result would be correct... But the HttpClient would anyway make a connection to an IP address of seemengly his own independent choice. So HttpClient somehow does not seem to rely on built-in .NET Framework DNS resolving.
So the questions are:
P.S. I have worked this all around by implementing a custom HttpClientHandler which performs DNS queries on each hostname's first usage thus it's independent from external DNS caches (except for caching at intermediate DNS servers which also affects things to some extent). But that was a little tricky in terms of TLS certificates validation and the final solution does not seem to be production ready - but we use it for monitoring only so for us it's OK. If anyone is interested in this, I'll show the class code which somewhat resembles this answer's example.
The utility works from behind a corporate proxy. In fact there are multiple proxies for load balancing. So I am now also in process of verifying this:
The answer to "Does .NET Framework has an OS-independent global DNS cache?" is NO. HttpClient class or .NET Framework in general had nothing to do with all of this. Posted my investigation results as an accepted answer.
Upvotes: 8
Views: 1508
Reputation: 191
HttpClient, please forgive me! It was not your fault!
Well, this investigation was huge. And I'll have to split the answer into two parts since there turned out to be two unconnected problems.
As I said, the utility was being tested from behind a corporate proxy. In case if you haven't known (like I haven't till the latest days) when using a proxy server it's not your machine performing DNS queries - it's the proxy server doing this for you.
I've made some measurements to understand for how long does the utility keep connecting to the wrong DC after the DNS record switch. And the answer was the fantastic exact 30 minutes. This experiment has also clearly shown that local Windows DNS cache has nothing to do with it: those 30 minutes were starting exactly at the point when the proxy server was waking up (was finally starting to send HTTP requests to the right DC).
The exact number of 30 minutes has helped one of our administrators to finally figure out that the proxy servers have a configuration parameter of minimal DNS TTL which is set to 1800 seconds by default. So the proxies have their own DNS cache. These are hardware Cisco proxies and the admin has also noted that this parameter is "hidden quite deeply" and is not even mentioned in the user manual.
As soon as the minimal proxies' DNS TTL was changed from 1800 seconds to 1 second (yeah, admins have no mercy) the issue stopped reproducing on my machine.
Well. As I also said there are several proxies. There is a single corporate proxy DNS name, but if you run nslookup
for it - it shows multiple IPs behind it. Each time the proxy server's IP address is resolved (for example when local cache expires) - there's quite a bit of a chance that you'll jump onto another proxy server.
And that's exactly what ipconfig /flushdns
has been doing to me. As soon as I started playing around with proxy servers using their direct IP addresses instead of their common DNS name I found that different proxies may easily route identical requests to different DCs. That's because some of them have those 30-minutes-cached DNS records while others have to perform resolving.
Unfortunately, after the proxies theory has been proven, another news came in: the production monitoring servers are placed outside of the corporate network and they do not use any proxy servers. So here we go...
The monitoring servers are configured to use 8.8.8.8 and 8.8.4.4 Google's DNS servers. Resolve responses for our short-lived DNS records from these servers are somewhat weird:
Windows is not stupid and according to my experiments it doesn't care about CNAME's TTL and only cares about the root A-record TTL, so its client cache even for CNAME records is never assigned a TTL higher than 60 seconds.
But due to the inconsistency (or in some sense over-consistency?) of the A-record TTL which Google's servers return (unpredictable 0-60 seconds) the Windows local cache gets confused. There were two facts which demonstrated it:
Resolve-DnsName
for site1 and site2 over several minutes with random pauses between them have eventually led to Get-ClientDnsCache
showing the local cache TTLs of the two site names diverged on up to 15 seconds. This is a big enough difference to sometimes mess the things up. And that's just my short experiment, so I'm quite sure that it might actually get bigger.Invoke-WebRequest
to each of the sites one right after another once in every 3-5 seconds while switching the DNS records has let me twicely face a situation when the requests went to different DCs.The latter experiment had one strange detail I can't explain. Calling
Get-DnsClientCache
afterInvoke-WebRequest
shows no records appear in the local cache for the just-requested site names. But anyway the problem clearly has been reproduced.
It would take time to see whether my workaround with real-time DNS resolving would bring any improvement. Unfortunately, I don't believe it will - the DNS servers used at production (which would eventually be used by the monitoring utility for real-time IP resolving) are public Google DNS which are not reliable in my case.
And one thing which is worse than an intermittently failing monitoring utility is that real-world users are also relying on public DNS servers and they definitely do face problems during our maintenance works or significant failures.
So have we learned anything out of all this?
I have no idea. But its quite possible that "yes" is the right answer to all of these questions.
However there is one thing we surely have learned: network hardware manufacturers shall write their documentation better.
Upvotes: 5