Brad

Reputation: 1128

Google Cloud DNS changing Record Resource update time

I am trying to map a Compute Engine VM with an ephemeral IP to a hostname using Google Cloud DNS. The update runs at VM startup, through a shell script like the following:

gcloud dns record-sets transaction start -z=MY_ZONE
# remove the record for the old external ip of the VM
gcloud dns record-sets transaction remove --zone MY_ZONE \
    --name subd.domain.com \
    --type A "1.2.3.4" \
    --ttl 300
# add the record for the new external ip of the VM
gcloud dns record-sets transaction add --zone MY_ZONE \
    --name subd.domain.com \
    --type A "5.6.7.8" \
    --ttl 300
gcloud dns record-sets transaction execute -z=MY_ZONE

After the script is run I can see the records successfully changed in the Cloud DNS UI, with the "A" RR having the new external IP.

The problem is that these changes take a very long time to go live. After the change, resolving the hostname "subd.domain.com" returns NXDOMAIN for a long period, and only after that does it finally resolve to the new IP.

This situation raised two questions for me:

#1 Why does the DNS go through the NXDOMAIN phase? Since the changes run in a single transaction, shouldn't they act as an update rather than a remove followed by a create?

#2 What determines the time for this record update to go live?

Upvotes: 4

Views: 3219

Answers (2)

Colin Hines

Reputation: 56

DNS changes propagate (spread out) in a few different ways. These various ways have evolved as DNS has evolved and can be found in various RFCs (https://www.isc.org/community/rfcs/dns/).

In your example, you are attempting to change the A record (an IP address mapping to a name) for a resource in the "domain.com" DNS zone. The method that you are showing first deletes the old record and then adds the new one using a "transaction". The "transaction" style is unique to Google's Cloud DNS offering, is not part of a documented RFC, and may (or may not) affect how quickly successful resolution completes. (Side note: in my testing, a transaction increments the SOA serial by only 1, even if it includes multiple DNS changes.)
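One way to watch that serial is to ask an authoritative server for the zone's SOA record before and after executing the transaction. This is a minimal sketch, assuming `dig` is installed; `soa_serial` is a hypothetical helper name, and the zone and server in the usage line are placeholders.

```shell
# Print the SOA serial that one authoritative server reports for a zone.
# soa_serial is a hypothetical helper, not part of gcloud or dig.
soa_serial() {
    local zone="$1" ns="$2"
    # With +short, dig prints the SOA fields on one line:
    # mname rname serial refresh retry expire minimum
    dig +short +norecurse SOA "$zone" "@$ns" | awk 'NR == 1 {print $3}'
}
```

Usage would look like `soa_serial domain.com ns-cloud-a1.googledomains.com`, run once before and once after the transaction.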

First, a short refresher on the tech.

There are several servers that function as the "authoritative" source of records for that zone (domain.com). The authoritative servers are the ones listed in the NS record for the zone. These are the servers that ultimately answer requests when someone browses to that name, tries to ping it, etc. How a client actually reaches these servers is highly variable and beyond the scope of this answer (search for "client DNS resolution").

Typically, Google Cloud DNS will give you four authoritative servers for any newly created zone, with names such as:

  • ns-cloud-a1.googledomains.com
  • ns-cloud-a2.googledomains.com
  • ns-cloud-a3.googledomains.com
  • ns-cloud-a4.googledomains.com

DNS name query utilities such as "dig" or "nslookup" can be used to query the status of each of these authoritative servers individually.
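As an illustration, the following sketch asks each authoritative server directly for the A record, bypassing any recursive resolver cache. It assumes `dig` is installed; `query_authoritative` is a hypothetical helper name, and the hostname and server names in the usage line are the placeholders from this thread.

```shell
# Query every listed authoritative server for the A record of a name
# and print what each one currently answers.
# query_authoritative is a hypothetical helper, not an existing tool.
query_authoritative() {
    local name="$1" ns
    shift
    for ns in "$@"; do
        # +short prints only the answer data; +norecurse asks the
        # server for its own data instead of a recursive lookup.
        printf '%s -> %s\n' "$ns" \
            "$(dig +short +norecurse A "$name" "@$ns" | tr '\n' ' ')"
    done
}
```

Usage would look like `query_authoritative subd.domain.com ns-cloud-a1.googledomains.com ns-cloud-a2.googledomains.com`. An empty answer after the arrow means that server returned no A record for the name.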

Now, when you run the "gcloud dns ..." command to delete and/or add a record, it's not using documented DNS methods to transfer records to all the authoritative servers. As best I can tell from my testing, it first updates some central database with your change, which then kicks off a process to update the servers themselves (maybe using DNS-style notifications?). You can see this in the fact that the Google Cloud DNS UI sometimes shows your update before any of the authoritative servers actually serve the changed record.

Next, as the change is either sent and/or pulled by the set of authoritative servers, it appears to take a while to completely become consistent on each server. (Zach Bjornson posted this analysis of the time for DNS record additions using GCP and AWS to converge, along with code to perform the analysis yourself, at http://blog.zachbjornson.com/2018/08/14/dns-propagation.html.)

One of his diagrams, "Time for GCP DNS Servers to Update", shows the time to reach consistency.

Although each of the four servers listed above (ns-cloud-xxx...) has only a single IP address, these are anycast IPs, which means they can live in multiple networks at multiple datacenters. So, although ns-cloud-a1.googledomains.com resolves to 216.239.32.106, that IP might exist on servers in Miami, Tampa, Orlando, Atlanta, and several other locations. When you attempt to communicate with it, the networks that your flow traverses will take you to the closest one (thanks @BGP!). In my testing, I ran dig twenty times a second against each of the published googledomains.com nameservers for a domain running a transaction similar to the one you posted. The change appears to converge slowly: for the first several seconds, every answer is the old IP (the one being deleted). Then the dig requests start to show the change, but perhaps only one request out of every 40 or 50 returns the new IP. Over the next minute or so, the new IP is returned more and more often until 100% of the requests show the new IP.
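A test along those lines can be sketched as a small sampling loop. This is not the exact script I used, just a minimal version, assuming `dig` is installed and that your `sleep` accepts fractional seconds; `sample_answers` and the `NO_ANSWER` marker are hypothetical names.

```shell
# Query one authoritative server repeatedly and tally the answers,
# so you can see what share of responses carry the old or new IP.
# sample_answers is a hypothetical helper, not an existing tool.
sample_answers() {
    local name="$1" ns="$2" samples="${3:-50}" i=0 answer
    while [ "$i" -lt "$samples" ]; do
        answer="$(dig +short +norecurse A "$name" "@$ns" | head -n 1)"
        # An empty answer (for example NXDOMAIN) is tallied explicitly.
        echo "${answer:-NO_ANSWER}"
        sleep 0.05   # roughly twenty queries per second
        i=$((i + 1))
    done | sort | uniq -c | sort -rn
}
```

Usage would look like `sample_answers subd.domain.com ns-cloud-a1.googledomains.com 100`, run right after the transaction executes. Each output line is a count followed by the answer it counts.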

In my limited testing, I never received an NXDOMAIN for the records tested. All of the (several thousand) requests returned either the old or the new IP for the record requested.

At this point, all of the authoritative servers have converged on the new IP address for the A record of "subd.domain.com". DNS resolvers and clients, however, likely keep some sort of local cache to minimize network requests for frequently requested records. Some of them respect the TTL (time-to-live) of the requested record, but some may not (I've not seen a lot of consistency across implementations, IMHO). The TTL is a "suggestion" that those requesting the record may follow to strike a good balance between minimizing network traffic and keeping the record's value accurate. So a client may still have the old record cached, and that entry may need to expire before the client requests the record again. In your example, the TTL of 300 means 300 seconds, or 5 minutes, so you could have up to 5 minutes to wait.
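To see how long a recursive resolver will keep serving its cached copy, you can read the remaining TTL out of its answer. This is a minimal sketch, assuming `dig` is installed and that the resolver you pass (8.8.8.8, Google Public DNS, is used as the default here) is reachable; `cached_ttl` is a hypothetical helper name.

```shell
# Print the remaining TTL and the address a recursive resolver
# currently returns for a name.
# cached_ttl is a hypothetical helper, not an existing tool.
cached_ttl() {
    local name="$1" resolver="${2:-8.8.8.8}"
    # Answer lines look like: name TTL class type data
    dig +noall +answer A "$name" "@$resolver" |
        awk '$4 == "A" {print $2, $5}'
}
```

Usage would look like `cached_ttl subd.domain.com`. If the old IP is printed, the first column is roughly how many seconds that cache entry has left. Large public resolvers run many caches, so repeated queries may show different values.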

Upvotes: 4

John Hanley

Reputation: 81366

I will offer this suggestion based upon decades of DNS experience. Don't treat DNS as your on-demand database. The DNS ecosystem is not designed to support what you are trying to do. Each link in the chain caches DNS entries, and you have no control over this process. In your example your TTL is 300 seconds. It will take at least 5 minutes before the next server above yours expires your entries. A lot of caches ignore your TTL and set it to hours or sometimes even days. You need to design your DNS setup to be "eventually consistent" and not "instantly consistent". Eventually means hours or days.

When planning for DNS changes, I plan for a minimum of 48 hours for the change to take effect. This means that we maintain services on the old DNS entry while the new DNS entry takes effect.

Upvotes: 3
