Reputation: 71
Single-Region Spanner is advertised with a 99.99% availability SLA. In the US-based configuration, there will be exactly three replicas per node, all in Council Bluffs, Iowa. Can you share information that breaks down why the 99.99% figure (roughly 53 minutes of downtime per year) is believable, especially in the case of geographically local disasters? I assume that Google has done a thorough analysis, or else it would not advertise the SLA, but I cannot find a detailed paper.
In the event of a regional failure, what recovery procedures will Google carry out and with what recovery time / expected data loss?
(I understand that multi-region may be available, and I have seen some pricing data, but that is out of scope for this question.)
Upvotes: 3
Views: 268
Reputation: 394
Spanner automatically replicates data for high availability. As you stated, regional instances have three full copies of the data. The key is that they are replicated across three zones within the region, each with independent power, cooling, networking, etc. Zones generally fail independently of each other, so your other replicas can continue serving reads and writes even if one zone goes down. Multi-region provides even greater availability by replicating across regions.
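To make the arithmetic behind this concrete, here is a rough back-of-the-envelope sketch, not Google's actual analysis. It assumes that zones fail fully independently and that a majority of the three replicas (2 of 3) is enough to keep serving. The per-zone availability of 99.9% is a made-up illustrative number, not a published figure, and the function names are hypothetical.

```python
from math import comb

# Average year length in minutes (365.25 days).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget per year implied by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def quorum_availability(zone_availability: float, replicas: int = 3) -> float:
    """Probability that a majority of replicas is up.

    Assumption: each zone is up independently with the same probability.
    Real failures can be correlated (for example a region-wide event),
    which this simple model ignores.
    """
    p = zone_availability
    quorum = replicas // 2 + 1  # 2 of 3 for a regional instance
    return sum(
        comb(replicas, k) * p**k * (1.0 - p) ** (replicas - k)
        for k in range(quorum, replicas + 1)
    )


if __name__ == "__main__":
    # Downtime budget implied by a 99.99% SLA.
    print(f"{downtime_minutes_per_year(0.9999):.1f} minutes/year")

    # Hypothetical per-zone availability of 99.9% (illustrative only).
    combined = quorum_availability(0.999)
    print(f"majority-of-3 availability: {combined:.7f}")
    print(f"{downtime_minutes_per_year(combined):.1f} minutes/year")
```

Under these assumptions, three independent zones that are each individually worse than 99.99% can still combine to clear that bar comfortably. The model says nothing about correlated, region-wide failures, which is the case the question is really asking about.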
Zonal failures are very rare and would be transparent to your application; Cloud Spanner automatically reroutes requests to replicas that are able to serve the request. It would be even rarer for a region to go down with data loss. Google takes many measures against disasters.
Further out we will expose managed backups, but these would still be stored within Google data centers. We're also working on a Dataflow connector to help you import/export data should you want to manage your own backups.
Upvotes: 1