Why does one need to write fault tolerant applications when building on cloud infrastructure?

Question

I got this interview question today: 'Why do you need to write fault tolerant applications when building on cloud infrastructure?'
I answered: They are hard to debug and hard to fix, so they had better be very well tested and robust. Data in the database can change between subsequent reads (there is no state server), and there are just too many things that can fail in between, so one has to 'prepare' for the unexpected.

Have I answered it correctly and did I miss anything?

Mick · Accepted Answer

I don't think it was a particularly good question.

They were possibly thinking of large-scale cloud-based systems, which have many separate components, often running on separate hardware. You would not want a task running across, for example, 1000 servers to stop simply because of a software fault or hardware failure on one of them.
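As a rough illustration of what 'not stopping' could look like, here is a minimal Python sketch. Everything in it is hypothetical: `run_on` stands in for a remote call, `ServerDown` is an invented exception, and the `broken` set simulates failed machines. It only shows the idea of moving work away from a failed server; it is not how any particular cloud platform does it.

```python
class ServerDown(Exception):
    """Raised by the hypothetical worker call when its server has failed."""


def run_on(server, item, broken):
    # Stand-in for a remote call; `broken` is the set of failed servers.
    if server in broken:
        raise ServerDown(server)
    return item * item


def run_job(items, servers, broken=frozenset()):
    healthy = list(servers)
    results = {}
    for i, item in enumerate(items):
        while healthy:
            server = healthy[i % len(healthy)]
            try:
                results[item] = run_on(server, item, broken)
                break
            except ServerDown:
                # Stop sending work to the failed server and retry elsewhere.
                healthy.remove(server)
        else:
            # Reached only when every server has failed.
            raise RuntimeError("no healthy servers left")
    return results
```

With three servers and one of them broken, `run_job(range(6), ["s1", "s2", "s3"], broken={"s2"})` still returns a result for every item; the job only fails if every server is down.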

The greater the number of servers, the greater the statistical chance that at least one of them fails during any given 'run'. They were possibly trying to tease this out and get you to say that the overall system should 'tolerate' the failure of one or more individual servers.
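To put a number on that, here is a small Python sketch. It assumes each server fails independently with the same probability during a run, which real hardware only approximates, and the 0.1% figure is made up for illustration. Under that assumption the chance of at least one failure among n servers is 1 - (1 - p)^n.

```python
def p_any_failure(p_single, n_servers):
    """Chance that at least one of n independent servers fails in a run."""
    return 1.0 - (1.0 - p_single) ** n_servers


# Illustrative (assumed) 0.1% per-server failure chance per run.
for n in (1, 10, 100, 1000):
    print(n, round(p_any_failure(0.001, n), 3))
```

With those assumed numbers a single server almost never fails during a run, but across 1000 servers at least one failure is more likely than not (roughly 63%).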

However, there are many small non-critical websites/blogs/web apps which also run on cloud infrastructure and which may not require, or justify the expense of, a focus on fault tolerance.

Similarly, there are plenty of non-cloud applications which should have fault tolerance as a key part of the design, for example a pacemaker or a car airbag controller, even though they are not running in the cloud (I hope...).
