Reputation: 579
Here are two systems, A and B. How do I calculate the expected downtime of each?
System A has 10 physical nodes; if any one of those nodes fails, the whole system goes down. The probability of failure for an individual node is 1% per month, and each failure takes 6 hours to fix. What is the downtime of the whole system per year?
For A, should it be 0.01 * 10 * 6 * 12 = 7.2 hours/year?
System B has 10 physical nodes; as long as 9 out of 10 nodes are running, the whole system functions normally. The probability of failure for an individual node is 1% per month, and each failure takes 6 hours to fix. What is the downtime of the whole system per year?
Upvotes: 3
Views: 654
Reputation: 4138
A failure rate of 1% per month per node corresponds to a 0.00138889% probability of failing in any given hour (assuming a 30-day month of 720 hours). I used the binomial distribution in Excel to model the probability of N node failures per year, with 8760 h/y * 10 nodes = 87600 "trials". I got these results:
0 failures: 29.62134067 %
1 failure: 36.03979837 %
2 failures: 21.92426490 %
3 failures: 8.89142792 %
4 failures: 2.70442094 %
5 failures: 0.65805485 %
6 failures: 0.13343314 %
...and so forth
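The table above can be reproduced without Excel. This is a minimal sketch, assuming a 30-day month (720 hours) for the per-hour probability; the names `P_HOUR`, `TRIALS` and `binom_pmf` are made up for illustration.

```python
from math import comb

HOURS_PER_MONTH = 30 * 24        # assumption: 30-day month
P_HOUR = 0.01 / HOURS_PER_MONTH  # per-node failure probability in any given hour
TRIALS = 8760 * 10               # node-hours per year for 10 nodes


def binom_pmf(k, n=TRIALS, p=P_HOUR):
    """Probability of exactly k failures in n independent node-hours."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


for k in range(7):
    print(f"{k} failures: {100 * binom_pmf(k):.8f} %")
```

With this many trials and such a small per-trial probability, the result is very close to a Poisson distribution with a mean of about 1.22 failures per year.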
N failures would cause 6N hours of downtime (assuming they are independent). For each 6N hours of single-node downtime, the probability that none of the other 9 nodes fails is (100% - 0.00138889%) ^ (9 * 6N).
Thus the expected two-node downtime is P(1 node down) * (1 - P(no other node down)) * 6 hours / 2 (divided by two because, on average, the second failure occurs at the midpoint of the other node's repair). Summed over all numbers of failures N, I got an expected two-node downtime of 9.8 seconds/year. I have no idea how accurate this estimate is, but it should give a rough idea. Quite a brute-force solution :/
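The summation can be sketched in a few lines. This follows one reading of the formula above: for each failure count N, weight the chance that another node fails during the 9 * 6N exposed node-hours by the binomial probability of N failures, and charge 3 hours of overlap. The 30-day month and all function and constant names are assumptions for illustration.

```python
from math import comb

P_HOUR = 0.01 / (30 * 24)  # assumption: 30-day month
TRIALS = 8760 * 10         # node-hours per year for 10 nodes
REPAIR_HOURS = 6
OTHER_NODES = 9


def binom_pmf(k, n=TRIALS, p=P_HOUR):
    """Probability of exactly k failures in n independent node-hours."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def expected_overlap_hours(max_failures=30):
    """Expected hours per year with two nodes down at once."""
    total = 0.0
    for n_fail in range(1, max_failures + 1):
        exposure_hours = OTHER_NODES * REPAIR_HOURS * n_fail
        p_second = 1 - (1 - P_HOUR) ** exposure_hours
        # Half the repair window overlaps on average.
        total += binom_pmf(n_fail) * p_second * REPAIR_HOURS / 2
    return total


overlap_seconds = expected_overlap_hours() * 3600
print(f"expected two-node downtime: {overlap_seconds:.2f} s/year")
```

Under these assumptions the sum comes out just under 10 seconds per year, in line with the figure quoted above.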
Upvotes: 0
Reputation: 1426
We are talking about expected downtimes here, so we'll have to take a probabilistic approach.
We can take a Poisson approach to this problem. The expected failure rate is 1% per month for a single node, or 120% (1.2) for 10 nodes in 12 months. So you are correct that 1.2 failures/year * 6 hours/failure = 7.2 hours/year for the expected value of A.
You can figure out how likely a given amount of downtime is by using 7.2 as the lambda value for the Poisson distribution.
Using R: ppois(6, lambda=7.2) = 0.42, meaning there is a 42% chance that you will have 6 hours of downtime or less in a year.
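For readers without R, the same numbers can be checked with a hand-rolled Poisson CDF. This is a minimal sketch; `poisson_pmf` and `poisson_cdf` are illustrative helpers, not library functions. The last line is an alternative reading added here as an assumption, not part of the answer: it counts failures (lambda = 1.2) and treats downtime as 6 hours per failure.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return lam**k * exp(-lam) / factorial(k)


def poisson_cdf(k, lam):
    """P(X <= k), the quantity R's ppois(k, lambda) returns."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


expected_failures = 0.01 * 10 * 12               # 1.2 failures/year
expected_downtime_hours = expected_failures * 6  # 7.2 hours/year

# Mirrors ppois(6, lambda=7.2) from the answer.
p_at_most_6 = poisson_cdf(6, expected_downtime_hours)

# Assumption (alternative model): downtime arrives in 6-hour chunks,
# so "at most 6 hours" means "at most one failure".
p_at_most_one_failure = poisson_cdf(1, expected_failures)

print(round(p_at_most_6, 2), round(p_at_most_one_failure, 2))
```

The two probabilities differ because the first treats hours of downtime as the Poisson count, while the second treats failures as the count. Which is appropriate depends on how the downtime is modelled.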
For B, it is also a Poisson process, but what matters is the probability that a second node will fail in the six hours after the first failure.
The failure rate (assuming a 30-day month, which has 120 six-hour periods) is 0.0083% per six-hour period per node.
So we look at the chance of two failures within the same six-hour period, times the number of six-hour periods in a year.
Using R: dpois(2.0, lambda=(0.01/120)) * 365 * 4 = 0.000005069
0.000005069 * 3 expected hours/failure ≈ 0.0000152 hours, or 54.75 milliseconds of expected downtime per year. (3 expected hours per failure because the second failure should occur, on average, halfway through the repair of the first.)
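The arithmetic for B can be checked step by step. This is a minimal sketch that copies the answer's inputs as-is (lambda of 0.01/120 per six-hour period, 365 * 4 periods per year, 3 hours of overlap); the helper and variable names are illustrative.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam (R's dpois)."""
    return lam**k * exp(-lam) / factorial(k)


lam_per_period = 0.01 / 120  # failure rate per six-hour period
periods_per_year = 365 * 4   # six-hour periods in a year
overlap_hours = 3            # average overlap of the two repairs

# Mirrors dpois(2, lambda=0.01/120) * 365 * 4 from the answer.
double_failures_per_year = poisson_pmf(2, lam_per_period) * periods_per_year

downtime_hours = double_failures_per_year * overlap_hours
downtime_ms = downtime_hours * 3600 * 1000

print(f"{double_failures_per_year:.9f} double failures/year")
print(f"{downtime_ms:.2f} ms expected downtime/year")
```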
Upvotes: 4