Orup

Reputation: 579

How to calculate network system downtime

Here are two systems, A and B. How do I calculate the expected downtime of each?

System A has 10 physical nodes. If any one of those nodes fails, the whole system goes down. The probability of failure for an individual node is 1% per month, and each failure takes 6 hours to fix. What is the downtime of the whole system per year?

For A, should it be 0.01 * 10 * 6 * 12 = 7.2 hours/year?

System B has 10 physical nodes. As long as 9 out of 10 nodes are running, the whole system functions normally. The probability of failure for an individual node is 1% per month, and each failure takes 6 hours to fix. What is the downtime of the whole system per year?

Upvotes: 3

Views: 654

Answers (2)

NikoNyrh

Reputation: 4138

A failure rate of 1% per month per node corresponds to a probability of 0.00138889% of failing in any given hour (1% spread over the 720 hours of a 30-day month). I used the binomial distribution in Excel to model the probability of N node failures when there are 8760 h/y * 10 nodes = 87600 "trials". I got these results:

0 failures:   29.62134067 %
1 failure:    36.03979837 %
2 failures:   21.92426490 %
3 failures:    8.89142792 %
4 failures:    2.70442094 %
5 failures:    0.65805485 %
6 failures:    0.13343314 %
...and so forth
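
For anyone without Excel, the table above can be reproduced with a few lines of Python. This is only a sketch of the same binomial model; the variable names are my own, and the 30-day month (720 hours) is the assumption behind the 0.00138889% hourly figure.

```python
from math import comb

HOURS_PER_MONTH = 30 * 24            # assumption: 30-day month
p_hour = 0.01 / HOURS_PER_MONTH      # per-node failure probability per hour
trials = 8760 * 10                   # node-hours per year for 10 nodes


def binomial_pmf(k, n, p):
    """Probability of exactly k failures in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of 0..6 node failures in one year
failure_probs = [binomial_pmf(k, trials, p_hour) for k in range(7)]

for k, prob in enumerate(failure_probs):
    print(f"{k} failures: {prob * 100:.8f} %")
```

The printed percentages should agree with the Excel figures to within rounding.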

N failures would cause 6N hours of downtime (assuming they are independent). For each 6N hours of single-node downtime, the probability that none of the other 9 nodes fails is (100% - 0.00138889%) ^ (9 * 6N).

Thus the expected two-node downtime is P(1 node down) * (1 - P(no other node down)) * 6 hours / 2 (divided by two because, on average, the second failure occurs at the midpoint of the other node's repair). Summed over all numbers of failures N, I got an expected two-node downtime of 9.8 seconds / year. I have no idea how accurate this estimate is, but it should give a rough idea. Quite a brute-force solution :/
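
Here is a rough Python sketch of that sum, under one reading of the formula: for each N, weight P(N failures) by the chance that at least one of the other 9 nodes fails during the 6N repair hours, and count 3 hours of overlap. The names and the cut-off at 30 failures are my own assumptions, not taken from the spreadsheet.

```python
from math import comb

p_hour = 0.01 / (30 * 24)    # per-node hourly failure probability (30-day month assumed)
trials = 8760 * 10           # node-hours per year for 10 nodes
REPAIR_HOURS = 6
OTHER_NODES = 9


def binomial_pmf(k, n, p):
    """Probability of exactly k failures in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


expected_hours = 0.0
for n_failures in range(1, 31):          # terms beyond 30 failures are negligible
    p_n = binomial_pmf(n_failures, trials, p_hour)
    # node-hours during which another node could fail while repairs are ongoing
    exposure = OTHER_NODES * REPAIR_HOURS * n_failures
    p_overlap = 1 - (1 - p_hour) ** exposure
    # on average the second failure lands halfway through the repair
    expected_hours += p_n * p_overlap * REPAIR_HOURS / 2

expected_seconds = expected_hours * 3600
print(f"expected two-node downtime: {expected_seconds:.2f} s/year")
```

Under these assumptions the result lands close to the 9.8 seconds / year quoted above.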


Upvotes: 0

Chris Meyers

Reputation: 1426

We are talking about expected downtimes here, so we'll have to take a probabilistic approach.

We can take a Poisson approach to this problem. The expected failure rate is 1% per month for a single node, which gives 0.01 * 10 * 12 = 1.2 expected failures per year across 10 nodes. So you are correct that 1.2 failures/year * 6 hours/failure = 7.2 hours/year for the expected value of A.

You can figure out how likely a given amount of downtime is by using 7.2 as the lambda value for the Poisson distribution.

Using R: ppois(6, lambda=7.2) = 0.42, meaning there is a 42% chance that you will have 6 hours of downtime or less in a year (ppois is the cumulative probability, so it includes the value 6 itself).
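
If you don't have R at hand, the same numbers can be checked in plain Python. This is a minimal sketch; poisson_cdf is a helper I wrote for illustration, intended to compute the same cumulative probability as R's ppois(k, lambda).

```python
from math import exp, factorial


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(exp(-lam) * lam**i / factorial(i) for i in range(k + 1))


# Expected value for system A: 1% per node-month, 10 nodes, 12 months, 6 h per failure
expected_failures = 0.01 * 10 * 12
expected_downtime_a = expected_failures * 6      # about 7.2 hours/year

# Same call as ppois(6, lambda=7.2)
p_at_most_6 = poisson_cdf(6, expected_downtime_a)

print(f"expected downtime A: {expected_downtime_a:.1f} h/year")
print(f"P(X <= 6): {p_at_most_6:.2f}")
```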

For B, it's also a Poisson, but what's important is the probability that a second node will fail in the six hours after the first failure.

The failure rate (assuming a 30-day month, which has 120 six-hour periods) is 0.01 / 120, or about 0.0083%, per six-hour period per node.

So we look at the chances of two failures within six hours, times the number of six hour periods in a year.

Using R: dpois(2.0, lambda=(0.01/120)) * 365 * 4 = 0.000005069

0.000005069 * 3 expected hours/failure ≈ 0.0000152 hours, which is about 54.75 milliseconds of expected downtime per year. (3 expected hours per failure because the second failure should occur, on average, halfway through the repair of the first.)
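
The arithmetic for B can be reproduced in Python as a sanity check. Again this is just a sketch of the calculation as written above; poisson_pmf is a helper of my own, intended to play the role of R's dpois.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X == k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)


periods_per_month = 30 * 4           # six-hour periods in a 30-day month
lam = 0.01 / periods_per_month       # failure rate per six-hour period
periods_per_year = 365 * 4

# Chance of two failures in the same six-hour period, scaled to one year
p_double = poisson_pmf(2, lam) * periods_per_year

# 3 hours of overlap on average, converted to milliseconds
expected_ms = p_double * 3 * 3600 * 1000

print(f"double-failure factor: {p_double:.9f}")
print(f"expected downtime B: {expected_ms:.2f} ms/year")
```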

Upvotes: 4
