Python statistics module returns different standard deviation than calculated

Question

I have a list of numbers that I would like to calculate the standard deviation of. I calculated the value using two different methods: 1. using the Python statistics module and 2. using the formula for standard deviation. The result is two different, but somewhat close numbers. Is there something different about how the statistics module calculates standard deviation or is this something to do with my coded calculation? I am also unaware how math.sqrt() works internally, but I assume it uses some type of approximation.

import statistics
import math

def computeSD_S(variable):
    # Open the file and read the values in the column specified
    var_list = openAndReadVariable(variable)
    # Try to compute the standard deviation using the statistics module and print an error if a string is used as input
    try:
        st_dev = statistics.stdev(var_list)
        return st_dev
    except TypeError:
        return 'Variable values must be numerical.'

def computeSD_H(variable):
    # Open the file and read the values in the column specified
    var_list = openAndReadVariable(variable)
    sum = 0
    # Try to compute the standard deviation using this formula and print an error if a string is used as input
    try:
        # Find the mean
        mean = statistics.mean(var_list)
        # Sum the squared differences
        for obs in var_list:
            sum += (obs-mean)**2
        # Take the square root of the sum divided by the number of observations
        st_dev = math.sqrt(sum/len(var_list))
        return st_dev
    except TypeError:
        return 'Variable values must be numerical.'

variable = 'Total Volume'
st_dev = computeSD_S(variable)
print('Standard Deviation', st_dev)
st_dev = computeSD_H(variable)
print('Standard Deviation', st_dev)

Resulting Output:

Standard Deviation 3453545.3553994712
Standard Deviation 3453450.731237387

In addition to computing the mean using the statistics module, I also computed the mean by hand and received the same results.

Mantas Kandratavičius · Accepted Answer

There is the what and the why:

The what: your own algorithm divides by the number of elements in your array (N) instead of the number of elements minus one (N-1).

Now why should you divide by N-1 and not N?

This post seems to have a very good explanation and you can find a lot more resources explaining why the formula for standard deviation divides by N-1 instead of N.
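
As a rough summary of the usual argument (written in standard textbook notation, not taken from the question): the mean is estimated from the same data, so the squared deviations from it come out slightly too small on average, and dividing by N-1 (Bessel's correction) compensates for that. The two formulas differ only in the divisor:

```latex
% Sample standard deviation (what statistics.stdev returns)
s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}

% Population standard deviation (what the hand-written function computes)
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
```

For the same data the two are related by s = σ · sqrt(N / (N-1)), which is why the results get closer as N grows.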

If we peek at the standard deviation documentation we can see:

statistics.stdev(data, xbar=None)

Return the sample standard deviation (the square root of the sample variance).

stdev calculates the sample standard deviation, i.e. it divides by N-1. Solution 1 is to make your function match stdev by dividing by len(var_list) - 1 instead of len(var_list).
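
A minimal sketch of Solution 1, assuming the values are already in a plain list. The function name sample_sd and the data are made up for illustration, and the file-reading helper from the question is left out:

```python
import math
import statistics


def sample_sd(values):
    """Sample standard deviation: divide by N - 1 instead of N."""
    n = len(values)
    if n < 2:
        raise ValueError("sample standard deviation needs at least two values")
    mean = statistics.mean(values)
    # Sum the squared differences from the mean
    squared_diffs = sum((x - mean) ** 2 for x in values)
    # Divide by N - 1 (not N) before taking the square root
    return math.sqrt(squared_diffs / (n - 1))


# Hypothetical illustrative values, not the data from the question
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(sample_sd(data))
print(statistics.stdev(data))
```

With this divisor the two printed values should agree up to floating-point rounding.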

Solution 2 is to replace stdev with pstdev:

statistics.pstdev(data, mu=None)

Return the population standard deviation (the square root of the population variance).

pstdev calculates the population standard deviation, or in other words, the same thing that your current function calculates.
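
A quick way to check this yourself, sketched with made-up illustrative data rather than the values from the question: the divide-by-N formula should match pstdev, and rescaling by sqrt(N / (N-1)) should recover stdev.

```python
import math
import statistics

# Hypothetical illustrative values, not the data from the question
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(data)
mean = statistics.mean(data)

# Same formula as the hand-written function in the question: divide by N
by_hand = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

population_sd = statistics.pstdev(data)
sample_sd_value = statistics.stdev(data)

print(by_hand, population_sd)  # these two should agree
print(population_sd * math.sqrt(n / (n - 1)), sample_sd_value)  # and so should these
```

The gap between the two results therefore comes from the choice of divisor; it is not explained by any approximation inside math.sqrt.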
