Tiffany Morris

Reputation: 323

Python Standard Deviation Check

I wrote some Python code to calculate the standard deviation of a list of numbers. When I checked my answer against Excel, the two values did not match. I'm not sure whether I missed a step, so if anyone has a moment to review the code and spot an error, please let me know. Thank you.

city_population = [2123,1284,7031,30788,147,2217,10000]

mean = sum(city_population,0.0)/len(city_population)

def stdev(city_population):
    length = len(city_population)
    total_sum = 0
    for i in range(length):
        total_sum += pow((city_population[i]-mean),2)
        result = (total_sum/(length-1))
        return sqrt(result)
stan_dev = stdev(city_population)
print "The standard deviation is",(stan_dev)

output: The standard deviation is 9443.71609738

excel: 9986.83890663

Upvotes: 0

Views: 1734

Answers (3)

Landry Houston

Reputation: 1

Consider shortening your function for easier readability!

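def standard_dev(nums):
    # Population standard deviation (divides by len(nums), not len(nums) - 1).
    mean = sum(nums, 0.0) / len(nums)  # float start value avoids integer division in Python 2
    return (sum((num - mean) ** 2 for num in nums) / len(nums)) ** 0.5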

Upvotes: 0

LordSputnik

Reputation: 735

Your problem is mostly due to the code within your loop for calculating the total sum. In this loop, you're also calculating the result at each iteration, and then returning from the function. This means that only one iteration of the loop runs.
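
A minimal sketch of that effect, using hypothetical function names (`total_buggy`, `total_fixed`) that are not in the question: a `return` indented inside the `for` body exits the function on the first pass, while a `return` placed after the loop runs only once every value has been processed.

```python
def total_buggy(values):
    total = 0
    for v in values:
        total += v
        return total  # inside the loop body: exits during the first iteration

def total_fixed(values):
    total = 0
    for v in values:
        total += v
    return total  # after the loop: runs once all values have been added

print(total_buggy([1, 2, 3]))  # 1
print(total_fixed([1, 2, 3]))  # 6
```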

When running your code, I get the result 2258.72114877, which is calculated from the first value only. By changing the code to the following, the correct sample standard deviation is produced:

from math import sqrt

city_population = [2123,1284,7031,30788,147,2217,10000]

mean = sum(city_population,0.0)/len(city_population)

def stdev(city_population):
    length = len(city_population)
    total_sum = 0
    for i in range(length):
        total_sum += pow((city_population[i]-mean),2)
    # total_sum is 698158659.4285713
    result = (total_sum/(length-1))
    # result is 116359776.57142855
    # sqrt(result) is 10787.01889177119
    return sqrt(result)

stan_dev = stdev(city_population)
print "The standard deviation is",(stan_dev)

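This new result differs from the Excel value because Excel is returning the population standard deviation, which divides by the number of values (7) rather than by one less (6). In Excel, STDEV.P gives the population value and STDEV.S gives the sample value. As a quick reference, the following page may be useful to you: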

https://statistics.laerd.com/statistical-guides/measures-of-spread-standard-deviation.php
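
To see both conventions side by side, here is a small check using Python's standard `statistics` module. It is available from Python 3.4 onward, so this sketch assumes Python 3 rather than the Python 2 used in the question. `pstdev` divides by n and `stdev` divides by n - 1.

```python
import statistics

city_population = [2123, 1284, 7031, 30788, 147, 2217, 10000]

population_sd = statistics.pstdev(city_population)  # divides by n
sample_sd = statistics.stdev(city_population)       # divides by n - 1

print(round(population_sd, 2))  # 9986.84, the value reported from Excel
print(round(sample_sd, 2))      # 10787.02, the sample standard deviation
```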

If there's no requirement for the code to be written from scratch, I'd recommend using NumPy to avoid reinventing the wheel here: http://www.numpy.org/ . With this, your code becomes:

import numpy
city_population = [2123,1284,7031,30788,147,2217,10000]
print(numpy.std(city_population, ddof=1))  # ddof=1 gives the sample standard deviation

A couple of additional tips: to avoid future confusion and potential issues, try to avoid naming function parameters the same as global variables. And try not to rely on previously set variables within a function (as you do with "mean" here).
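
As a rough sketch of those two tips, the function below takes its data through a neutrally named parameter and computes the mean internally. The name `std_dev` and the `ddof` parameter (borrowed from NumPy's naming) are illustrative choices, not part of the original code.

```python
from math import sqrt

def std_dev(values, ddof=1):
    """Standard deviation of values; ddof=1 is the sample form, ddof=0 the population form."""
    n = len(values)
    mean = sum(values, 0.0) / n  # computed locally, so no global state is needed
    squared_diffs = sum((x - mean) ** 2 for x in values)
    return sqrt(squared_diffs / (n - ddof))

city_population = [2123, 1284, 7031, 30788, 147, 2217, 10000]
print(std_dev(city_population))          # sample standard deviation
print(std_dev(city_population, ddof=0))  # population standard deviation
```

Because the parameter is called `values` rather than `city_population`, it is clear inside the function which data is being used.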

Upvotes: 4

patapouf_ai

Reputation: 18693

The problem is that you have the return inside the loop!

The following should work:

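from math import sqrt

def stdev(city_population):
    length = len(city_population)
    mean = sum(city_population, 0.0) / length  # computed here so the function is self-contained
    total_sum = 0
    for i in range(length):
        total_sum += pow((city_population[i]-mean),2)
    result = (total_sum/(length))  # divide by length for the population standard deviation
    return sqrt(result)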

Note that for the population standard deviation, you divide by length, not length-1 (dividing by length-1 applies when your data is a sample rather than the entire population).
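
Written out, the two conventions differ only in the denominator. With N values x_i and mean \bar{x}, the population form \sigma and the sample form s are:

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i-\bar{x}\right)^2},
\qquad
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i-\bar{x}\right)^2}
```

For the list in the question, N = 7 and the sum of squared deviations is about 698158659.43, so \sigma \approx 9986.84 (the Excel value) and s \approx 10787.02.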

Upvotes: 2
