Reputation: 323
I have written some Python code to calculate the standard deviation of a list of numbers. I checked my answer against Excel and it appears to be off. I'm not sure whether I missed a step, so if anyone has a moment to review the code and spot an error, please let me know. Thank you.
from math import sqrt

city_population = [2123,1284,7031,30788,147,2217,10000]
mean = sum(city_population,0.0)/len(city_population)

def stdev(city_population):
    length = len(city_population)
    total_sum = 0
    for i in range(length):
        total_sum += pow((city_population[i]-mean),2)
        result = (total_sum/(length-1))
        return sqrt(result)

stan_dev = stdev(city_population)
print "The standard deviation is",(stan_dev)
output:
The standard deviation is 9443.71609738
Excel: 9986.83890663
Upvotes: 0
Views: 1734
Reputation: 1
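Consider shortening your function for easier readability! Note that this version divides by len(nums), so it returns the population standard deviation:
def standard_dev(nums):
    mean = sum(nums, 0.0) / len(nums)  # float mean, computed once
    return (sum((num - mean) ** 2 for num in nums) / len(nums)) ** 0.5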
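One thing to watch in a one-liner like this: the print statement in the question suggests Python 2, where an exponent written as (1 / 2) is integer division and evaluates to 0. Here is a minimal sketch of that pitfall, using 1 // 2 so that it behaves the same way on Python 2 and 3 (the variable names are made up for illustration):

```python
nums = [2123, 1284, 7031, 30788, 147, 2217, 10000]

# Under Python 2, 1 / 2 on two ints behaves like 1 // 2, i.e. it is 0.
py2_style_exponent = 1 // 2
always_one = 99736951.0 ** py2_style_exponent  # anything ** 0 is 1.0

# A float exponent and a float mean behave the same on Python 2 and 3.
mean = sum(nums, 0.0) / len(nums)
variance = sum((n - mean) ** 2 for n in nums) / len(nums)
population_sd = variance ** 0.5

print("exponent 0 gives %s, exponent 0.5 gives %s" % (always_one, population_sd))
```

Writing the exponent as 0.5 and seeding the sum with 0.0 sidesteps the integer division in both places.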
Upvotes: 0
Reputation: 735
Your problem is mostly due to the code within your loop for calculating the total sum. In this loop, you're also calculating the result at each iteration, and then returning from the function. This means that only one iteration of the loop runs.
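To make the effect concrete, here is a minimal sketch, assuming the question's code was indented with its last two function lines inside the loop. The name stdev_early_return is made up for illustration, and the print call is written so that it runs on both Python 2 and 3:

```python
from math import sqrt

city_population = [2123, 1284, 7031, 30788, 147, 2217, 10000]
mean = sum(city_population, 0.0) / len(city_population)

def stdev_early_return(values):
    # Hypothetical reconstruction of the bug: the return sits inside the loop.
    length = len(values)
    total_sum = 0
    for i in range(length):
        total_sum += pow(values[i] - mean, 2)
        result = total_sum / (length - 1)
        return sqrt(result)  # exits during the first iteration

first_only = stdev_early_return(city_population)
print("Early return gives %s" % first_only)
```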
When running your code, I get the result 2258.72114877, which is calculated from the first value only. By changing the code to the following, the correct sample standard deviation is produced:
from math import sqrt

city_population = [2123,1284,7031,30788,147,2217,10000]
mean = sum(city_population,0.0)/len(city_population)

def stdev(city_population):
    length = len(city_population)
    total_sum = 0
    for i in range(length):
        total_sum += pow((city_population[i]-mean),2)
    # total_sum is 698158659.4285713
    result = (total_sum/(length-1))
    # result is 116359776.57142855
    # sqrt(result) is 10787.01889177119
    return sqrt(result)

stan_dev = stdev(city_population)
print "The standard deviation is",(stan_dev)
The reason this new result differs from the Excel value is that Excel is returning the population standard deviation, which divides by length rather than length-1. As a quick reference, the following page may be useful to you:
https://statistics.laerd.com/statistical-guides/measures-of-spread-standard-deviation.php
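If you happen to be on Python 3.4 or later (an assumption, since the print statement in the question suggests Python 2), the standard library's statistics module provides both versions, which makes the comparison easy to check. A minimal sketch:

```python
import statistics

city_population = [2123, 1284, 7031, 30788, 147, 2217, 10000]

# stdev divides by n - 1 (sample); pstdev divides by n (population).
sample_sd = statistics.stdev(city_population)
population_sd = statistics.pstdev(city_population)

print("sample:     %.8f" % sample_sd)
print("population: %.8f" % population_sd)
```

The population figure should agree with the Excel value quoted in the question, and the sample figure with the value computed when the loop runs to completion.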
If there's no requirement for the code to be written from scratch, I'd recommend using NumPy to avoid reinventing the wheel here: http://www.numpy.org/ . With this, your code becomes:
import numpy
city_population = [2123,1284,7031,30788,147,2217,10000]
numpy.std(city_population, ddof=1)
A couple of additional tips: to avoid future confusion and potential issues, try to avoid naming function parameters the same as global variables. And try not to rely on previously set variables within a function (as you do with "mean" here).
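As a rough illustration of both tips, here is one way the function could look if it computed the mean itself and used a parameter name that does not shadow the global list. This is only a sketch: the ddof parameter is my own addition, borrowed from the numpy.std argument of the same name, and the variable names are made up.

```python
from math import sqrt

def stdev(values, ddof=1):
    """Standard deviation of values; ddof=1 for a sample, ddof=0 for a population."""
    n = len(values)
    mean = sum(values, 0.0) / n  # computed locally, no global needed
    squared_diffs = sum((x - mean) ** 2 for x in values)
    return sqrt(squared_diffs / (n - ddof))

populations = [2123, 1284, 7031, 30788, 147, 2217, 10000]
sample_sd = stdev(populations)              # divides by n - 1
population_sd = stdev(populations, ddof=0)  # divides by n
print("sample: %s, population: %s" % (sample_sd, population_sd))
```

Because nothing inside the function depends on outside state, it can be called on any list without first setting up a matching mean.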
Upvotes: 4
Reputation: 18693
The problem is that you have the return inside the loop!
The following should work:
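from math import sqrt

def stdev(city_population):
    length = len(city_population)
    mean = sum(city_population, 0.0) / length  # computed here rather than read from a global
    total_sum = 0
    for i in range(length):
        total_sum += pow((city_population[i]-mean),2)
    # these two lines now run once, after the loop has finished
    result = (total_sum/(length))
    return sqrt(result)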
Also note that to match Excel's figure, which is the population standard deviation, you need to divide by length, not length-1 (length-1 is for when your data are a sample rather than the entire population).
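For reference, the two definitions written out, with N standing for length, mu for the population mean and x-bar for the sample mean (the symbols are my own notation, not something from the question):

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}
\qquad
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}
```

The only difference is the divisor, so for the same data the sample value s is always slightly larger than the population value sigma.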
Upvotes: 2