JKC

Reputation: 2618

Python - Calculating standard deviation (row level) of dataframe columns

I have created a pandas DataFrame and can compute the standard deviation of one or more of its columns (column level). I need the standard deviation across all the rows of a particular column. Below are the commands I have tried so far:

# Will determine the standard deviation of all the numerical columns by default.
inp_df.std()

salary         8.194421e-01
num_months     3.690081e+05
no_of_hours    2.518869e+02

# Same as above command. Performs the standard deviation at the column level.
inp_df.std(axis = 0)

# Determines the standard deviation over only the salary column of the dataframe.
inp_df[['salary']].std()

salary         8.194421e-01

# Determines Standard Deviation for every row present in the dataframe. But it
# does this for the entire row and it will output values in a single column.
# One std value for each row.
inp_df.std(axis=1)

0       4.374107e+12
1       4.377543e+12
2       4.374026e+12
3       4.374046e+12
4       4.374112e+12
5       4.373926e+12

When I execute the command below, I get NaN for every record. Is there a way to resolve this?

# Trying to determine standard deviation only for the "salary" column at the
# row level.
inp_df[['salary']].std(axis = 1)

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN

Upvotes: 6

Views: 7306

Answers (1)

jezrael

Reputation: 862511

This is expected. The documentation for DataFrame.std says:

Normalized by N-1 by default. This can be changed using the ddof argument

With a single element, N - 1 is 0, so you are dividing by zero. If you select one column and ask for the sample standard deviation across columns (axis=1), each row contains exactly one value, so every result is a missing value (NaN).
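To make the division by zero concrete, here is a minimal sketch that computes the standard deviation by hand. The helper name `sample_std` is hypothetical (it is not a pandas function) and only mirrors the N - ddof normalization quoted above:

```python
import math

import numpy as np
import pandas as pd


def sample_std(values, ddof=1):
    """Hypothetical helper: standard deviation normalized by N - ddof."""
    values = np.asarray(values, dtype=float)
    n = values.size
    if n - ddof <= 0:
        # Not enough observations for this ddof, so the result is undefined.
        return float("nan")
    squared_dev = (values - values.mean()) ** 2
    return math.sqrt(squared_dev.sum() / (n - ddof))


one_value = [10]
print(sample_std(one_value))          # N - 1 == 0, undefined
print(sample_std(one_value, ddof=0))  # N - 0 == 1, zero spread

# pandas behaves the same way for a one-element Series
s = pd.Series(one_value)
print(s.std())
print(s.std(ddof=0))
```

A row of a one-column DataFrame is exactly this one-value case, which is why `std(axis=1)` returns NaN for every row.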

Sample:

inp_df = pd.DataFrame({'salary':[10,20,30],
                       'num_months':[1,2,3],
                       'no_of_hours':[2,5,6]})
print (inp_df)
   salary  num_months  no_of_hours
0      10           1            2
1      20           2            5
2      30           3            6

Select one column with single [] to get a Series:

print (inp_df['salary'])
0    10
1    20
2    30
Name: salary, dtype: int64

The std of a Series is a scalar:

print (inp_df['salary'].std())
10.0

Select one column with double [] to get a one-column DataFrame:

print (inp_df[['salary']])
   salary
0      10
1      20
2      30

The std of that DataFrame along the index (axis=0, the default) is a one-element Series:

print (inp_df[['salary']].std())
#same as
#print (inp_df[['salary']].std(axis=0))
salary    10.0
dtype: float64

The std of that DataFrame across columns (axis=1) is all NaNs, because each row holds a single value:

print (inp_df[['salary']].std(axis = 1))
0   NaN
1   NaN
2   NaN
dtype: float64

If you change the default ddof=1 to ddof=0, the divisor becomes N instead of N-1, and each single-value row has a std of 0.0:

print (inp_df[['salary']].std(axis = 1, ddof=0))
0    0.0
1    0.0
2    0.0
dtype: float64
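As a cross-check, NumPy's `std` uses ddof=0 by default, whereas pandas uses ddof=1. The short sketch below (variable names are illustrative) compares the two on the one-column case:

```python
import numpy as np
import pandas as pd

inp_df = pd.DataFrame({'salary': [10, 20, 30]})

# NumPy normalizes by N (ddof=0) unless told otherwise
numpy_row_std = np.std(inp_df[['salary']].to_numpy(), axis=1)

# pandas needs ddof=0 passed explicitly to match
pandas_row_std = inp_df[['salary']].std(axis=1, ddof=0)

print(numpy_row_std)
print(pandas_row_std.to_numpy())
```

Both approaches report zero spread per row, which is mathematically valid but rarely informative for a single column.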

If you want the std across two or more columns:

#select 2 columns
print (inp_df[['salary', 'num_months']])
   salary  num_months
0      10           1
1      20           2
2      30           3

#std by index
print (inp_df[['salary','num_months']].std())
salary        10.0
num_months     1.0
dtype: float64

#std by columns (one value per row), here for salary and no_of_hours
print (inp_df[['salary','no_of_hours']].std(axis = 1))
0     5.656854
1    10.606602
2    16.970563
dtype: float64
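If the row-wise result should sit next to the data, one option is to assign it to a new column. This is only a sketch: the column name `row_std` and the `cols` list are made up for illustration.

```python
import pandas as pd

inp_df = pd.DataFrame({'salary': [10, 20, 30],
                       'num_months': [1, 2, 3],
                       'no_of_hours': [2, 5, 6]})

# Columns to include in the row-wise calculation (at least two are
# needed for the default ddof=1 to give a non-NaN result)
cols = ['salary', 'no_of_hours']

# axis=1 computes one standard deviation per row across the selected columns
inp_df['row_std'] = inp_df[cols].std(axis=1)
print(inp_df)
```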

Upvotes: 6
