neversaint

Reputation: 64004

Why does averaging selected columns in pandas give the wrong result?

I have the following CSV data:

id,gene,celltype,stem,stem,stem,bcell,bcell,tcell
id,gene,organs,bm,bm,fl,pt,pt,bm
134,foo,about_foo,20,10,11,23,22,79
222,bar,about_bar,17,13,55,12,13,88

I then select the columns at positions 3 and 4 (the 4th and 5th columns):

import pandas as pd
df = pd.read_csv("http://dpaste.com/1X74TNP.txt",header=None)
df_genes = df.iloc[2:]
df_genes[df_genes.columns[[3,4]]]

Which gives:

Out[217]:
    3   4
2  20  10
3  17  13

But when I average them it gives this:

In [219]: df_genes[df_genes.columns[[3,4]]].mean(axis=1)
Out[219]:
2    1005.0
3     856.5
dtype: float64

What's the right way to do it? The correct result is 15 for all rows.

Upvotes: 2

Views: 1552

Answers (2)

Steve Misuta

Reputation: 1033

As cel pointed out, the dtype of the columns is not correct. If you need to read in the entire data set and cannot use skiprows as cel suggests, an alternative is to call astype() before mean():

In [32]: df_genes[df_genes.columns[[3,4]]].astype('float64').mean(axis=1)
Out[32]: 
2    15
3    15
dtype: float64

I always try to check the dtype of columns prior to performing operations, because the wrong dtype can lead to strange results.
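As a rough illustration of that dtype check, the sketch below rebuilds the question's data from an inline string (an assumption made so it runs without the dpaste URL) and converts the two columns with pd.to_numeric, which is a swapped-in alternative to astype(). The figures in the question are consistent with the string values being concatenated before the division: "20" + "10" is "2010", and 2010 / 2 is 1005.0. The names CSV_TEXT, concat_mean, numeric and row_means are illustrative only.

```python
import io

import pandas as pd

# Inline copy of the CSV from the question, so the sketch runs offline.
CSV_TEXT = """id,gene,celltype,stem,stem,stem,bcell,bcell,tcell
id,gene,organs,bm,bm,fl,pt,pt,bm
134,foo,about_foo,20,10,11,23,22,79
222,bar,about_bar,17,13,55,12,13,88
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), header=None)
df_genes = df.iloc[2:]

# The two annotation rows force every column to hold strings,
# so none of the columns has a numeric dtype here.
print(df_genes.dtypes)

# Plain-Python view of the first number in the question:
# string "addition" concatenates, then the result is halved.
concat_mean = float("20" + "10") / 2

# Convert the selected columns to numbers before averaging.
numeric = df_genes[[3, 4]].apply(pd.to_numeric)
row_means = numeric.mean(axis=1)
print(row_means)
```

Printing dtypes before any arithmetic makes this kind of problem visible early, since a column of digits stored as strings looks identical to a numeric column when the frame is displayed.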

Upvotes: 3

cel

Reputation: 31349

In pandas, all values in a DataFrame column share the same data type. If you read the first two annotation rows, pandas stores every column as strings and fails to recognize that these columns are in fact numeric, so do not read those rows:

import pandas as pd
df = pd.read_csv("http://dpaste.com/1X74TNP.txt", skiprows=2, header=None)
df_genes = df[[3,4]]
df_genes.mean(axis=1)
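The same approach can be tried without network access. This is a minimal sketch that assumes the file contents match the CSV shown in the question; CSV_TEXT and gene_means are illustrative names.

```python
import io

import pandas as pd

# Inline copy of the CSV from the question (assumed to match the dpaste file).
CSV_TEXT = """id,gene,celltype,stem,stem,stem,bcell,bcell,tcell
id,gene,organs,bm,bm,fl,pt,pt,bm
134,foo,about_foo,20,10,11,23,22,79
222,bar,about_bar,17,13,55,12,13,88
"""

# Skipping the two annotation rows lets pandas infer integer columns.
df = pd.read_csv(io.StringIO(CSV_TEXT), skiprows=2, header=None)
print(df.dtypes)

# Average the columns at positions 3 and 4 for each row.
gene_means = df[[3, 4]].mean(axis=1)
print(gene_means)
```

Note that the row labels now start at 0 rather than 2, because the skipped lines never enter the DataFrame.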

Upvotes: 5
