Reputation: 64004
I have the following CSV data:
id,gene,celltype,stem,stem,stem,bcell,bcell,tcell
id,gene,organs,bm,bm,fl,pt,pt,bm
134,foo,about_foo,20,10,11,23,22,79
222,bar,about_bar,17,13,55,12,13,88
I then select the columns labeled 3 and 4 (the 4th and 5th columns, counting from zero):
import pandas as pd
df = pd.read_csv("http://dpaste.com/1X74TNP.txt",header=None)
df_genes = df.iloc[2:]
df_genes[df_genes.columns[[3,4]]]
Which gives:
Out[217]:
3 4
2 20 10
3 17 13
But when I average them it gives this:
In [219]: df_genes[df_genes.columns[[3,4]]].mean(axis=1)
Out[219]:
2 1005.0
3 856.5
dtype: float64
What's the right way to do it? The correct result is 15 for both rows.
Upvotes: 2
Views: 1552
Reputation: 1033
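As cel pointed out, the dtype of the columns is not correct. Because the first two rows contain text, the columns are read as strings, so the sum inside mean() concatenates them: "20" + "10" becomes "2010", and 2010 / 2 = 1005.0. If you need to read in the entire data set and cannot use skiprows as cel suggests, an alternative is to call astype() before mean():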
In [32]: df_genes[df_genes.columns[[3,4]]].astype('float64').mean(axis=1)
Out[32]:
2 15
3 15
dtype: float64
I always try to check the dtype of columns prior to performing operations, because the wrong dtype can lead to strange results.
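The dtype check described above can be sketched as follows. This is a minimal illustration, not the only way to do it: the CSV text is copied inline from the question so the snippet runs without the URL, the names CSV_TEXT, subset, numeric and row_means are made up for the example, and pd.to_numeric is swapped in for astype() as one possible conversion.

```python
import io

import pandas as pd

# Inline copy of the CSV from the question (assumption: same content as the URL).
CSV_TEXT = """id,gene,celltype,stem,stem,stem,bcell,bcell,tcell
id,gene,organs,bm,bm,fl,pt,pt,bm
134,foo,about_foo,20,10,11,23,22,79
222,bar,about_bar,17,13,55,12,13,88
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), header=None)
df_genes = df.iloc[2:]
subset = df_genes[[3, 4]]

# Inspect the dtypes first: the two annotation rows make these columns
# non-numeric (string/object), even though the remaining values look like numbers.
print(subset.dtypes)

# Convert each column to a numeric dtype explicitly, then average per row.
numeric = subset.apply(pd.to_numeric)
row_means = numeric.mean(axis=1)
print(row_means)
```

After the conversion, each row mean is 15, matching the result expected in the question.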
Upvotes: 3
Reputation: 31349
In pandas, all values in a DataFrame column share one data type. Do not read the first two annotation rows; otherwise pandas will fail to recognize that these columns are in fact numeric.
import pandas as pd
df = pd.read_csv("http://dpaste.com/1X74TNP.txt", skiprows=2, header=None)
df_genes = df[[3,4]]
df_genes.mean(axis=1)
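For readers who cannot reach the URL, here is a self-contained sketch of the same skiprows approach. It assumes the CSV content shown in the question, inlined as a string; CSV_TEXT and row_means are names introduced only for this example.

```python
import io

import pandas as pd

# Inline copy of the CSV from the question (assumption: same content as the URL).
CSV_TEXT = """id,gene,celltype,stem,stem,stem,bcell,bcell,tcell
id,gene,organs,bm,bm,fl,pt,pt,bm
134,foo,about_foo,20,10,11,23,22,79
222,bar,about_bar,17,13,55,12,13,88
"""

# Skip the two annotation rows so only the data rows are parsed.
df = pd.read_csv(io.StringIO(CSV_TEXT), skiprows=2, header=None)

# With the text rows gone, columns 3 and 4 are inferred as integers.
df_genes = df[[3, 4]]
print(df_genes.dtypes)

# Row-wise mean over the two selected columns.
row_means = df_genes.mean(axis=1)
print(row_means)
```

Note that the row index now starts at 0 rather than 2, because the skipped rows never enter the DataFrame.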
Upvotes: 5