Reputation: 63994
I have a CSV that looks like this:
gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
And as data frame it looks like this:
In [17]: import pandas as pd
In [20]: df = pd.read_table("http://dpaste.com/3PQV3FA.txt",sep=",")
In [21]: df
Out[21]:
  gene  stem1  stem2  stem3  b1  b2  b3  special_col
0  foo     20     10     11  23  22  79            3
1  bar     17     13    505  12  13  88            1
2  qui     17     13      5  12  13  88            3
What I want to do is compute the Pearson correlation of the last column (special_col)
with every column between the gene column and the special column,
i.e. colnames[1:number_of_columns - 1].
In the end we will have a data frame of length 6:
Coln PearCorr
stem1 0.5
stem2 -0.5
stem3 -0.9999453506011533
b1 0.5
b2 0.5
b3 -0.5
The stem3 value above was computed manually:
In [27]: import scipy.stats
In [39]: scipy.stats.pearsonr([3, 1, 3], [11,505,5])
Out[39]: (-0.9999453506011533, 0.0066556395400007278)
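For reference, the same manual check can be brute-forced over all six columns; a minimal sketch, assuming the CSV above is saved locally as data.csv:
import pandas as pd
import scipy.stats

df = pd.read_csv("data.csv")  # the CSV shown above

# correlate special_col with each column between gene and special_col
for col in df.columns[1:-1]:
    r, p = scipy.stats.pearsonr(df["special_col"], df[col])
    print(col, r)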
How can I do that?
Upvotes: 27
Views: 97049
Reputation: 481
pd.DataFrame.corrwith() can be used instead of df.corr().
Pass in the column for which we want the correlation with the rest of the columns.
For the specific example above the code would be df.corrwith(df['special_col']),
or simply df.corr()['special_col'] to build the entire correlation of each column with the other columns and subset what you need.
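A minimal sketch of both variants, assuming the question's CSV is loaded from data.csv and that the non-numeric gene column is dropped first (recent pandas versions no longer silently skip it):
import pandas as pd

df = pd.read_csv("data.csv")          # the CSV from the question
numeric = df.drop(columns=["gene"])   # keep only the numeric columns

# correlation of every column with special_col
result = numeric.corrwith(numeric["special_col"]).drop("special_col")

# equivalent via the full correlation matrix
result_alt = numeric.corr()["special_col"].drop("special_col")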
Upvotes: 24
Reputation: 31662
Why not just do:
In [34]: df.corr().iloc[:-1,-1]
Out[34]:
stem1 0.500000
stem2 -0.500000
stem3 -0.999945
b1 0.500000
b2 0.500000
b3 -0.500000
Name: special_col, dtype: float64
or:
In [39]: df.corr().ix['special_col', :-1]
Out[39]:
stem1 0.500000
stem2 -0.500000
stem3 -0.999945
b1 0.500000
b2 0.500000
b3 -0.500000
Name: special_col, dtype: float64
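Note that .ix was removed in pandas 1.0; a sketch of an equivalent selection on a recent version, assuming non-numeric columns such as gene are excluded explicitly:
# row of the correlation matrix for special_col, minus its self-correlation
corr = df.drop(columns=["gene"]).corr()
result = corr.loc["special_col"].drop("special_col")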
Timings
In [35]: %timeit df.corr().iloc[-1,:-1]
1000 loops, best of 3: 576 us per loop
In [40]: %timeit df.corr().ix['special_col', :-1]
1000 loops, best of 3: 634 us per loop
In [36]: %timeit df[df.columns[1:]].corr()['special_col']
1000 loops, best of 3: 968 us per loop
In [37]: %timeit df[df.columns[1:-1]].apply(lambda x: x.corr(df['special_col']))
100 loops, best of 3: 2.12 ms per loop
Upvotes: 7
Reputation: 5976
Note that there is a mistake in your data: there the special col is all 3s, so no correlation can be computed for it.
If you remove the column selection at the end you'll get the correlation matrix of all the other columns you are analysing. The final [:-1] removes the correlation of 'special_col' with itself.
In [15]: data[data.columns[1:]].corr()['special_col'][:-1]
Out[15]:
stem1 0.500000
stem2 -0.500000
stem3 -0.999945
b1 0.500000
b2 0.500000
b3 -0.500000
Name: special_col, dtype: float64
If you are interested in speed, this is slightly faster on my machine:
In [33]: np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
Out[33]:
array([ 0.5, -0.5, -0.99994535, 0.5, 0.5, -0.5])
In [34]: %timeit np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
1000 loops, best of 3: 437 µs per loop
In [35]: %timeit data[data.columns[1:]].corr()['special_col']
1000 loops, best of 3: 526 µs per loop
But obviously, it returns an array, not a pandas series/DF.
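If the column labels are needed, the array can be wrapped back into a Series; a sketch, assuming data holds the question's DataFrame:
import numpy as np
import pandas as pd

num = data[data.columns[1:]]              # numeric columns, special_col last
vals = np.corrcoef(num.T)[-1][:-1]        # last row: correlations with special_col
result = pd.Series(vals, index=num.columns[:-1])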
Upvotes: 24
Reputation: 393973
You can apply on your column range with a lambda that calls corr and passes the Series 'special_col':
In [126]:
df[df.columns[1:-1]].apply(lambda x: x.corr(df['special_col']))
Out[126]:
stem1 0.500000
stem2 -0.500000
stem3 -0.999945
b1 0.500000
b2 0.500000
b3 -0.500000
dtype: float64
Timings
Actually, the other method is quicker, so I expect it to scale better:
In [130]:
%timeit df[df.columns[1:-1]].apply(lambda x: x.corr(df['special_col']))
%timeit df[df.columns[1:]].corr()['special_col']
1000 loops, best of 3: 1.75 ms per loop
1000 loops, best of 3: 836 µs per loop
Upvotes: 12