Reputation: 1134
Assuming I have a dataframe similar to the below, how would I get the correlation between 2 specific columns, grouped by the 'ID' column? I believe the Pandas 'corr' method finds the correlation between all columns. If possible, I would also like to know how I could find the 'groupby' correlation using the .agg function (e.g. np.correlate).
What I have:
ID Val1 Val2 OtherData OtherData
A 5 4 x x
A 4 5 x x
A 6 6 x x
B 4 1 x x
B 8 2 x x
B 7 9 x x
C 4 8 x x
C 5 5 x x
C 2 1 x x
What I need:
ID Correlation_Val1_Val2
A 0.12
B 0.22
C 0.05
Upvotes: 50
Views: 70031
Reputation: 1
Interestingly, this also works, with elegant output:
df.groupby('ID')['Val1'].corr(df['Val2'])
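A minimal sketch of that one-liner against the OP's data (the frame below is retyped from the question):

```python
import pandas as pd

# The OP's ID/Val1/Val2 columns, retyped from the question
df = pd.DataFrame({
    'ID':   ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Val1': [5, 4, 6, 4, 8, 7, 4, 5, 2],
    'Val2': [4, 5, 6, 1, 2, 9, 8, 5, 1],
})

# SeriesGroupBy.corr aligns the passed Series on the index,
# so this relies on df['Val2'] sharing df's row labels.
result = df.groupby('ID')['Val1'].corr(df['Val2'])
```

One caveat: because `df['Val2']` is matched by index, this silently breaks if the index has been reset or reordered relative to the grouped frame.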
Upvotes: 0
Reputation: 3295
Combining others' answers into something that does not rely on implicit indices.
Getting correlation by group is relatively simple:
corr_df = op_data\
    .groupby('ID')[['Val1', 'Val2']]\
    .corr().unstack()[[('Val1', 'Val2')]]\
    .reset_index()
  ID      Val1
          Val2
0  A  0.500000
1  B  0.385727
2  C  0.714575
Now, to get the desired flat column structure, we can use some tricks:
corr_df.columns = corr_df.columns.map(lambda x: '_'.join(a for a in x if len(a)>0))
  ID  Val1_Val2
0  A   0.500000
1  B   0.385727
2  C   0.714575
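Putting both steps together, a runnable sketch (the frame is retyped from the question, standing in for op_data):

```python
import pandas as pd

# The OP's data, retyped from the question
op_data = pd.DataFrame({
    'ID':   list('AAABBBCCC'),
    'Val1': [5, 4, 6, 4, 8, 7, 4, 5, 2],
    'Val2': [4, 5, 6, 1, 2, 9, 8, 5, 1],
})

# unstack() turns the per-group 2x2 matrices into one row per ID;
# [[('Val1', 'Val2')]] keeps only the off-diagonal cell we care about
corr_df = (
    op_data.groupby('ID')[['Val1', 'Val2']]
    .corr()
    .unstack()[[('Val1', 'Val2')]]
    .reset_index()
)

# Flatten the MultiIndex columns: ('ID', '') -> 'ID', ('Val1', 'Val2') -> 'Val1_Val2'
corr_df.columns = corr_df.columns.map(lambda x: '_'.join(a for a in x if len(a) > 0))
```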
P.S. To parse the OP's data:
import io
import pandas as pd

# collapse runs of spaces to a single space so sep=' ' works
for rep in range(10):
    data_str = data_str.replace('  ', ' ')
op_data = pd.read_csv(io.StringIO(data_str), sep=' ', engine='python')
Upvotes: 1
Reputation: 609
If you also need to keep the structure of your dataframe here's what I came up with:
Assuming you start with this dataframe:
corr_df = x.groupby('ID')[['Val1','Val2']].corr().unstack()
corr_df.columns = ['Correlation_Val1_Val1', 'Correlation_Val1_Val2', 'Correlation_Val2_Val1', 'Correlation_Val2_Val2']
corr_df = corr_df.reset_index().drop(columns=['Correlation_Val1_Val1', 'Correlation_Val2_Val2', 'Correlation_Val2_Val1'])
which results in:
  ID  Correlation_Val1_Val2
0  A               0.500000
1  B               0.385727
2  C               0.714575
Upvotes: 0
Reputation: 390
One more simple solution:
df.groupby('ID')[['Val1','Val2']].corr().unstack().iloc[:,1]
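A sketch of that one-liner end to end (data retyped from the question). After unstack(), the columns are the four (row, column) cells of each 2x2 matrix in order, so position 1 is the ('Val1', 'Val2') cell:

```python
import pandas as pd

# The OP's data, retyped from the question
df = pd.DataFrame({
    'ID':   list('AAABBBCCC'),
    'Val1': [5, 4, 6, 4, 8, 7, 4, 5, 2],
    'Val2': [4, 5, 6, 1, 2, 9, 8, 5, 1],
})

# iloc[:, 1] picks the ('Val1', 'Val2') column of the unstacked matrix
out = df.groupby('ID')[['Val1', 'Val2']].corr().unstack().iloc[:, 1]
```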
Upvotes: 16
Reputation: 30414
You pretty much figured out all the pieces, just need to combine them:
>>> df.groupby('ID')[['Val1','Val2']].corr()
Val1 Val2
ID
A Val1 1.000000 0.500000
Val2 0.500000 1.000000
B Val1 1.000000 0.385727
Val2 0.385727 1.000000
In your case, printing out a 2x2 for each ID is excessively verbose. I don't see an option to print a scalar correlation instead of the whole matrix, but you can do something simple like this if you only have two variables:
>>> df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2,-1]
ID
A Val1 0.500000
B Val1 0.385727
For 3 or more variables, it is not straightforward to create concise output but you could do something like this:
groups = ['Val1', 'Val2', 'Val3', 'Val4']
df2 = pd.DataFrame()
for i in range(len(groups) - 1):
    df2 = df2.append(df.groupby('ID')[groups].corr().stack()
                     .loc[:, groups[i], groups[i+1]:].reset_index())
df2.columns = ['ID', 'v1', 'v2', 'corr']
df2.set_index(['ID', 'v1', 'v2']).sort_index()
Note that if we didn't have the groupby element, it would be straightforward to use an upper- or lower-triangle function from NumPy. But since that element is present, it is not so easy to produce concise output in a more elegant manner, as far as I can tell.
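As a side note, DataFrame.append was removed in pandas 2.0, so the loop above no longer runs on current pandas. One way to get the same long-format upper triangle without append (the Val3/Val4 columns here are synthetic, invented purely for illustration since the OP's frame has only two value columns):

```python
import numpy as np
import pandas as pd

# Synthetic 4-column data for illustration; Val3/Val4 are not in the OP's frame
rng = np.random.default_rng(0)
groups = ['Val1', 'Val2', 'Val3', 'Val4']
df = pd.DataFrame(rng.normal(size=(30, 4)), columns=groups)
df['ID'] = np.repeat(['A', 'B', 'C'], 10)

# Long format: one row per (ID, v1, v2) cell of each correlation matrix
long = df.groupby('ID')[groups].corr().stack().reset_index()
long.columns = ['ID', 'v1', 'v2', 'corr']

# Keep each unordered pair once (upper triangle, diagonal excluded)
pos = {g: i for i, g in enumerate(groups)}
df2 = (long[long['v1'].map(pos) < long['v2'].map(pos)]
       .set_index(['ID', 'v1', 'v2'])
       .sort_index())
```

With 4 variables and 3 groups this yields 6 pairs per ID, 18 rows in all.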
Upvotes: 69
Reputation: 912
In the above answer, since ix has been deprecated, use iloc instead, with some other minor changes:
df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2][['Val2']] # to get pandas DataFrame
or
df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2]['Val2'] # to get pandas Series
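A quick sketch contrasting the two forms (data retyped from the question):

```python
import pandas as pd

# The OP's data, retyped from the question
df = pd.DataFrame({
    'ID':   list('AAABBBCCC'),
    'Val1': [5, 4, 6, 4, 8, 7, 4, 5, 2],
    'Val2': [4, 5, 6, 1, 2, 9, 8, 5, 1],
})

mat = df.groupby('ID')[['Val1', 'Val2']].corr()

# Every other row of the stacked matrices is the Val1 row;
# its Val2 entry is the cross-correlation we want.
frame = mat.iloc[0::2][['Val2']]   # double brackets -> DataFrame
series = mat.iloc[0::2]['Val2']    # single brackets -> Series
```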
Upvotes: 6