Reputation: 10697
Let's suppose I have a correlation matrix that looks like this:
df = pd.DataFrame(data={'a':[1,0.2,0.3,0.4],'b':[0.2,1,0.5,0.6],'c':[0.3,0.5,1,0.7],'d':[0.4,0.6,0.7,1]}, index=['a','b','c','d'])
what is the best way to extract the unique values of each pairwise combination (a-b, a-c, etc)?
df2 =
a_b a_c a_d b_c b_d c_d
0.2 0.3 0.4 0.5 0.6 0.7
the only way I see doing this is to write my own function, but was wondering if someone knows a shortcut for this
Upvotes: 5
Views: 1653
Reputation: 5911
if the correlations are corrs
(e.g from corrs = df.corr()
), then the unique correlation values are:
upper_right_entries = np.triu_indices(len(corrs), 1)
corrs.values[upper_right_entries]
this uses numpy.triu_indices
, which produces a list of indices to fetch all upper right entries for a 2-d array. the 1
argument excludes the main diagonal (which in a correlation matrix, is 1.0
).
h/t to @Ji Ma for his answer using np.tril
. my solution is shorter and easier to understand, i think.
Upvotes: 0
Reputation: 81
You can use matrix effectively:
import numpy as np
df = pd.DataFrame(data={'a':[1,0.2,0.3,0.4],'b':[0.2,1,0.5,0.6],'c':[0.3,0.5,1,0.7],'d':[0.4,0.6,0.7,1]}, index=['a','b','c','d'])
unique_values=[s for s in np.tril(df, k=-1).flatten() if s!=0]
print(unique_values)
It gives you: [0.2, 0.3, 0.5, 0.4, 0.6, 0.7]
The key is the np.tril function.
Upvotes: 1
Reputation: 153460
IIUC:
df_out = df.stack()
df_out.index = df_out.index.map('_'.join)
df_out = df_out.to_frame().T
Output:
a_a a_b a_c a_d b_a b_b b_c b_d c_a c_b c_c c_d d_a d_b d_c
0 1.0 0.2 0.3 0.4 0.2 1.0 0.5 0.6 0.3 0.5 1.0 0.7 0.4 0.6 0.7
And, if you want to get rid of a_a, b_b, etc..
df_out = df.stack()
df_out = df_out[df_out.index.get_level_values(0) != df_out.index.get_level_values(1)]
df_out.index = df_out.index.map('_'.join)
df_out = df_out.to_frame().T
Output
a_b a_c a_d b_a b_c b_d c_a c_b c_d d_a d_b d_c
0 0.2 0.3 0.4 0.2 0.5 0.6 0.3 0.5 0.7 0.4 0.6 0.7
Or to get rid of b_a and keep a_b:
df_out = df.stack()
df_out = df_out[df_out.index.get_level_values(0) < df_out.index.get_level_values(1)]
df_out.index = df_out.index.map('_'.join)
df_out = df_out.to_frame().T
Or combining a few lines using lambda function in .loc
:
df_out = df.stack().loc[lambda x: x.index.get_level_values(0) < x.index.get_level_values(1)]
df_out.index = df_out.index.map('_'.join)
df_out = df_out.to_frame().T
Output:
a_b a_c a_d b_c b_d c_d
0 0.2 0.3 0.4 0.5 0.6 0.7
Upvotes: 10
Reputation: 59264
IIUC, you can play with indexes
df2 = df.unstack().reset_index()
s = df2[['level_0', 'level_1']].agg(frozenset,1).drop_duplicates()
df2 = df2.loc[s.index]
ind = df2.agg(lambda k: (k['level_0']+'_'+k['level_1']), axis=1)
df2.set_index(ind)[0].to_frame().T
a_a a_b a_c a_d b_b b_c b_d c_c c_d d_d
0 1.0 0.2 0.3 0.4 1.0 0.5 0.6 1.0 0.7 1.0
Upvotes: 4