Reputation: 45
Let's say I have a dataframe (sorted on rank) that looks as follows:
Num score rank
0 1 1.937 6.0
0 2 1.819 5.0
0 3 1.704 4.0
0 6 1.522 3.0
0 4 1.396 2.0
0 5 1.249 1.0
and I want to add a column that displays the highest correlation of that variable to the variables that were ranked above. With the said correlation matrix being:
Num1 Num2 Num3 Num4 Num5 Num6
Num1 1.0 0.976 0.758 0.045 0.137 0.084
Num2 0.976 1.0 0.749 0.061 0.154 0.096
Num3 0.758 0.749 1.0 -0.102 0.076 -0.047
Num4 0.045 0.061 -0.102 1.0 0.917 0.893
Num5 0.137 0.154 0.076 0.917 1.0 0.863
Num6 0.084 0.096 -0.047 0.893 0.863 1.0
I would expect to get:
Num score rank highestcor
0 1 1.937 6.0 NaN
0 2 1.819 5.0 0.976
0 3 1.704 4.0 0.758
0 6 1.522 3.0 0.096
0 4 1.396 2.0 0.893
0 5 1.249 1.0 0.917
How would I go about this in an efficient way?
Upvotes: 2
Views: 129
Reputation: 93151
Here's one way to do it in numpy:
# Convert the correlation dataframe to numpy array
corr = corr_df.to_numpy()
# Fill the diagonal with negative infinity
np.fill_diagonal(corr, -np.inf)
# Rearrange the correlation matrix in Rank order. I assume
# df["Num"] column contains number 1 to n
num = df["Num"] - 1
corr = corr[num, :]
corr = corr[:, num]
# Mask out the upper triangle with -np.inf because these
# columns rank lower than the current row. `triu` = triangle
# upper
triu_index = np.triu_indices_from(corr)
corr[triu_index] = -np.inf
# And the highest correlation is simply the max of the
# remaining columns
highest_corr = corr.max(axis=1)
highest_corr[0] = np.nan
df["highest_corr"] = highest_corr
Upvotes: 1