Victor Roos
Victor Roos

Reputation: 45

Add a column with the highest correlation to the previously ranked variables in pandas

Let's say I have a dataframe (sorted on rank) that looks as follows:

    Num     score   rank
0   1       1.937   6.0
0   2       1.819   5.0
0   3       1.704   4.0
0   6       1.522   3.0
0   4       1.396   2.0
0   5       1.249   1.0

and I want to add a column that displays the highest correlation of that variable to the variables that were ranked above. With the said correlation matrix being:

        Num1    Num2   Num3    Num4    Num5    Num6
Num1    1.0     0.976  0.758   0.045   0.137   0.084
Num2    0.976   1.0    0.749   0.061   0.154   0.096
Num3    0.758   0.749  1.0     -0.102  0.076   -0.047
Num4    0.045   0.061  -0.102  1.0     0.917   0.893
Num5    0.137   0.154  0.076   0.917   1.0     0.863
Num6    0.084   0.096  -0.047  0.893   0.863   1.0

I would expect to get:

    Num     score   rank  highestcor
0   1       1.937   6.0   NaN
0   2       1.819   5.0   0.976
0   3       1.704   4.0   0.758
0   6       1.522   3.0   0.096
0   4       1.396   2.0   0.893
0   5       1.249   1.0   0.917

How would I go about this in an efficient way?

Upvotes: 2

Views: 129

Answers (1)

Code Different
Code Different

Reputation: 93151

Here's one way to do it in numpy:

# Convert the correlation dataframe to numpy array
corr = corr_df.to_numpy()

# Fill the diagonal with negative infinity
np.fill_diagonal(corr, -np.inf)

# Rearrange the correlation matrix in Rank order. I assume
# df["Num"] column contains number 1 to n
num = df["Num"] - 1
corr = corr[num, :]
corr = corr[:, num]

# Mask out the upper triangle with -np.inf because these
# columns rank lower than the current row. `triu` = triangle
# upper
triu_index = np.triu_indices_from(corr)
corr[triu_index] = -np.inf

# And the highest correlation is simply the max of the
# remaining columns
highest_corr = corr.max(axis=1)
highest_corr[0] = np.nan

df["highest_corr"] = highest_corr

Upvotes: 1

Related Questions