DiegoIE

Reputation: 107

Python pandas: correlation of one column vs. all others

I'm trying to get the correlation between a single column and the rest of the numerical columns of the dataframe, but I'm stuck.

I'm trying with this:

corr = IM['imdb_score'].corr(IM)

But I get the error

operands could not be broadcast together with shapes

which I assume is because I'm trying to compute a correlation between a vector (my imdb_score column) and a dataframe of several columns.

How can this be fixed?

Upvotes: 8

Views: 8278

Answers (3)

Cleb

Reputation: 25997

I think you can just use .corr, which returns the correlations between all pairs of columns, and then select just the column you are interested in.

So, something like

IM.corr()['imdb_score']

should work.
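A minimal sketch of this approach on made-up data. The frame contents and every column name other than imdb_score are hypothetical, and restricting to numeric columns with select_dtypes is an assumption added here so that a text column does not get in the way:

```python
import pandas as pd

# Hypothetical stand-in for the asker's IM frame
IM = pd.DataFrame({
    "imdb_score": [7.1, 6.5, 8.2, 5.9, 7.8],
    "budget": [100, 80, 150, 40, 120],
    "duration": [120, 95, 140, 90, 130],
    "title": list("abcde"),  # non-numeric column, excluded below
})

# Full correlation matrix over the numeric columns only
corr_matrix = IM.select_dtypes("number").corr()

# Keep the imdb_score column and drop its trivial self-correlation
score_corr = corr_matrix["imdb_score"].drop("imdb_score")
print(score_corr)
```

This computes every pairwise correlation and discards most of them, which is fine for a handful of columns but wasteful on wide frames.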

Upvotes: 4

mozway

Reputation: 260300

The most efficient method is to use corrwith.

Example:

df.corrwith(df['A'])

Setup of example data:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list('ABCDE'))

#    A  B  C  D  E
# 0  7  2  0  0  0
# 1  4  4  1  7  2
# 2  6  2  0  6  6
# 3  9  8  0  2  1
# 4  6  0  9  7  7

output:

A    1.000000
B    0.526317
C   -0.209734
D   -0.720400
E   -0.326986
dtype: float64
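If the self-correlation row is not wanted, one possible variation is to drop the target column before calling corrwith. This is only a sketch: the frame is the 5x5 example typed in by hand so the numbers are reproducible, and the variable names target, others and result are made up for illustration.

```python
import pandas as pd

# The example frame from the setup, entered by hand instead of generated randomly
df = pd.DataFrame(
    [[7, 2, 0, 0, 0],
     [4, 4, 1, 7, 2],
     [6, 2, 0, 6, 6],
     [9, 8, 0, 2, 1],
     [6, 0, 9, 7, 7]],
    columns=list("ABCDE"),
)

target = "A"

# Every numeric column except the target
others = df.drop(columns=target).select_dtypes("number")

# Correlate each remaining column with the target Series
result = others.corrwith(df[target])
print(result)
```

The values for B to E should agree with the output listed above; only the A row is gone.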

Upvotes: 9

SultanOrazbayev

Reputation: 16551

Rather than calculating all correlations and keeping the ones of interest, it can be computationally more efficient to compute the subset of interesting correlations:

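import pandas as pd

# Toy data: three identical columns, so every correlation comes out as 1.0
df = pd.DataFrame({'a': range(10), 'b': range(10), 'c': range(10)})

# Correlate 'a' with each other column, one pair at a time
pd.DataFrame([[c, df['a'].corr(df[c])] for c in df.columns if c != 'a'], columns=['var', 'corr'])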

Upvotes: 0
