Reputation: 4700
I have a dataframe df
that looks like:
0 1 2 3 4 5 6 7 8
0 2014-03-19T12:44:32.695Z 1395233072695 703425 0 2 1 13 5 21
1 2014-03-19T12:44:32.727Z 1395233072727 703425 0 2 1 13 5 21
The columns are all type int (except the first one):
0 object
1 int64
2 int64
3 int64
4 int64
5 int64
6 int64
7 int64
8 int64
But in my correlation, some columns seem to be NaN. When I call df.corr()
, I get the following output:
1 2 3 4 5 6 7 8
1 1.000000 NaN 0.018752 -0.550307 NaN NaN 0.075191 0.775725
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 0.018752 NaN 1.000000 -0.067293 NaN NaN -0.579651 0.004593
...
Upvotes: 26
Views: 48934
Reputation: 23091
As @Karl D. mentioned, if the values in a column are the same, then there's no variance in that column, i.e. the values are constant and the correlation between a random variable and a constant is undefined. Another way to look at it is if a column with zero variance is plotted in a graph, it will be a horizontal line, which means it doesn't "move together" with any other plot on the same graph.
If we're interested in a correlation matrix, then one way is to simply drop the columns where correlation is not defined:
corr = df.corr()
corr_notna = ~corr.isna().all()
corr = corr.loc[corr_notna, corr_notna]
# or as a one-liner
corr = df.corr().dropna(how='all', axis=1).dropna(how='all')
The difference between a constant column and another random variable will have correlation = 1 or -1. For example,
df = pd.DataFrame({'A': [1]*5, 'B': [*[1]*4, 0]})
df.corr()
df.corr()
A B
A NaN NaN
B NaN 1.0
df['B-A'] = df['B'] - df['A']
df[['B', 'B-A']].corr()
B B-A
B 1.0 1.0
B-A 1.0 1.0
Another way to get NaN is if there are not enough data to compute correlation especially if min_periods=
was set to a number. An example:
df = pd.DataFrame({'A': [*range(9), float('nan')], 'B': range(10)})
df.corr(min_periods=10)
A B
A NaN NaN
B NaN 1.0
Upvotes: 1
Reputation: 13757
Those columns do not change in value right now, yes
As, Joris points out you would expected NaN
if the values do not vary. To see why take a look at correlation formula:
cor(i,j) = cov(i,j)/[stdev(i)*stdev(j)]
If the values of the ith or jth variable do not vary, then the respective standard deviation will be zero and so will the denominator of the fraction. Thus, the correlation will be NaN
.
Upvotes: 45