Joshua Rosenberg

Reputation: 4226

Confusing correlation output from stats::cor() with dichotomous data in R

I have a data.frame df with three variables, each taking the value "1" or "0", and no row has a "1" in more than one variable:

> df <- structure(list(var1 = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0), var2 = c(1, 
0, 0, 0, 0, 0, 0, 0, 0, 0), var3 = c(0, 1, 0, 1, 0, 1, 0, 0, 
0, 1)), .Names = c("var1", "var2", "var3"), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

> df

   var1 var2 var3
1     0    1    0
2     0    0    1
3     0    0    0
4     0    0    1
5     1    0    0
6     0    0    1
7     0    0    0
8     1    0    0
9     0    0    0
10    0    0    1

The row sums are at most 1 for all of the rows:

> rowSums(df)
 [1] 1 1 0 1 1 1 0 1 0 1

When I look at the correlations (I used the "spearman" argument because the data are "1"s and "0"s), the output is confusing because there are correlations that are non-zero:

> cor(df, method = "spearman")
           var1       var2       var3
var1  1.0000000 -0.1666667 -0.4082483
var2 -0.1666667  1.0000000 -0.2721655
var3 -0.4082483 -0.2721655  1.0000000

I wondered if this was some strange side-effect of stats::cor(), so I tried Hmisc::rcorr() with the same result:

> Hmisc::rcorr(as.matrix(df), type = "spearman")
      var1  var2  var3
var1  1.00 -0.17 -0.41
var2 -0.17  1.00 -0.27
var3 -0.41 -0.27  1.00

Shouldn't the correlations between all three variables be 0 because there are no rows in which more than one variable has a value of "1"? Am I misunderstanding how correlations work in some profound way? Or am I using these functions incorrectly?

Upvotes: 0

Views: 110

Answers (1)

Consistency

Reputation: 2922

Your observation that the row sums are all at most 1 actually implies some negative correlation between the variables. Negative correlation means that when one variable is larger (in your case 1), the other tends to be smaller (in your case 0), which agrees with your results.
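
As a rough illustration of that point, here is a minimal sketch using var1 and var3 from the question, rebuilt as plain numeric vectors. The names p_when_var1_is_1 and p_when_var1_is_0 are made up for this example. It compares how often var3 is 1 depending on the value of var1:

```r
# Two columns from the question, as plain numeric vectors.
var1 <- c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0)
var3 <- c(0, 1, 0, 1, 0, 1, 0, 0, 0, 1)

# Share of rows with var3 == 1, split by the value of var1
# (hypothetical helper names, used only for this illustration).
p_when_var1_is_1 <- mean(var3[var1 == 1])
p_when_var1_is_0 <- mean(var3[var1 == 0])

print(p_when_var1_is_1)
print(p_when_var1_is_0)

# Cross-tabulation of the two variables: the (1, 1) cell is empty.
print(table(var1 = var1, var3 = var3))
```

If the two variables were unrelated, var3 would be 1 about equally often in both groups. Here knowing that var1 is 1 rules out var3 being 1, and that dependence is what the negative coefficient picks up.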

Your confusion might arise because the inner product of any two of the variables is zero. But a zero inner product does not mean there is no correlation. It implies zero linear correlation only when each variable has been centered to have mean zero, which is certainly not the case here.
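
To make the difference between the raw and the centered inner product concrete, here is a small sketch, again with var1 and var3 from the question as plain vectors. The names c1, c3, raw_inner, centered_inner and r_manual are only for this illustration. It computes the correlation by hand from the centered vectors and prints it next to what cor() returns:

```r
# Two columns from the question, as plain numeric vectors.
var1 <- c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0)
var3 <- c(0, 1, 0, 1, 0, 1, 0, 0, 0, 1)

# Raw inner product: zero, because the two are never 1 in the same row.
raw_inner <- sum(var1 * var3)

# Correlation works on deviations from the mean, so center first.
c1 <- var1 - mean(var1)
c3 <- var3 - mean(var3)
centered_inner <- sum(c1 * c3)

# Pearson correlation computed by hand from the centered vectors.
r_manual <- centered_inner / sqrt(sum(c1^2) * sum(c3^2))

print(raw_inner)
print(centered_inner)
print(r_manual)
print(cor(var1, var3))                       # Pearson
print(cor(var1, var3, method = "spearman"))  # Spearman
```

The raw inner product is 0, but the centered one is negative, so the correlation is negative too. For 0/1 data the ranks are just a linear rescaling of the values, so Spearman and Pearson agree here, and r_manual matches the -0.4082483 shown in your output.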

Upvotes: 1
