TooTone

Reputation: 8126

Rank correlation matrix in R

How do I produce a rank correlation matrix in an elegant way in R given a data frame with many columns? I couldn't find a built-in function, so I tried

> test=data.frame(x=c(1,2,3,4,5), y=c(5,4,3,2,1))
> cor(rank(test))

(only 2 columns for simplicity, real data has 5 columns) which gave

> Error in cor(rank(test)) : supply both 'x' and 'y' or a matrix-like 'x'

I figured that this was because rank takes a single vector. So then I tried

> cor(lapply(test,rank))

to get rank applied to each column in the data frame, treating the data frame as a list, which gave the error

> supply both 'x' and 'y' or a matrix-like 'x'

and I finally ended up getting something working with

> cor(data.frame(lapply(test,rank)))
   x  y
x  1 -1
y -1  1

However this seems pretty verbose and ugly. I'm thinking there must be a better way -- if so what?

Upvotes: 4

Views: 7878

Answers (1)

Dirk is no longer here

Reputation: 368301

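You are doing it the hard way: cor() has a method argument for rank correlations. The example below uses "kendall" (Kendall's tau); method="spearman" is the one that reproduces the correlation of ranks computed by hand in the question: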

R> testdf <- data.frame(x=c(1,2,3,4,5), y=c(5,4,3,2,1))  
R> cor(testdf, method="kendall") 
   x  y 
x  1 -1    
y -1  1   
R> 

From help(cor):

For cor(), if method is "kendall" or "spearman", Kendall's tau or Spearman's rho statistic is used to estimate a rank-based measure of association. These are more robust and have been recommended if the data do not necessarily come from a bivariate normal distribution. For cov(), a non-Pearson method is unusual but available for the sake of completeness. Note that "spearman" basically computes cor(R(x), R(y)) (or cov(.,.)) where R(u) := rank(u, na.last="keep"). In the case of missing values, the ranks are calculated depending on the value of use, either based on complete observations, or based on pairwise completeness with reranking for each pair.
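
To make the note about "spearman" concrete, here is a minimal sketch comparing the built-in method with the manual rank-then-correlate route from the question. The data frame df and its values are made up for illustration, and the variable names rho, rho_manual and tau are my own.

```r
# Hypothetical data: these values are invented for illustration only.
df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 1, 4, 3, 5),
  z = c(5, 3, 4, 1, 2)
)

# Spearman's rho via the built-in method argument.
rho <- cor(df, method = "spearman")

# The manual route from the question: rank each column, then Pearson.
# sapply() returns a numeric matrix here, so no data.frame() wrapper is needed.
rho_manual <- cor(sapply(df, rank))

# The two matrices agree up to floating-point error.
stopifnot(isTRUE(all.equal(rho, rho_manual)))

# Kendall's tau is a different statistic and generally gives other values.
tau <- cor(df, method = "kendall")
```

For x and y in this made-up data, Spearman's rho is 0.8 while Kendall's tau is 0.6, so the two methods only coincide in special cases such as the perfectly reversed columns in the question. If the real data has missing values, the use argument described in the help text (for example use="pairwise.complete.obs") controls how they are handled.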

Upvotes: 6
