Reputation: 8126
How do I produce a rank correlation matrix in an elegant way in R given a data frame with many columns? I couldn't find a built-in function, so I tried
> test=data.frame(x=c(1,2,3,4,5), y=c(5,4,3,2,1))
> cor(rank(test))
(only 2 columns for simplicity, real data has 5 columns) which gave
> Error in cor(rank(test)) : supply both 'x' and 'y' or a matrix-like 'x'
I figured that this was because rank takes a single vector. So then I tried
> cor(lapply(test,rank))
to get rank applied to each column in the data frame, treating the data frame as a list, which gave the error
> supply both 'x' and 'y' or a matrix-like 'x'
and I finally ended up getting something working with
> cor(data.frame(lapply(test,rank)))
   x  y
x  1 -1
y -1  1
However this seems pretty verbose and ugly. I'm thinking there must be a better way -- if so what?
Upvotes: 4
Views: 7878
Reputation: 368301
You are doing it wrong -- use the method argument of cor() instead, e.g. method="kendall" (method="spearman" is the one that matches ranking each column and then correlating):
R> testdf <- data.frame(x=c(1,2,3,4,5), y=c(5,4,3,2,1))
R> cor(testdf, method="kendall")
   x  y
x  1 -1
y -1  1
R>
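On perfectly monotone data like the above, Kendall's tau and Spearman's rho both come out as -1, so the two methods look interchangeable. A small sketch on made-up data (the data frame df and its values are purely illustrative, not from the question) shows that they are different statistics in general:

```r
# Illustrative data: y is mostly, but not perfectly, decreasing in x
df <- data.frame(x = c(1, 2, 3, 4, 5),
                 y = c(5, 3, 4, 2, 1),
                 z = c(2, 1, 4, 3, 5))

rho <- cor(df, method = "spearman")  # Spearman's rho for every pair of columns
tau <- cor(df, method = "kendall")   # Kendall's tau for every pair of columns

print(rho)
print(tau)
```

Either call returns the full matrix for all columns at once, so it should scale to the 5-column case without any extra code.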
From help(cor):
For cor(), if method is "kendall" or "spearman", Kendall's tau or Spearman's rho statistic is used to estimate a rank-based measure of association. These are more robust and have been recommended if the data do not necessarily come from a bivariate normal distribution. For cov(), a non-Pearson method is unusual but available for the sake of completeness. Note that "spearman" basically computes cor(R(x), R(y)) (or cov(., .)) where R(u) := rank(u, na.last="keep"). In the case of missing values, the ranks are calculated depending on the value of use, either based on complete observations, or based on pairwise completeness with reranking for each pair.
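To make the help text concrete, here is a minimal sketch that checks the "spearman" result against ranking each column by hand, and then tries the two use settings the passage mentions. The data frames test and with_na are invented for illustration; only cor(), rank(), sapply() and all.equal() from base R are assumed.

```r
test <- data.frame(x = c(1, 2, 3, 4, 5), y = c(5, 3, 4, 2, 1))

# Rank each column by hand, then take the ordinary (Pearson) correlation
by_hand <- cor(sapply(test, rank))

# Let cor() do the ranking itself
built_in <- cor(test, method = "spearman")

print(all.equal(by_hand, built_in))

# Illustrative data with missing values
with_na <- data.frame(x = c(1, 2, NA, 4, 5),
                      y = c(5, 3, 4, 2, 1),
                      z = c(2, NA, 4, 3, 5))

# Each pair of columns uses the rows where both are present, re-ranked per pair
pairwise <- cor(with_na, method = "spearman", use = "pairwise.complete.obs")

# Only rows with no missing values in any column are used
complete <- cor(with_na, method = "spearman", use = "complete.obs")

print(pairwise)
print(complete)
```

With the default use = "everything", any pair involving a column with NA would come back as NA, so one of the two settings above is usually what you want for real data.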
Upvotes: 6