Reputation: 3374
I want to find the rank correlation of various columns in a data.frame using dplyr.
I am sure there is a simple solution to this problem, but I think the problem lies in me not being able to use two inputs in summarize_each_ in dplyr when using the cor function.
For the following df:
df <- data.frame(Universe=c(rep("A",5),rep("B",5)),AA.x=rnorm(10),BB.x=rnorm(10),CC.x=rnorm(10),AA.y=rnorm(10),BB.y=rnorm(10),CC.y=rnorm(10))
I want to get the rank correlations between all the .x and the .y combinations. My problem in the function below where you see ????
cor <- df %>% group_by(Universe) %>%
summarize_each_(funs(cor(.,method = 'spearman',use = "pairwise.complete.obs")),????)
I want cor to just include the correlation pairs: AA.x.AA.y , AA.x,BB.y, ... for each Universe.
Please help!
Upvotes: 2
Views: 6212
Reputation: 3374
So here follows the winning (time-wise) solution to my problem:
d <- df %>% gather(R1,R1v,contains(".x")) %>% gather(R2,R2v,contains(".y"),-Universe) %>% group_by(Universe,R1,R2) %>%
summarize(ICAC = cor(x=R1v, y=R2v,method = 'spearman',use = "pairwise.complete.obs")) %>%
unite(Pair, R1, R2, sep="_")
Albeit 0.005 milliseconds in this example, adding data adds time.
Upvotes: 3
Reputation: 5530
An alternative approach is to just call the cor
function once since this will calculate all required correlations. Repeated calls to cor
might be a performance issue for a large data set. Code to do this and extract the correlation pairs with labels could look like:
#
# calculate correlations and display in matrix format
#
cor_matrix <- df %>% group_by(Universe) %>%
do(as.data.frame(cor(.[,-1], method="spearman", use="pairwise.complete.obs")))
#
# to add row names
#
cor_matrix1 <- cor_matrix %>%
data.frame(row=rep(colnames(.)[-1], n_groups(.)))
#
# calculate correlations and display in column format
#
num_col=ncol(df[,-1])
out_indx <- which(upper.tri(diag(num_col)))
cor_cols <- df %>% group_by(Universe) %>%
do(melt(cor(.[,-1], method="spearman", use="pairwise.complete.obs"), value.name="cor")[out_indx,])
Upvotes: 6
Reputation: 19005
Try this:
library(data.table) # needed for fast melt
setDT(df) # sets by reference, fast
mdf <- melt(df[, id := 1:.N], id.vars = c('Universe','id'))
mdf %>%
mutate(obs_set = substr(variable, 4, 4) ) %>% # ".x" or ".y" subgroup
full_join(.,., by=c('Universe', 'obs_set', 'id')) %>% # see notes
group_by(Universe, variable.x, variable.y) %>%
filter(variable.x != variable.y) %>%
dplyr::summarise(rank_corr = cor(value.x, value.y,
method='spearman', use='pairwise.complete.obs'))
Produces:
Universe variable.x variable.y rank_corr
(fctr) (fctr) (fctr) (dbl)
1 A AA.x BB.x -0.9
2 A AA.x CC.x -0.9
3 A BB.x AA.x -0.9
4 A BB.x CC.x 0.8
5 A CC.x AA.x -0.9
6 A CC.x BB.x 0.8
7 A AA.y BB.y -0.3
8 A AA.y CC.y 0.2
9 A BB.y AA.y -0.3
10 A BB.y CC.y -0.3
.. ... ... ... ...
Explanation:
Melt: converts table to long form, one row per observation. To do the melt in a dplyr
chain, you would have to use tidyr::gather
, I believe, so pick your dependency. Using data.table
there is faster and not hard to understand. The step also creates an id
for each observation, 1 to nrow(df)
. The rest is in dplyr
like you wanted.
Full join: joins the melted table to itself to create paired observations from all variable pairings based on common Universe
and observation id
(edit: and now '.x' or '.y' subgroup).
Filter: we don't need to correlate observations paired to themselves, we know those correlations = 1. If you wanted to include them for a correlation matrix or something, comment out this step.
Summarize using Spearman correlation. Note you should use dplyr::summarise
since if you have plyr
also loaded you might accidentally call plyr::summarise
.
Upvotes: 2