Nick
Nick

Reputation: 3374

Correlation using funs in dplyr

I want to find the rank correlation of various columns in a data.frame using dplyr.

I am sure there is a simple solution to this problem, but I think the problem lies in me not being able to use two inputs in summarize_each_ in dplyr when using the cor function.

For the following df:

df <- data.frame(Universe=c(rep("A",5),rep("B",5)),AA.x=rnorm(10),BB.x=rnorm(10),CC.x=rnorm(10),AA.y=rnorm(10),BB.y=rnorm(10),CC.y=rnorm(10))

I want to get the rank correlations between all the .x and the .y combinations. My problem in the function below where you see ????

cor <- df %>% group_by(Universe) %>% 
summarize_each_(funs(cor(.,method = 'spearman',use = "pairwise.complete.obs")),????)

I want cor to just include the correlation pairs: AA.x.AA.y , AA.x,BB.y, ... for each Universe.

Please help!

Upvotes: 2

Views: 6212

Answers (3)

Nick
Nick

Reputation: 3374

So here follows the winning (time-wise) solution to my problem:

d <- df %>% gather(R1,R1v,contains(".x")) %>% gather(R2,R2v,contains(".y"),-Universe) %>% group_by(Universe,R1,R2) %>% 
       summarize(ICAC = cor(x=R1v, y=R2v,method = 'spearman',use = "pairwise.complete.obs")) %>% 
       unite(Pair, R1, R2, sep="_")

Albeit 0.005 milliseconds in this example, adding data adds time.

Upvotes: 3

WaltS
WaltS

Reputation: 5530

An alternative approach is to just call the cor function once since this will calculate all required correlations. Repeated calls to cor might be a performance issue for a large data set. Code to do this and extract the correlation pairs with labels could look like:

#
# calculate correlations and display in matrix format
#
cor_matrix <- df %>% group_by(Universe) %>%
              do(as.data.frame(cor(.[,-1], method="spearman", use="pairwise.complete.obs")))
#
# to add row names
#
cor_matrix1 <- cor_matrix %>%  
              data.frame(row=rep(colnames(.)[-1], n_groups(.))) 
#
# calculate correlations and display in column format
#
num_col=ncol(df[,-1])
out_indx <-  which(upper.tri(diag(num_col))) 
cor_cols <- df %>% group_by(Universe) %>%
            do(melt(cor(.[,-1], method="spearman", use="pairwise.complete.obs"), value.name="cor")[out_indx,])

Upvotes: 6

C8H10N4O2
C8H10N4O2

Reputation: 19005

Try this:

library(data.table)                                           # needed for fast melt
setDT(df)                                                     # sets by reference, fast
mdf <- melt(df[, id := 1:.N], id.vars = c('Universe','id'))

mdf %>% 
  mutate(obs_set = substr(variable, 4, 4) ) %>%               # ".x" or ".y" subgroup
  full_join(.,., by=c('Universe', 'obs_set', 'id')) %>%       # see notes
  group_by(Universe, variable.x, variable.y) %>%
  filter(variable.x != variable.y) %>%
  dplyr::summarise(rank_corr = cor(value.x, value.y, 
                   method='spearman', use='pairwise.complete.obs'))

Produces:

   Universe variable.x variable.y rank_corr
     (fctr)     (fctr)     (fctr)     (dbl)
1         A       AA.x       BB.x      -0.9
2         A       AA.x       CC.x      -0.9
3         A       BB.x       AA.x      -0.9
4         A       BB.x       CC.x       0.8
5         A       CC.x       AA.x      -0.9
6         A       CC.x       BB.x       0.8
7         A       AA.y       BB.y      -0.3
8         A       AA.y       CC.y       0.2
9         A       BB.y       AA.y      -0.3
10        A       BB.y       CC.y      -0.3
..      ...        ...        ...       ...

Explanation:

  1. Melt: converts table to long form, one row per observation. To do the melt in a dplyr chain, you would have to use tidyr::gather, I believe, so pick your dependency. Using data.table there is faster and not hard to understand. The step also creates an id for each observation, 1 to nrow(df). The rest is in dplyr like you wanted.

  2. Full join: joins the melted table to itself to create paired observations from all variable pairings based on common Universe and observation id (edit: and now '.x' or '.y' subgroup).

  3. Filter: we don't need to correlate observations paired to themselves, we know those correlations = 1. If you wanted to include them for a correlation matrix or something, comment out this step.

  4. Summarize using Spearman correlation. Note you should use dplyr::summarise since if you have plyr also loaded you might accidentally call plyr::summarise.

Upvotes: 2

Related Questions