Reputation: 4274
I have a dataframe like this:
tmp <- read.table(header = T, text = "gene_id gene_symbol ensembl_id keep val1 val2 val3
x a Multiple Yes 1 2 3
x1 a Multiple No 2 3 4
x2 a Multiple No 1 4 3
y b Multiple Yes 22 20 12
y1 b Multiple No 98 7 97
y2 b Multiple No 8 76 6")
I am trying to group by the gene_symbol variable, calculate the correlation between the row where keep == "Yes" and each of the other rows (keep == "No"), and return the average correlation along with the gene_symbol and gene_id. This is the function:
library(dplyr) # provides %>%, filter() and select()

# function to calculate avg. correlation
calc.mean.corr <- function(x){
  # id of the single kept row in this group
  gene.id <- x[which(x$keep == "Yes"),"gene_id"]
  # values of the kept row as a numeric vector
  x1 <- x %>%
    filter(keep == "Yes") %>%
    select(-c(gene_id, gene_symbol, ensembl_id, keep)) %>%
    as.numeric()
  # values of the discarded rows
  x2 <- x %>%
    filter(keep == "No") %>%
    select(-c(gene_id, gene_symbol, ensembl_id, keep))
  # correlation of kept id with discarded ids
  cor <- mean(apply(x2, 1, FUN = function(y) cor(x1, y)))
  cor <- round(cor, digits = 2)
  df <- data.frame(avg.cor = cor, gene_id = gene.id)
  return(df)
}
# call using ddply
for.corr <- plyr::ddply(tmp, .variables = "gene_symbol", .fun = function(x) calc.mean.corr(x))
The final output looks like this:
> for.corr
  gene_symbol avg.cor gene_id
1           a    0.83       x
2           b    0.02       y
I am using plyr::ddply for this but want to use dplyr instead. However, I am not sure how to convert it to dplyr format. Any help would be much appreciated.
Upvotes: 1
Views: 213
Reputation: 887048
If we don't want to change the function, one option is to do a group_split and apply the function to each group:
library(dplyr)
library(purrr)
tmp %>%
  group_split(gene_symbol) %>%
  map_dfr(calc.mean.corr)
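To include the gene_symbol in the output, split on it with base split() (which names each list element by group) and pass .id to map_dfr: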
tmp %>%
  split(.$gene_symbol) %>%
  map_dfr(~ calc.mean.corr(.), .id = 'gene_symbol')
#  gene_symbol avg.cor gene_id
#1           a    0.83       x
#2           b    0.02       y
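Another option that stays within dplyr alone may be group_by() followed by group_modify(). The following is only a minimal sketch, not tested beyond the example data: it assumes a dplyr version where group_modify() accepts the .keep argument, and it repeats the question's data and function so that it runs on its own. .keep = TRUE is needed here because the function's select() call expects gene_symbol to still be present in each group's data.

```r
library(dplyr)

# Same example data as in the question
tmp <- read.table(header = TRUE, text = "gene_id gene_symbol ensembl_id keep val1 val2 val3
x a Multiple Yes 1 2 3
x1 a Multiple No 2 3 4
x2 a Multiple No 1 4 3
y b Multiple Yes 22 20 12
y1 b Multiple No 98 7 97
y2 b Multiple No 8 76 6")

# The question's function, unchanged apart from indentation
calc.mean.corr <- function(x){
  gene.id <- x[which(x$keep == "Yes"),"gene_id"]
  x1 <- x %>%
    filter(keep == "Yes") %>%
    select(-c(gene_id, gene_symbol, ensembl_id, keep)) %>%
    as.numeric()
  x2 <- x %>%
    filter(keep == "No") %>%
    select(-c(gene_id, gene_symbol, ensembl_id, keep))
  # correlation of kept id with discarded ids
  cor <- mean(apply(x2, 1, FUN = function(y) cor(x1, y)))
  cor <- round(cor, digits = 2)
  data.frame(avg.cor = cor, gene_id = gene.id)
}

# group_modify() calls the function once per group and prepends the
# grouping column (gene_symbol) to each returned data frame.
# .keep = TRUE leaves gene_symbol inside the per-group data as well,
# so the select(-c(..., gene_symbol, ...)) call in the function still works.
for.corr <- tmp %>%
  group_by(gene_symbol) %>%
  group_modify(~ calc.mean.corr(.x), .keep = TRUE) %>%
  ungroup()

for.corr
```

On the example data this should give the same values as the ddply call (avg.cor of 0.83 for a and 0.02 for b), returned as a tibble rather than a plain data.frame.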
Upvotes: 2