Reputation: 1227
I have a dataframe target
that contains columns SNP
and value
:
target <- data.frame("SNP" = c("rs2", "rs4", "rs6", "rs19", "rs8", "rs9"),
"value" = 1:6)
I have 3 other dataframes that contains columns SNP
and int
as a list:
ref1 <- data.frame("SNP" = c("rs1", "rs2", "rs8"), "int" = c(5, 7, 88))
ref2 <- data.frame("SNP" = c("rs9", "rs4", "rs3"), "int" = c(23, 4, 43))
ref3 <- data.frame("SNP" = c("rs10", "rs6", "rs5"), "int" = c(53, 22, 76))
mylist <- list(ref1, ref2, ref3)
I want to add a new column int
for target
whose values correspond to the int
values of the ref1/2/3 with the same SNP
. For example, the first int
value for target
should be 7 because row 2 of ref1 has SNP
of rs2 and int
of 7.
I tried the following code:
for (i in 1:3) {
target <- target %>%
left_join(mylist[[i]], by = "SNP")
}
The matching was quick and successful. However, I was returned 3 new columns rather than 1, shown as below:
I then used the following code:
target[, "ref"] <- NA
for (i in 1:3) {
common <- Reduce(intersect, list(target$SNP, mylist[[i]]$SNP))
tar.pos <- match(common, target$SNP)
ref.pos <- match(common, mylist[[i]]$SNP)
target$ref[tar.pos] <- mylist[[i]]$int[ref.pos]
}
In my real data, I have 22 ref dataframes with each have 1-6 MILLION rows. I would prefer to do the matching and joining ref by ref rather than merging all refs into one single large data. When I tried the second method above on my real data, I noted that the match
function worked very slow. This is why I prefer some smart way of doing the work. I found left_join worked very fast even for my large data. Unfortunately, the output is not exactly what I want.
I wish to do the above work in a fast way, preferably in tidyverse. Is there any suggestion on how do revise the first coding method or any other smarter way?
Upvotes: 2
Views: 129
Reputation: 35554
If binding all data in mylist
and merging to target
takes too much memory, you can use purrr::reduce
to merge one by one.
library(tidyverse)
reduce(mylist,
~ left_join(.x, .y, by = "SNP") %>%
mutate(int = coalesce(int.x, int.y)) %>%
select(-c(int.x, int.y)),
.init = mutate(target, int = NA_real_))
# SNP value int
# 1 rs2 1 7
# 2 rs4 2 4
# 3 rs6 3 22
# 4 rs19 4 NA
# 5 rs8 5 88
# 6 rs9 6 23
Upvotes: 1
Reputation: 887128
With tidyverse
, we can also do
library(dplyr)
bind_rows(mylist) %>%
right_join(target, by = "SNP")
Upvotes: 1
Reputation: 388982
You could convert mylist
into one dataframe and then merge
with target
merge(target, do.call(rbind, mylist), by = "SNP", all.x = TRUE)
# SNP value int
#1 rs19 4 NA
#2 rs2 1 7
#3 rs4 2 4
#4 rs6 3 22
#5 rs8 5 88
#6 rs9 6 23
Or using dplyr
library(dplyr)
left_join(target, bind_rows(mylist), by = "SNP")
Or in data.table
library(data.table)
rbindlist(mylist)[target, on = 'SNP']
Upvotes: 0