Reputation: 53
I'm relatively new to R and I have a question about merging two data frames that contain similar, but not identical, numeric data in two domains (mz and rt). Here is an example that describes my problem:
mz1 <- c(seq(100, 190, by = 10))
rt1 <- c(seq(1, 10, by = 1))
value1 <- runif(10, min = 100, max = 100000)
mz2 <- mz1 + runif(10, -0.1, 0.1)
rt2 <- rt1 + runif(10, -0.2, 0.2)
value2 <- runif(10, min = 100, max = 100000)
df1 <- as.data.frame(cbind(mz1, rt1, value1))
df2 <- as.data.frame(cbind(mz2, rt2, value2))
df1
mz1 rt1 value1
1 100 1 44605.646
2 110 2 13924.598
3 120 3 35727.265
4 130 4 75175.652
5 140 5 25221.724
6 150 6 29080.653
7 160 7 3170.749
8 170 8 10184.708
9 180 9 48055.072
10 190 10 77644.865
df2
mz2 rt2 value2
1 100.0243 1.043092 58099.49
2 110.0514 2.164753 76397.67
3 120.0258 2.838141 43901.05
4 130.0921 4.044322 34543.96
5 139.9577 5.023823 53086.10
6 150.0170 6.061794 13929.27
7 160.0884 6.828779 60905.61
8 170.0440 7.932000 66627.20
9 180.0872 9.116425 44587.62
10 189.9694 9.834091 51186.03
I want to merge all rows from df1 and df2 that differ by <= 0.1 in the rt domain and by <= 0.05 in the mz domain. In addition, if two or more rows fulfill these criteria, the row with the smallest distance across both domains should be merged (an additional calculation may be necessary, e.g. distance = sqrt(mz_diff^2 + rt_diff^2)), and the remaining rows have to find a different merging partner, if one exists. If there is no merging partner, keep the row and fill the missing values with NA.
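The selection rule above can be sketched in base R. This is only an illustration with made-up toy vectors, showing the candidate test and the combined distance, not the full one-to-one assignment logic:

```r
# Toy data: two rows per frame (hypothetical values for illustration)
mz1 <- c(100, 110); rt1 <- c(1, 2)
mz2 <- c(100.02, 110.04); rt2 <- c(1.05, 2.08)

d_mz <- abs(outer(mz1, mz2, "-"))       # pairwise mz differences
d_rt <- abs(outer(rt1, rt2, "-"))       # pairwise rt differences
ok   <- d_rt <= 0.1 & d_mz <= 0.05      # candidate pairs per the criteria
dist <- sqrt(d_mz^2 + d_rt^2)           # combined distance for tie-breaking
dist[!ok] <- Inf                        # non-candidates can never win
best <- apply(dist, 1, which.min)       # closest df2 partner for each df1 row
best[apply(dist, 1, min) == Inf] <- NA  # no candidate at all -> NA
```

Each entry of `best` gives the index of the closest df2 row within tolerance, or NA if none qualifies.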
What I have tried so far:
merge.data.frame(df1, df2, by.x = c("mz1", "rt1"), by.y = c("mz2", "rt2") , all = T)
mz1 rt1 value1 rt2 value2
1 100.0000 1 44605.646 NA NA
2 100.0243 NA NA 1.043092 58099.49
3 110.0000 2 13924.598 NA NA
4 110.0514 NA NA 2.164753 76397.67
5 120.0000 3 35727.265 NA NA
6 120.0258 NA NA 2.838141 43901.05
7 130.0000 4 75175.652 NA NA
8 130.0921 NA NA 4.044322 34543.96
9 139.9577 NA NA 5.023823 53086.10
10 140.0000 5 25221.724 NA NA
11 150.0000 6 29080.653 NA NA
12 150.0170 NA NA 6.061794 13929.27
13 160.0000 7 3170.749 NA NA
14 160.0884 NA NA 6.828779 60905.61
15 170.0000 8 10184.708 NA NA
16 170.0440 NA NA 7.932000 66627.20
17 180.0000 9 48055.072 NA NA
18 180.0872 NA NA 9.116425 44587.62
19 189.9694 NA NA 9.834091 51186.03
20 190.0000 10 77644.865 NA NA
This at least gives me a data frame in the right format, which contains NA's where no merging was possible.
It would be awesome if someone could help me with this problem!
Greetings
Update
Alright, I will keep that in mind. Thank you so far. I have tried the following as an idea:
#select data in joined which has no partner
no_match_df1 <- anti_join(joined, df2)
no_match_df1 <- no_match_df1[1:3]
#select data in df2 which has been excluded due to duplication
collist <- c("mz2", "rt2", "value2")
duplicates <- joined[complete.cases(joined[collist]), collist]
duplicates <- anti_join(df2, duplicates)
#repetition for joining
joined2 <- fuzzy_join(no_match_df1, duplicates, multi_by = c("mz1" = "mz2", "rt1" = "rt2"),
multi_match_fun = mmf, mode = "full")
joined2 <- group_by(joined2, mz1, rt1) %>%
mutate(min_dist = min(dist))
head(joined2)
joined2 <- filter(joined2, dist == min_dist | is.na(dist)) %>%
select(-dist, -min_dist)
head(joined2)
#select only rows with a new match or where duplicates couldn't find a partner
add <- subset(joined2, (!is.na(joined2$mz1) & !is.na(joined2$mz2)) |
                       (is.na(joined2$mz1) & !is.na(joined2$mz2)))
#add to joined
##I need some help here, how can I update the existing joined data frame?
Maybe we can join the no_match_df1 with the duplicates as we have done before and just add the results by overwriting the particular rows in the existing joined data frame.
Finally, we have to repeat that process as long as duplicates still contains rows.
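One way to sidestep the repeated joining altogether is a greedy assignment: list every candidate pair once, sort by combined distance, and take pairs in order, skipping rows that are already matched. A base-R sketch with made-up toy values (not the fuzzyjoin route):

```r
# Toy data: df1 row 1 has two candidate partners in df2, df1 row 2 has none
mz1 <- c(180, 100); rt1 <- c(9, 1)
mz2 <- c(179.97, 180.001); rt2 <- c(8.92, 9.09)

d_mz <- abs(outer(mz1, mz2, "-"))
d_rt <- abs(outer(rt1, rt2, "-"))
cand <- which(d_rt <= 0.1 & d_mz <= 0.05, arr.ind = TRUE)  # candidate (i, j) pairs
dist <- sqrt(d_mz^2 + d_rt^2)[cand]
cand <- cand[order(dist), , drop = FALSE]                  # closest pairs first

match2 <- rep(NA_integer_, length(mz1))  # df2 partner chosen for each df1 row
taken  <- logical(length(mz2))           # df2 rows already used
for (k in seq_len(nrow(cand))) {
  i <- cand[k, "row"]; j <- cand[k, "col"]
  if (is.na(match2[i]) && !taken[j]) {   # both sides still free?
    match2[i] <- j
    taken[j]  <- TRUE
  }
}
match2  # NA where a df1 row found no partner
```

Because every pair is considered exactly once in order of increasing distance, no repetition is needed; leftover df2 rows are simply those with `taken == FALSE`.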
Upvotes: 3
Views: 5604
Reputation: 15927
Following the advice by joran, I found a solution using the fuzzyjoin
package. I created the data sets as follows:
set.seed(123)
mz1 <- c(seq(100, 190, by = 10))
rt1 <- c(seq(1, 10, by = 1))
value1 <- runif(10, min = 100, max = 100000)
mz2 <- mz1 + runif(10, -0.1, 0.1)
rt2 <- rt1 + runif(10, -0.2, 0.2)
value2 <- runif(10, min = 100, max = 100000)
df1 <- as.data.frame(cbind(mz1, rt1, value1))
df2 <- as.data.frame(cbind(mz2, rt2, value2))
(A little side remark: you made an excellent reproducible example. The only weakness is that you did not set a seed; that is the only difference between the code above and yours.)
To make sure that there is a case where two matches are found, I add an additional row to df2:
df2 <- rbind(df2, c(180.001, 9.09, 0))
Now, I can use the function fuzzy_join()
to merge the data frames:
library(fuzzyjoin)
joined <- fuzzy_join(df1, df2, multi_by = c("mz1" = "mz2", "rt1" = "rt2"),
multi_match_fun = mmf, mode = "full")
Note that the syntax is very similar to the join functions from dplyr. There is a crucial difference, however: you can provide a function for multi_match_fun, which determines whether two rows match. This function returns a data frame whose first column must be logical; that column decides whether two rows match. All the other columns are simply added to the resulting data frame. I defined this function as follows:
mmf <- function(x, y) {
  # x and y hold the (mz, rt) columns of df1 and df2, respectively
  mz_dist <- abs(x[, 1] - y[, 1])
  rt_dist <- abs(x[, 2] - y[, 2])
  # first column: do the rows match? second column: combined distance
  out <- data.frame(merge = rt_dist <= 0.1 & mz_dist <= 0.05,
                    dist = sqrt(mz_dist^2 + rt_dist^2))
  return(out)
}
You can see that the column merge
(the name is arbitrary) is TRUE
exactly if the conditions you specified are satisfied. In addition, a column containing the distance is added for later use. I set mode = "full"
in order to have the NA
values, if there is no match.
The result looks as follows:
head(joined)
## mz1 rt1 value1 mz2 rt2 value2 dist
## 1 110 2 78851.68 109.9907 2.077121 90239.67 0.07768406
## 2 120 3 40956.79 120.0355 3.056203 69101.46 0.06648308
## 3 180 9 55188.36 179.9656 8.915664 31886.28 0.09108803
## 4 180 9 55188.36 180.0010 9.090000 0.00 0.09000556
## 5 100 1 28828.99 NA NA NA NA
## 6 130 4 88313.44 NA NA NA NA
In rows 3 and 4 you can see that there were indeed two matches in this case, and from the column dist
you can see that row 4 is the one we want to keep. This means that row 3 should be treated as not finding a match, and the columns mz1
, rt1
, and value1
should be filled with NA
. I did this by grouping the rows by mz1
and rt1
and then adding the minimum distance for each group:
library(dplyr)
joined <- group_by(joined, mz1, rt1) %>%
mutate(min_dist = min(dist))
head(joined)
## Source: local data frame [6 x 8]
## Groups: mz1, rt1 [5]
##
## mz1 rt1 value1 mz2 rt2 value2 dist min_dist
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 110 2 78851.68 109.9907 2.077121 90239.67 0.07768406 0.07768406
## 2 120 3 40956.79 120.0355 3.056203 69101.46 0.06648308 0.06648308
## 3 180 9 55188.36 179.9656 8.915664 31886.28 0.09108803 0.09000556
## 4 180 9 55188.36 180.0010 9.090000 0.00 0.09000556 0.09000556
## 5 100 1 28828.99 NA NA NA NA NA
## 6 130 4 88313.44 NA NA NA NA NA
The rows with valid matches are all those where dist
is the same as min_dist
. In addition, we should also not lose the rows where dist
is NA
. This can be done as follows:
dbls <- which(joined$dist != joined$min_dist)
joined[dbls, c("mz1", "rt1", "value1")] <- NA
joined <- select(joined, -dist, -min_dist)
head(joined)
## Source: local data frame [6 x 6]
## Groups: mz1, rt1 [6]
##
## mz1 rt1 value1 mz2 rt2 value2
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 110 2 78851.68 109.9907 2.077121 90239.67
## 2 120 3 40956.79 120.0355 3.056203 69101.46
## 3 NA NA NA 179.9656 8.915664 31886.28
## 4 180 9 55188.36 180.0010 9.090000 0.00
## 5 100 1 28828.99 NA NA NA
## 6 130 4 88313.44 NA NA NA
Depending on how your data looks, it might also happen that in the case of a double match the values of mz1
and rt1
do not agree, but another pair of values (e.g. mz2 and rt2) does. You would then have to repeat the above step with the other grouping as well.
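For illustration, the analogous step for a double match on the df2 side could look like this in base R, using a tiny hypothetical joined table; ave() plays the role of the grouped mutate():

```r
# Hypothetical 'joined' where two df1 rows matched the same df2 row
joined <- data.frame(
  mz1 = c(180.0, 180.2), rt1 = c(9.0, 9.2), value1 = c(1, 2),
  mz2 = c(180.001, 180.001), rt2 = c(9.09, 9.09), value2 = c(5, 5),
  dist = c(0.090, 0.085)
)
# group by the (mz2, rt2) pair and compute the minimum distance per group
key      <- paste(joined$mz2, joined$rt2)
min_dist <- ave(joined$dist, key, FUN = min)
# blank the df2 columns in the rows that lost the tie
dbls <- which(joined$dist != min_dist)
joined[dbls, c("mz2", "rt2", "value2")] <- NA
```

After this step, the df2 row appears only once with its closest df1 partner, and the losing row keeps its df1 values with NA on the df2 side.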
Upvotes: 2