Marco

Reputation: 53

How to merge two data frames based on similar values in R

I'm relatively new to R and I have a question about merging two data frames that contain similar, but not identical, numeric data in two domains (mz and rt). Here is an example that describes my problem:

mz1    <- c(seq(100, 190, by = 10))
rt1    <- c(seq(1, 10, by = 1))
value1 <- runif(10, min = 100, max = 100000)
mz2    <- mz1 + runif(10, -0.1, 0.1)
rt2    <- rt1 + runif(10, -0.2, 0.2)
value2 <- runif(10, min = 100, max = 100000)

df1 <- as.data.frame(cbind(mz1, rt1, value1))
df2 <- as.data.frame(cbind(mz2, rt2, value2))


df1
   mz1 rt1    value1
1  100   1 44605.646
2  110   2 13924.598
3  120   3 35727.265
4  130   4 75175.652
5  140   5 25221.724
6  150   6 29080.653
7  160   7  3170.749
8  170   8 10184.708
9  180   9 48055.072
10 190  10 77644.865


df2
        mz2      rt2   value2
1  100.0243 1.043092 58099.49
2  110.0514 2.164753 76397.67
3  120.0258 2.838141 43901.05
4  130.0921 4.044322 34543.96
5  139.9577 5.023823 53086.10
6  150.0170 6.061794 13929.27
7  160.0884 6.828779 60905.61
8  170.0440 7.932000 66627.20
9  180.0872 9.116425 44587.62
10 189.9694 9.834091 51186.03

I want to merge all rows from df1 and df2 which have a difference <= 0.1 in the rt domain and a difference <= 0.05 in the mz domain. In addition, if two or more rows fulfill these criteria, the row with the smallest distance across both domains should be merged (maybe an additional calculation is necessary: distance = sqrt(mz^2 + rt^2)), and the remaining rows have to find a different merging partner, if one exists. If there is no merging partner, keep the row and fill the missing values with NA.
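To make the matching rule concrete, here is a minimal base-R sketch of it (closest_match is a hypothetical helper name; it matches greedily per row of df1 and does not yet resolve the case where one df2 row is claimed by two df1 rows):

```r
# Sketch: for each row of df1, find the index of the closest df2 row
# within the tolerances (|mz| <= 0.05, |rt| <= 0.1); NA if none qualifies.
closest_match <- function(df1, df2, mz_tol = 0.05, rt_tol = 0.1) {
  sapply(seq_len(nrow(df1)), function(i) {
    mz_d <- abs(df2$mz2 - df1$mz1[i])
    rt_d <- abs(df2$rt2 - df1$rt1[i])
    ok   <- mz_d <= mz_tol & rt_d <= rt_tol
    if (!any(ok)) return(NA_integer_)
    dist <- sqrt(mz_d^2 + rt_d^2)   # combined distance over both domains
    dist[!ok] <- Inf                # rule out rows outside the tolerances
    which.min(dist)                 # index of the closest qualifying row
  })
}
```
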

What I have tried so far:

merge.data.frame(df1, df2, by.x = c("mz1", "rt1"), by.y = c("mz2", "rt2") , all = T)

        mz1 rt1    value1      rt2   value2
1  100.0000   1 44605.646       NA       NA
2  100.0243  NA        NA 1.043092 58099.49
3  110.0000   2 13924.598       NA       NA
4  110.0514  NA        NA 2.164753 76397.67
5  120.0000   3 35727.265       NA       NA
6  120.0258  NA        NA 2.838141 43901.05
7  130.0000   4 75175.652       NA       NA
8  130.0921  NA        NA 4.044322 34543.96
9  139.9577  NA        NA 5.023823 53086.10
10 140.0000   5 25221.724       NA       NA
11 150.0000   6 29080.653       NA       NA
12 150.0170  NA        NA 6.061794 13929.27
13 160.0000   7  3170.749       NA       NA
14 160.0884  NA        NA 6.828779 60905.61
15 170.0000   8 10184.708       NA       NA
16 170.0440  NA        NA 7.932000 66627.20
17 180.0000   9 48055.072       NA       NA
18 180.0872  NA        NA 9.116425 44587.62
19 189.9694  NA        NA 9.834091 51186.03
20 190.0000  10 77644.865       NA       NA

This at least gives me a data frame in the right format, with NAs where no merge was possible.

It would be awesome if someone could help me with this problem!

Greetings

Update

Alright, I will keep that in mind. Thank you so far. I have tried the following as an idea:

#select data in joined which has no partner
no_match_df1 <- anti_join(joined, df2)
no_match_df1 <- no_match_df1[1:3]

#select data in df2 which has been excluded due to duplication
collist <- c("mz2", "rt2", "value2")
dublicates <- joined[complete.cases(joined[collist]), collist]
dublicates <- anti_join(df2, dublicates)


#repetition for joining
joined2 <- fuzzy_join(no_match_df1, dublicates, multi_by = c("mz1" = "mz2", "rt1" = "rt2"),
                     multi_match_fun = mmf, mode = "full")

joined2 <- group_by(joined2, mz1, rt1) %>%
  mutate(min_dist = min(dist))
head(joined2)

joined2 <- filter(joined2, dist == min_dist | is.na(dist)) %>%
  select(-dist, -min_dist)
head(joined2)

#select only rows with new match or where dublicates couldn't find a partner

add <- subset(joined2, !is.na(joined2$mz2))

#add to joined
##I need some help here, how can I update the existing joined data frame?
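One possible answer to the question in the comment, sketched as a small helper (update_rows is a hypothetical name; it assumes that the pair (mz1, rt1) uniquely identifies a row in joined):

```r
# Sketch: overwrite rows of `joined` with the corresponding rows of `add`,
# matched on the (mz1, rt1) key; assumes the key is unique in `joined`.
update_rows <- function(joined, add) {
  idx <- match(paste(add$mz1, add$rt1), paste(joined$mz1, joined$rt1))
  hit <- !is.na(idx)                        # keys of `add` present in `joined`
  joined[idx[hit], ] <- add[hit, , drop = FALSE]
  joined
}
```

Rows of add whose key does not occur in joined (e.g. dublicates that never find a partner) are ignored here; they could be appended with rbind() instead.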

Maybe this helps

Maybe we can join the no_match_df1 with the dublicates as we have done before and just add the results by overwriting the particular rows in the existing joined data frame. Finally, we would have to repeat that process as long as the length of dublicates is > 1.

Upvotes: 3

Views: 5604

Answers (1)

Stibu

Reputation: 15927

Following the advice by joran, I found a solution using the fuzzyjoin package. I created the data sets as follows:

set.seed(123)
mz1    <- c(seq(100, 190, by = 10))
rt1    <- c(seq(1, 10, by = 1))
value1 <- runif(10, min = 100, max = 100000)
mz2    <- mz1 + runif(10, -0.1, 0.1)
rt2    <- rt1 + runif(10, -0.2, 0.2)
value2 <- runif(10, min = 100, max = 100000)

df1 <- as.data.frame(cbind(mz1, rt1, value1))
df2 <- as.data.frame(cbind(mz2, rt2, value2))

(A little side remark: you made an excellent reproducible example. The only weakness is that you did not set a seed, which is the only difference between the above and your code.)

To make sure that there is a case where two matches are found, I added an additional line to df2:

df2 <- rbind(df2, c(180.001, 9.09, 0))

Now, I can use the function fuzzy_join() to merge the data frames:

library(fuzzyjoin)
joined <- fuzzy_join(df1, df2, multi_by = c("mz1" = "mz2", "rt1" = "rt2"),
                     multi_match_fun = mmf, mode = "full")

Note that the syntax is very similar to the joins from dplyr. There is a crucial difference, however: you can provide a function multi_match_fun that determines whether two rows match. It must return a data frame whose first column is logical; this column decides whether two rows match. All the other columns are simply added to the resulting data frame. I defined this function as follows (note that it must be defined before the call to fuzzy_join() above):

mmf <- function(x, y) {
  # x and y hold the columns listed in multi_by, in that order
  mz_dist <- abs(x[, 1] - y[, 1])
  rt_dist <- abs(x[, 2] - y[, 2])

  # first column (logical): do the rows match? second column: the distance
  out <- data_frame(merge = rt_dist <= 0.1 & mz_dist <= 0.05,
                    dist = sqrt(mz_dist^2 + rt_dist^2))
  return(out)
}

You can see that the column merge (the name is arbitrary) is TRUE exactly when the conditions you specified are satisfied. In addition, a column containing the distance is added for later use. I set mode = "full" in order to get the NA values when there is no match.

The result looks as follows:

head(joined)
##   mz1 rt1   value1      mz2      rt2   value2       dist
## 1 110   2 78851.68 109.9907 2.077121 90239.67 0.07768406
## 2 120   3 40956.79 120.0355 3.056203 69101.46 0.06648308
## 3 180   9 55188.36 179.9656 8.915664 31886.28 0.09108803
## 4 180   9 55188.36 180.0010 9.090000     0.00 0.09000556
## 5 100   1 28828.99       NA       NA       NA         NA
## 6 130   4 88313.44       NA       NA       NA         NA

In rows 3 and 4 you can see that there were indeed two matches in this case. And the column dist shows that row 4 is the one we want to keep. This means that row 3 should be treated as having no match, and its columns mz1, rt1, and value1 should be filled with NA. I did this by grouping the rows by mz1 and rt1 and then adding the minimum distance for each group:

library(dplyr)
joined <- group_by(joined, mz1, rt1) %>%
            mutate(min_dist = min(dist))
head(joined)
## Source: local data frame [6 x 8]
## Groups: mz1, rt1 [5]
## 
##     mz1   rt1   value1      mz2      rt2   value2       dist   min_dist
##   <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>      <dbl>      <dbl>
## 1   110     2 78851.68 109.9907 2.077121 90239.67 0.07768406 0.07768406
## 2   120     3 40956.79 120.0355 3.056203 69101.46 0.06648308 0.06648308
## 3   180     9 55188.36 179.9656 8.915664 31886.28 0.09108803 0.09000556
## 4   180     9 55188.36 180.0010 9.090000     0.00 0.09000556 0.09000556
## 5   100     1 28828.99       NA       NA       NA         NA         NA
## 6   130     4 88313.44       NA       NA       NA         NA         NA

The rows with valid matches are all those where dist is the same as min_dist. In addition, we should also not lose the rows where dist is NA. This can be done as follows:

dbls <- which(joined$dist != joined$min_dist)
joined[dbls, c("mz1", "rt1", "value1")] <- NA
joined <- select(joined, -dist, -min_dist)
head(joined)
## Source: local data frame [6 x 6]
## Groups: mz1, rt1 [6]
## 
##     mz1   rt1   value1      mz2      rt2   value2
##   <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1   110     2 78851.68 109.9907 2.077121 90239.67
## 2   120     3 40956.79 120.0355 3.056203 69101.46
## 3    NA    NA       NA 179.9656 8.915664 31886.28
## 4   180     9 55188.36 180.0010 9.090000     0.00
## 5   100     1 28828.99       NA       NA       NA
## 6   130     4 88313.44       NA       NA       NA

Depending on what your data looks like, it might also happen that in the case of a double match the values of mz1 and rt1 do not agree, but another pair of values does. You would then have to repeat the above step with other groupings as well.
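That extra pass could be sketched like this (a sketch only: dedup_df2 and min_dist2 are hypothetical names, and it must run while the dist column is still present, i.e. before the select() above):

```r
library(dplyr)

# Sketch: same clean-up as before, but grouped by the df2 key. For each
# (mz2, rt2) pair matched by several df1 rows, keep only the closest df1
# partner and blank out the df2 columns of the losing rows.
dedup_df2 <- function(joined) {
  joined <- group_by(joined, mz2, rt2) %>%
    mutate(min_dist2 = min(dist)) %>%
    ungroup()
  dbls2 <- which(joined$dist != joined$min_dist2)  # NA dist rows are kept
  joined[dbls2, c("mz2", "rt2", "value2")] <- NA
  select(joined, -min_dist2)
}
```
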

Upvotes: 2
