wolferman

Reputation: 23

Merge data.frame in R

I have a question regarding a specific type of merge of data.frames in R (I found a lot of similar problems, but couldn't get any of the solutions to work for my specific case).

Suppose I have two data frames, each with two columns X1 and X2:

df1 =

            X1         X2
    1  '01.01.2000'    4
    2  '01.01.2001'    5
    3  '01.01.2002'    6

df2 =

            X1         X2
    1  '01.01.2002'    8
    2  '01.01.2003'    9
    3  '01.01.2004'    10

What I want is a merged dataframe according to the following rules:

  1. If a value in X1 is only in df1, use the value of X2 in df1
  2. If a value in X1 is in both df1 and df2 use the value of X2 from df2
  3. If a value in X1 is only in df2, use the value of X2 in df2

For df1 and df2 above, this would mean:

dfMerged =

            X1         X2
    1  '01.01.2000'    4
    2  '01.01.2001'    5
    3  '01.01.2002'    8
    4  '01.01.2003'    9
    5  '01.01.2004'    10

Currently, I'm using a very slow solution that merges first and then iterates over all rows (see the sketch below). I also tried various approaches using dplyr::union() etc., but couldn't find a proper solution. Any help is greatly appreciated!
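
For reference, my current approach looks roughly like this (a simplified sketch, not the exact code):

# merge everything first, then loop over the rows to pick the right X2
dfAll <- merge(df1, df2, by = "X1", all = TRUE)  # creates X2.x (from df1) and X2.y (from df2)
dfAll$X2 <- NA
for (i in seq_len(nrow(dfAll))) {
  # prefer the df2 value whenever it exists, otherwise fall back to df1
  dfAll$X2[i] <- if (!is.na(dfAll$X2.y[i])) dfAll$X2.y[i] else dfAll$X2.x[i]
}
dfMerged <- dfAll[, c("X1", "X2")]

This gives the correct result but becomes very slow for large data frames.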

Upvotes: 2

Views: 367

Answers (2)

thothal

Reputation: 20329

Data

df1 <- structure(list(X1 = c("01.01.2000", "01.01.2001", "01.01.2002"), 
                      X2 = 4:6), 
                 class = "data.frame", 
                 row.names = c(NA, -3L))

df2 <- structure(list(X1 = c("01.01.2002", "01.01.2003", "01.01.2004"), 
                      X2 = 8:10), 
                 class = "data.frame", 
                 row.names = c(NA, -3L))

Code

library(dplyr)
full_join(df1, df2, by = "X1") %>%
    mutate(X2 = case_when(!is.na(X2.x) & !is.na(X2.y) ~ X2.y, 
                          is.na(X2.y)                 ~ X2.x, 
                          is.na(X2.x)                 ~ X2.y)) %>% 
    select(X1, X2)

Explanation

  1. First, you do a full_join on both data sets with X1 as the joining column. This creates columns X2.x and X2.y, which hold the X2 values from the respective data sets (the intermediate result for the sample data is shown below).
  2. Then it is just a straightforward application of mutate with case_when to pick the right column according to the rules you gave.
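
For the sample data, the intermediate full_join result looks like this:

full_join(df1, df2, by = "X1")
#           X1 X2.x X2.y
# 1 01.01.2000    4   NA
# 2 01.01.2001    5   NA
# 3 01.01.2002    6    8
# 4 01.01.2003   NA    9
# 5 01.01.2004   NA   10

As an aside, the three case_when branches can also be written more compactly as coalesce(X2.y, X2.x), which takes X2.y wherever it is not NA and falls back to X2.x otherwise.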

Benchmark

The distinct solution is consistently faster, by roughly a factor of 3, as the following benchmark shows:

library(tidyverse)
library(microbenchmark)
make_data_frame <- function(n, percent_matching = .1) {
   ids_a <- ids_b <- paste0("ID_", seq.int(n))
   non_matching_ids <- sample(n, round(n * (1 - percent_matching), 0))
   ids_b[non_matching_ids] <- paste(ids_b[non_matching_ids], "b", sep = "_")
   list(A = data.frame(X1 = ids_a, X2 = "a", stringsAsFactors = FALSE),
        B = data.frame(X1 = ids_b, X2 = "b", stringsAsFactors = FALSE))
}

.distinct <- function(dfs) {
   bind_rows(dfs$B, dfs$A) %>% 
      distinct(X1, .keep_all = TRUE)
}

.join <- function(dfs) {
   full_join(dfs$A, dfs$B, by = "X1") %>%
      mutate(X2 = case_when(!is.na(X2.x) & !is.na(X2.y) ~ X2.y, 
                            is.na(X2.y)                 ~ X2.x, 
                            is.na(X2.x)                 ~ X2.y))
}
scenarios <- expand.grid(n = c(1e4, 1e5, 1e6),
                         percent_matching = c(.1, .5, .9))
all_data <- pmap(scenarios, make_data_frame)
all_mb <- map(all_data, ~ microbenchmark(.distinct(.x), .join(.x)))
map_dfr(seq.int(NROW(scenarios)), function(i) {
   mdat <- scenarios[i, ]
   my_summary <- summary(all_mb[[i]])
   rownames(mdat) <- NULL
   rownames(my_summary) <- NULL
   cbind(mdat, my_summary)
}) %>%
select(n, percent_matching, expr, mean)

#        n percent_matching          expr        mean
# 1  1e+04              0.1 .distinct(.x)    4.975013
# 2  1e+04              0.1     .join(.x)   12.587072
# 3  1e+05              0.1 .distinct(.x)   59.577142
# 4  1e+05              0.1     .join(.x)  149.987451
# 5  1e+06              0.1 .distinct(.x)    1.158597
# 6  1e+06              0.1     .join(.x)    2.699003
# 7  1e+04              0.5 .distinct(.x)    4.485196
# 8  1e+04              0.5     .join(.x)   11.902656
# 9  1e+05              0.5 .distinct(.x)   46.660016
# 10 1e+05              0.5     .join(.x)  132.180758
# 11 1e+06              0.5 .distinct(.x)  913.503111
# 12 1e+06              0.5     .join(.x) 2148.531600
# 13 1e+04              0.9 .distinct(.x)    4.299905
# 14 1e+04              0.9     .join(.x)   12.731292
# 15 1e+05              0.9 .distinct(.x)   37.558069
# 16 1e+05              0.9     .join(.x)  111.428117
# 17 1e+06              0.9 .distinct(.x)  458.030035
# 18 1e+06              0.9     .join(.x) 1458.408847

Upvotes: 3

Teun

Reputation: 579

You could use the following. It simply row-binds the data frames and, in case of duplicates (based on X1), drops the row from df1.

library(dplyr)
df1 <- data.frame(X1 = c("01.01.2000", "01.01.2001", "01.01.2002"),
                  X2 = c(4, 5, 6), stringsAsFactors = FALSE)
df2 <- data.frame(X1 = c("01.01.2002", "01.01.2003", "01.01.2004"),
                  X2 = c(8, 9, 10), stringsAsFactors = FALSE)

dfMerged <- bind_rows(df2, df1) %>% 
  distinct(X1, .keep_all = TRUE) %>% 
  arrange(X1, X2)
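
With the sample data this reproduces the expected result:

dfMerged
#           X1 X2
# 1 01.01.2000  4
# 2 01.01.2001  5
# 3 01.01.2002  8
# 4 01.01.2003  9
# 5 01.01.2004 10

Note that arrange(X1) sorts the dates as character strings; that happens to be chronological here because only the year differs, but for general dd.mm.yyyy dates you would want to convert with as.Date(X1, format = "%d.%m.%Y") before sorting.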

Upvotes: 4
