Reputation: 23
I have a question regarding a specific type of merge of data.frames in R (I found a lot of similar problems, but couldn't get any of them to work for my specific case).
Suppose I have two data frames, each with two columns X1 and X2:
df1 =
X1 X2
1 '01.01.2000' 4
2 '01.01.2001' 5
3 '01.01.2002' 6
df2 =
X1 X2
1 '01.01.2002' 8
2 '01.01.2003' 9
3 '01.01.2004' 10
What I want is a merged dataframe according to the following rules:
If X1 is only in df1, use the value of X2 in df1.
If X1 is in both df1 and df2, use the value of X2 from df2.
If X1 is only in df2, use the value of X2 in df2.
For df1 and df2 above, this would mean:
dfMerged =
X1 X2
1 '01.01.2000' 4
2 '01.01.2001' 5
3 '01.01.2002' 8
4 '01.01.2003' 9
5 '01.01.2004' 10
Currently, I'm using a very slow solution that merges first and then iterates over all rows. I also tried various approaches using dplyr::union etc., but couldn't find a proper solution. Any help is greatly appreciated!
Upvotes: 2
Views: 367
Reputation: 20329
Data
df1 <- structure(list(X1 = c("01.01.2000", "01.01.2001", "01.01.2002"),
                      X2 = 4:6),
                 class = "data.frame",
                 row.names = c(NA, -3L))

df2 <- structure(list(X1 = c("01.01.2002", "01.01.2003", "01.01.2004"),
                      X2 = 8:10),
                 class = "data.frame",
                 row.names = c(NA, -3L))
Code
library(dplyr)
full_join(df1, df2, by = "X1") %>%
  mutate(X2 = case_when(!is.na(X2.x) & !is.na(X2.y) ~ X2.y,
                        is.na(X2.y) ~ X2.x,
                        is.na(X2.x) ~ X2.y)) %>%
  select(X1, X2)
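Applied to the example data, this returns the requested dfMerged (output shown as comments):

#           X1 X2
# 1 01.01.2000  4
# 2 01.01.2001  5
# 3 01.01.2002  8
# 4 01.01.2003  9
# 5 01.01.2004 10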
Explanation
full_join on both data sets, with X1 as the joining column. This creates the columns X2.x and X2.y, which will bear the X2 values of the respective data sets.
mutate to select the right column depending on the rules you gave.
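As a side note, the case_when() call can be written more compactly with dplyr::coalesce(), which picks the first non-missing value at each position; a sketch of the same logic:

full_join(df1, df2, by = "X1") %>%
  mutate(X2 = coalesce(X2.y, X2.x)) %>%  # prefer df2's X2, fall back to df1's
  select(X1, X2)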
Benchmark
The distinct solution (bind_rows() followed by distinct(), as in the other answer) is consistently faster, roughly by a factor of 3, as the following benchmark shows:
library(tidyverse)
library(microbenchmark)
make_data_frame <- function(n, percent_matching = .1) {
  ids_a <- ids_b <- paste0("ID_", seq.int(n))
  non_matching_ids <- sample(n, round(n * (1 - percent_matching), 0))
  ids_b[non_matching_ids] <- paste(ids_b[non_matching_ids], "b", sep = "_")
  list(A = data.frame(X1 = ids_a, X2 = "a", stringsAsFactors = FALSE),
       B = data.frame(X1 = ids_b, X2 = "b", stringsAsFactors = FALSE))
}

.distinct <- function(dfs) {
  bind_rows(dfs$B, dfs$A) %>%
    distinct(X1, .keep_all = TRUE)
}

.join <- function(dfs) {
  full_join(dfs$A, dfs$B, by = "X1") %>%
    mutate(X2 = case_when(!is.na(X2.x) & !is.na(X2.y) ~ X2.y,
                          is.na(X2.y) ~ X2.x,
                          is.na(X2.x) ~ X2.y))
}

scenarios <- expand.grid(n = c(1e4, 1e5, 1e6),
                         percent_matching = c(.1, .5, .9))

all_data <- pmap(scenarios, make_data_frame)
all_mb <- map(all_data, ~ microbenchmark(.distinct(.x), .join(.x)))

map_dfr(seq.int(NROW(scenarios)), function(i) {
  mdat <- scenarios[i, ]
  my_summary <- summary(all_mb[[i]])
  rownames(mdat) <- NULL
  rownames(my_summary) <- NULL
  cbind(mdat, my_summary)
}) %>%
  select(n, percent_matching, expr, mean)
# n percent_matching expr mean
# 1 1e+04 0.1 .distinct(.x) 4.975013
# 2 1e+04 0.1 .join(.x) 12.587072
# 3 1e+05 0.1 .distinct(.x) 59.577142
# 4 1e+05 0.1 .join(.x) 149.987451
# 5 1e+06 0.1 .distinct(.x) 1.158597
# 6 1e+06 0.1 .join(.x) 2.699003
# 7 1e+04 0.5 .distinct(.x) 4.485196
# 8 1e+04 0.5 .join(.x) 11.902656
# 9 1e+05 0.5 .distinct(.x) 46.660016
# 10 1e+05 0.5 .join(.x) 132.180758
# 11 1e+06 0.5 .distinct(.x) 913.503111
# 12 1e+06 0.5 .join(.x) 2148.531600
# 13 1e+04 0.9 .distinct(.x) 4.299905
# 14 1e+04 0.9 .join(.x) 12.731292
# 15 1e+05 0.9 .distinct(.x) 37.558069
# 16 1e+05 0.9 .join(.x) 111.428117
# 17 1e+06 0.9 .distinct(.x) 458.030035
# 18 1e+06 0.9 .join(.x) 1458.408847
Upvotes: 3
Reputation: 579
You could use the following. It simply row-binds the data frames and, in case of duplicates (based on X1), removes the row from df1.
library(dplyr)
df1 <- data.frame(X1 = c("01.01.2000", "01.01.2001", "01.01.2002"),
                  X2 = c(4, 5, 6), stringsAsFactors = FALSE)
df2 <- data.frame(X1 = c("01.01.2002", "01.01.2003", "01.01.2004"),
                  X2 = c(8, 9, 10), stringsAsFactors = FALSE)

dfMerged <- bind_rows(df2, df1) %>%
  distinct(X1, .keep_all = TRUE) %>%
  arrange(X1, X2)
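The same idea can also be sketched in base R if you want to avoid the dplyr dependency (duplicated() keeps the first occurrence, so binding df2 first makes its rows win on ties):

combined <- rbind(df2, df1)
dfMerged <- combined[!duplicated(combined$X1), ]  # keep df2's row when X1 is in both
dfMerged <- dfMerged[order(dfMerged$X1), ]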
Upvotes: 4