Reputation: 411
I have two dataframe df1 and df2, i want to create a new dataframe by looking at the union of the two dataset. If a particular column has value 1 in both dataset, the new dataset should have value 1 for that particular column.
df1 = data.frame( V1 = letters[1:5], V2 = c("0","1","1","0","1"), V3 = c("0","0","0","0","1"), V4 =c("1","1","1","1","1"), V5 = c("0","0","0","0","1"),V6 =c("1","1","1","0","0"))
df2 = data.frame( V1 = letters[1:5], V2 = c("1","1","1","0","0"), V3 = c("1","0","0","0","1"), V4 =c("0","0","1","0","1"), V5 = c("1","0","0","0","1"))
result = data.frame( V1 = letters[1:5], V2 = c("1","1","1","0","1"), V3 = c("1","0","0","0","1"), V4 =c("1","1","1","1","1"), V5 = c("1","0","0","0","1"),V6 =c("1","1","1","0","0"))
Upvotes: 1
Views: 61
Reputation: 3200
Here is my first attempt; although I'm sure this can be improved:
library(tidyverse)
set.seed(345)
df1 <- tibble(
V1 = letters[1:5],
V2 = sample(c(0,1), 5, replace = TRUE),
V3 = sample(c(0,1), 5, replace = TRUE)
)
df2 <- tibble(
V1 = letters[1:5],
V2 = sample(c(0,1), 5, replace = TRUE),
V3 = sample(c(0,1), 5, replace = TRUE)
)
df1
# A tibble: 5 x 3
V1 V2 V3
<chr> <dbl> <dbl>
1 a 0 1
2 b 0 0
3 c 0 1
4 d 1 0
5 e 0 0
df2
# A tibble: 5 x 3
V1 V2 V3
<chr> <dbl> <dbl>
1 a 0 0
2 b 1 1
3 c 0 0
4 d 1 1
5 e 1 1
The draft solution:
result <- df1 %>%
left_join(df2, by = "V1") %>%
rowwise() %>%
mutate(
V2 = max(V2.x, V2.y),
V3 = max(V3.x, V3.y)
) %>%
select(V1, V2, V3)
result
# A tibble: 5 x 3
V1 V2 V3
<chr> <dbl> <dbl>
1 a 0 1
2 b 1 1
3 c 0 1
4 d 1 1
5 e 1 1
Obviously, if you have a large number of variables, this becomes a less ideal answer.
UPDATE:
Here is how to make the solution even more general for an arbitrary number of columns:
df1 %>%
select(V1) %>%
bind_cols(
map2_df(
.x = df1[-1],
.y = df2[-1],
.f = ~ map2_dbl(.x, .y, max)
)
)
# A tibble: 5 x 3
V1 V2 V3
<chr> <dbl> <dbl>
1 a 0 1
2 b 1 1
3 c 0 1
4 d 1 1
5 e 1 1
This is how it works:
We can supply map2_dbl with one pair of vectors and find the max of the two vectors at each position like so:
map2_dbl(
.x = c(0, 0, 0, 1, 0),
.y = c(0, 1, 0, 1, 1),
.f = max
)
[1] 0 1 0 1 1
That will become the inner-most portion of the solution. Now, we just need to figure out how to pass in all pairs of variables from both data frames iteratively to the map2_dbl above. This silly example shows how it works:
map2(
.x = df1[-1],
.y = df2[-1],
.f = function(x = .x, y = .y) {
cat("x = ", x, "y = ", y, "\n")
}
)
x = 0 0 0 1 0 y = 0 1 0 1 1
x = 1 0 1 0 0 y = 0 1 0 1 1
$V2
NULL
$V3
NULL
Notice that in the first pass x = df1$V2 and y = df2$V2. In the second iteration x = df1$V3 and y = df2$V3. That is exactly what we want.
We could use three steps to get our final solution:
x1 <- df1 %>%
select(V1)
x2 <- map2_df(
.x = df1[-1],
.y = df2[-1],
.f = function(x = .x, y = .y) {
map2_dbl(x, y, max)
}
)
bind_cols(x1, x2)
# A tibble: 5 x 3
V1 V2 V3
<chr> <dbl> <dbl>
1 a 0 1
2 b 1 1
3 c 0 1
4 d 1 1
5 e 1 1
OR, we can combine these steps into one pipeline:
df1 %>%
select(V1) %>%
bind_cols(
map2_df(
.x = df1[-1],
.y = df2[-1],
.f = ~ map2_dbl(.x, .y, max)
)
)
# A tibble: 5 x 3
V1 V2 V3
<chr> <dbl> <dbl>
1 a 0 1
2 b 1 1
3 c 0 1
4 d 1 1
5 e 1 1
Upvotes: 3