beginner
beginner

Reputation: 411

compare two dataset and create a new dataframe by looking at the union of both dataset

I have two dataframe df1 and df2, i want to create a new dataframe by looking at the union of the two dataset. If a particular column has value 1 in both dataset, the new dataset should have value 1 for that particular column.

    df1 = data.frame( V1 = letters[1:5], V2 = c("0","1","1","0","1"), V3 = c("0","0","0","0","1"), V4 =c("1","1","1","1","1"), V5 = c("0","0","0","0","1"),V6 =c("1","1","1","0","0"))

    df2 = data.frame( V1 = letters[1:5], V2 = c("1","1","1","0","0"), V3 = c("1","0","0","0","1"), V4 =c("0","0","1","0","1"), V5 = c("1","0","0","0","1"))

   result = data.frame( V1 = letters[1:5], V2 = c("1","1","1","0","1"), V3 = c("1","0","0","0","1"), V4 =c("1","1","1","1","1"), V5 = c("1","0","0","0","1"),V6 =c("1","1","1","0","0"))

Upvotes: 1

Views: 61

Answers (1)

Brad Cannell
Brad Cannell

Reputation: 3200

Here is my first attempt; although I'm sure this can be improved:

library(tidyverse)

set.seed(345)

df1 <- tibble(
  V1 = letters[1:5],
  V2 = sample(c(0,1), 5, replace = TRUE),
  V3 = sample(c(0,1), 5, replace = TRUE)
)

df2 <- tibble(
  V1 = letters[1:5],
  V2 = sample(c(0,1), 5, replace = TRUE),
  V3 = sample(c(0,1), 5, replace = TRUE)
)

df1

# A tibble: 5 x 3
     V1    V2    V3
  <chr> <dbl> <dbl>
1     a     0     1
2     b     0     0
3     c     0     1
4     d     1     0
5     e     0     0

df2

# A tibble: 5 x 3
     V1    V2    V3
  <chr> <dbl> <dbl>
1     a     0     0
2     b     1     1
3     c     0     0
4     d     1     1
5     e     1     1

The draft solution:

result <- df1 %>% 
  left_join(df2, by = "V1") %>% 
  rowwise() %>% 
  mutate(
    V2 = max(V2.x, V2.y),
    V3 = max(V3.x, V3.y)
  ) %>% 
  select(V1, V2, V3)

result

# A tibble: 5 x 3
     V1    V2    V3
  <chr> <dbl> <dbl>
1     a     0     1
2     b     1     1
3     c     0     1
4     d     1     1
5     e     1     1

Obviously, if you have a large number of variables, this becomes a less ideal answer.

UPDATE:

Here is how to make the solution even more general for an arbitrary number of columns:

df1 %>% 
  select(V1) %>% 
  bind_cols( 
    map2_df(
      .x = df1[-1],
      .y = df2[-1], 
      .f = ~ map2_dbl(.x, .y, max)
    )
  )
# A tibble: 5 x 3
     V1    V2    V3
  <chr> <dbl> <dbl>
1     a     0     1
2     b     1     1
3     c     0     1
4     d     1     1
5     e     1     1

This is how it works:

We can supply map2_dbl with one pair of vectors and find the max of the two vectors at each position like so:

map2_dbl(
  .x = c(0, 0, 0, 1, 0), 
  .y = c(0, 1, 0, 1, 1), 
  .f = max
)

[1] 0 1 0 1 1

That will become the inner-most portion of the solution. Now, we just need to figure out how to pass in all pairs of variables from both data frames iteratively to the map2_dbl above. This silly example shows how it works:

map2(
  .x = df1[-1], 
  .y = df2[-1], 
  .f = function(x = .x, y = .y) {
    cat("x = ", x, "y = ", y, "\n")
  }
)

x =  0 0 0 1 0 y =  0 1 0 1 1 
x =  1 0 1 0 0 y =  0 1 0 1 1 
$V2
NULL

$V3
NULL

Notice that in the first pass x = df1$V2 and y = df2$V2. In the second iteration x = df1$V3 and y = df2$V3. That is exactly what we want.

We could use three steps to get our final solution:

x1 <- df1 %>% 
  select(V1)

x2 <- map2_df(
  .x = df1[-1], 
  .y = df2[-1], 
  .f = function(x = .x, y = .y) {
    map2_dbl(x, y, max)
  }
)

bind_cols(x1, x2)

# A tibble: 5 x 3
     V1    V2    V3
  <chr> <dbl> <dbl>
1     a     0     1
2     b     1     1
3     c     0     1
4     d     1     1
5     e     1     1

OR, we can combine these steps into one pipeline:

df1 %>% 
  select(V1) %>% 
  bind_cols( 
    map2_df(
      .x = df1[-1],
      .y = df2[-1], 
      .f = ~ map2_dbl(.x, .y, max)
    )
  )
# A tibble: 5 x 3
     V1    V2    V3
  <chr> <dbl> <dbl>
1     a     0     1
2     b     1     1
3     c     0     1
4     d     1     1
5     e     1     1

Upvotes: 3

Related Questions