How to merge two data sets with shared cases and variables without duplicating variables?

Question

Suppose I have the two following data sets:

library(tidyverse)

iris1 <- iris %>% 
  as_tibble() %>% 
  mutate(id = row_number(),
         Petal.Length = ifelse(id > 75, NA, Petal.Length),
         Petal.Width = ifelse(id > 75, NA, Petal.Width))

iris2 <- iris %>% 
  as_tibble() %>% 
  mutate(id = row_number()) %>% 
  filter(id > 75) %>% 
  select(id, Petal.Length, Petal.Width)

Here, I've set iris2 to be data that is missing from iris1. I would like to join the data in iris2 back into iris1, using id as the key. I could use dplyr::full_join and join by id, but I would have duplicated the variables Petal.Length and Petal.Width when in fact I want the values in iris2 integrated into their respective variables in iris1.

Is there a more elegant solution than doing the below? The example here is with 2 variables (Petal.Length and Petal.Width), but in my actual dataset this impacts over a hundred variables, and I'd have to think there's a better way than by manual labour.

left_join(iris1, iris2, by = "id") %>% 
  mutate(Petal.Length = ifelse(is.na(Petal.Length.x), Petal.Length.y, Petal.Length.x),
         Petal.Width = ifelse(is.na(Petal.Width.x), Petal.Width.y, Petal.Width.x)) %>% 
  select(-Petal.Length.x, -Petal.Length.y, -Petal.Width.x, -Petal.Width.y)

Shree · Accepted Answer

I reduced your data set to 10 rows. I haven't thought this through but this could work.

Here are the steps - 1) added a dataframe identifier field tablename 2) did a union using bind_rows 3) grouped by id and got the first non-NA value for each column using summarize_all(). This means that you need to bind tables in order of preference.

iris1 <- head(iris, 10) %>% 
  as_tibble() %>% 
  mutate(id = row_number(),
         Petal.Length = ifelse(id > 7, NA, Petal.Length),
         Petal.Width = ifelse(id > 7, NA, Petal.Width),
         tablename = "table1"
         )

# A tibble: 10 x 7
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species    id tablename
                                         
 1         5.10        3.50         1.40       0.200 setosa      1 table1   
 2         4.90        3.00         1.40       0.200 setosa      2 table1   
 3         4.70        3.20         1.30       0.200 setosa      3 table1   
 4         4.60        3.10         1.50       0.200 setosa      4 table1   
 5         5.00        3.60         1.40       0.200 setosa      5 table1   
 6         5.40        3.90         1.70       0.400 setosa      6 table1   
 7         4.60        3.40         1.40       0.300 setosa      7 table1   
 8         5.00        3.40        NA         NA     setosa      8 table1   
 9         4.40        2.90        NA         NA     setosa      9 table1   
10         4.90        3.10        NA         NA     setosa     10 table1  

iris2 <- head(iris, 10) %>% 
  as_tibble() %>% 
  mutate(id = row_number(), tablename = "table2") %>% 
  filter(id > 7) %>% 
  select(id, Petal.Length, Petal.Width, tablename)

# A tibble: 3 x 4
     id Petal.Length Petal.Width tablename
                      
1     8         1.50       0.200 table2   
2     9         1.40       0.200 table2   
3    10         1.50       0.100 table2

combined <- bind_rows(iris1, iris2) %>%
  group_by(id) %>%
  summarize_all(function(x) x[!is.na(x)][1])

# A tibble: 10 x 7
# Groups:   id [10]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species    id tablename
                                         
 1         5.10        3.50         1.40       0.200 setosa      1 table1   
 2         4.90        3.00         1.40       0.200 setosa      2 table1   
 3         4.70        3.20         1.30       0.200 setosa      3 table1   
 4         4.60        3.10         1.50       0.200 setosa      4 table1   
 5         5.00        3.60         1.40       0.200 setosa      5 table1   
 6         5.40        3.90         1.70       0.400 setosa      6 table1   
 7         4.60        3.40         1.40       0.300 setosa      7 table1   
 8         5.00        3.40         1.50       0.200 setosa      8 table1   
 9         4.40        2.90         1.40       0.200 setosa      9 table1   
10         4.90        3.10         1.50       0.100 setosa     10 table1

How to merge two data sets with shared cases and variables without duplicating variables?

Answers (2)

Related Questions