Phil
Phil

Reputation: 8107

How to merge two data sets with shared cases and variables without duplicating variables?

Suppose I have the two following data sets:

library(tidyverse)

iris1 <- iris %>% 
  as_tibble() %>% 
  mutate(id = row_number(),
         Petal.Length = ifelse(id > 75, NA, Petal.Length),
         Petal.Width = ifelse(id > 75, NA, Petal.Width))

iris2 <- iris %>% 
  as_tibble() %>% 
  mutate(id = row_number()) %>% 
  filter(id > 75) %>% 
  select(id, Petal.Length, Petal.Width)

Here, I've set iris2 to be data that is missing from iris1. I would like to join the data in iris2 back into iris1, using id as the key. I could use dplyr::full_join and join by id, but I would have duplicated the variables Petal.Length and Petal.Width when in fact I want the values in iris2 integrated into their respective variables in iris1.

Is there a more elegant solution than doing the below? The example here is with 2 variables (Petal.Length and Petal.Width), but in my actual dataset this impacts over a hundred variables, and I'd have to think there's a better way than by manual labour.

left_join(iris1, iris2, by = "id") %>% 
  mutate(Petal.Length = ifelse(is.na(Petal.Length.x), Petal.Length.y, Petal.Length.x),
         Petal.Width = ifelse(is.na(Petal.Width.x), Petal.Width.y, Petal.Width.x)) %>% 
  select(-Petal.Length.x, -Petal.Length.y, -Petal.Width.x, -Petal.Width.y)

Upvotes: 0

Views: 55

Answers (2)

Phil
Phil

Reputation: 8107

Another potential solution, but this assumes that there are no NA cases prior to the manipulation. Basically, join the two data files, turn them to long, remove the ".x" and ".y" from the variable names, and remove NA rows, and then spread the data back to its original shape:

iris1 %>% 
  left_join(iris2, by = "id") %>% 
  gather(key = key, value = value, -id) %>%  
  mutate(key = str_remove(key, "\\.[xy]$")) %>% 
  filter(!is.na(value)) %>% 
  spread(key = key, value = value)

Upvotes: 0

Shree
Shree

Reputation: 11150

I reduced your data set to 10 rows. I haven't thought this through but this could work.

Here are the steps - 1) added a dataframe identifier field tablename 2) did a union using bind_rows 3) grouped by id and got the first non-NA value for each column using summarize_all(). This means that you need to bind tables in order of preference.

iris1 <- head(iris, 10) %>% 
  as_tibble() %>% 
  mutate(id = row_number(),
         Petal.Length = ifelse(id > 7, NA, Petal.Length),
         Petal.Width = ifelse(id > 7, NA, Petal.Width),
         tablename = "table1"
         )

# A tibble: 10 x 7
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species    id tablename
          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <int> <chr>    
 1         5.10        3.50         1.40       0.200 setosa      1 table1   
 2         4.90        3.00         1.40       0.200 setosa      2 table1   
 3         4.70        3.20         1.30       0.200 setosa      3 table1   
 4         4.60        3.10         1.50       0.200 setosa      4 table1   
 5         5.00        3.60         1.40       0.200 setosa      5 table1   
 6         5.40        3.90         1.70       0.400 setosa      6 table1   
 7         4.60        3.40         1.40       0.300 setosa      7 table1   
 8         5.00        3.40        NA         NA     setosa      8 table1   
 9         4.40        2.90        NA         NA     setosa      9 table1   
10         4.90        3.10        NA         NA     setosa     10 table1  

iris2 <- head(iris, 10) %>% 
  as_tibble() %>% 
  mutate(id = row_number(), tablename = "table2") %>% 
  filter(id > 7) %>% 
  select(id, Petal.Length, Petal.Width, tablename)

# A tibble: 3 x 4
     id Petal.Length Petal.Width tablename
  <int>        <dbl>       <dbl> <chr>    
1     8         1.50       0.200 table2   
2     9         1.40       0.200 table2   
3    10         1.50       0.100 table2

combined <- bind_rows(iris1, iris2) %>%
  group_by(id) %>%
  summarize_all(function(x) x[!is.na(x)][1])

# A tibble: 10 x 7
# Groups:   id [10]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species    id tablename
          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <int> <chr>    
 1         5.10        3.50         1.40       0.200 setosa      1 table1   
 2         4.90        3.00         1.40       0.200 setosa      2 table1   
 3         4.70        3.20         1.30       0.200 setosa      3 table1   
 4         4.60        3.10         1.50       0.200 setosa      4 table1   
 5         5.00        3.60         1.40       0.200 setosa      5 table1   
 6         5.40        3.90         1.70       0.400 setosa      6 table1   
 7         4.60        3.40         1.40       0.300 setosa      7 table1   
 8         5.00        3.40         1.50       0.200 setosa      8 table1   
 9         4.40        2.90         1.40       0.200 setosa      9 table1   
10         4.90        3.10         1.50       0.100 setosa     10 table1

Upvotes: 1

Related Questions