ben_p_4370
ben_p_4370

Reputation: 53

How do I conditionally combine some rows (but not others) in R data frames?

I'm trying to figure out how to merge/join two data frames in such a way that if a certain condition is met, R combines two rows from the two data frames into one row, but if the condition is not met, R adds a new row with NAs in the columns that are not present in the original data frame. I'm unclear if this is a simpler join than I think it is, but I've been unable to figure out how to do this, even after reading through some stack overflow results (e.g. or e.g.).

Below are two example data frames:

df1 <- data.frame(Name_df1 = c("Alan", "Steve", "Melanie", "Steve", "Melanie"),
                  Date_df1 = c("02/18/2008", "02/18/2008", "03/14/2009", "04/19/2009", "03/16/2009"),
                  Job_df1 = c("Cook", "Security", "Greeter", "Security", "Greeter"),
                  Hours_df1 = c(8.5, 7.0, 6.0, 7.0, 6.0))

df2 <- data.frame(Name_df2 = c("Steve", "Alan", "Melanie", "Melanie", "Steve", "Carter"),
                  Date_df2 = c("02/18/2008", "02/18/2008", "03/14/2009", "03/17/2009","04/25/2009", "08/15/2011"),
                  Job_df2 = c("Police", "Chef", "Greeter", "Greeter", "Security", "Doorman"),
                  Hours_df2 = c(8.5, 7.0, 6.0, 7.0, 7.0, 6.5),
                  Wage_df2 = c(80, 77, 127.5, 90, 145, 100))

df1 and df2 both describe dates on which certain individuals worked, their job titles, names, and hours worked on the specified date, and only df2 also contains wage information. What I want to do is join df1 and df2 into a df3 with the following columns:

Specifically, I want to join df1 and df2 such that for every row in df1 or df2, if there is an exact match for both the value in the "Name_df1"/"Name_df2" column and the value in the "Date_df1"/"Date_df2" column in the corresponding column in the opposite data frame (i.e. in df1, if the value in the "Date_df1" column and "Name_df1" columns each have an exact match in the "Date_df2" column and "Name_df2" column, respectively, of df2):

However, if there is not an exact match for both the value in the "Name_df1"/"Name_df2" column and the value in the "Date_df1"/"Date_df2" column, then df3 should get a row that includes only the information from df1 in "Name_df1", "Date_df1", "Job_df1", and "Hours_df1", and NAs in "Name_df2", "Date_df2", "Job_df2", "Hours_df2", and "Wage_df2". Or, if the original row is from df2, then df3 should get a row that includes only the information from df2 in "Name_df2", "Date_df2", "Job_df2", "Hours_df2", and "Wage_df2", and NAs in "Name_df1", "Date_df1", "Job_df1", and "Hours_df1".

I know that's a mouthful, but based on the df1 and df2 specified above, this is what df3 would look like:

df3 <- data.frame(Name_df1 = c("Alan", "Steve", "Melanie", "Steve", "Melanie", NA, NA, NA),
                  Name_df2 = c("Alan", "Steve", "Melanie", NA, NA, "Melanie", "Steve", "Carter"),
                  Date_df1 = c("02/18/2008", "02/18/2008", "03/14/2009", "04/19/2009", "03/16/2009", NA, NA, NA),
                  Date_df2 = c("02/18/2008", "02/18/2008", "03/14/2009", NA, NA, "03/17/2009", "04/25/2009", "08/15/2011"),
                  Job_df1 = c("Cook", "Security", "Greeter", "Security", "Greeter", NA, NA, NA),
                  Job_df2 = c("Chef", "Police", "Greeter", NA, NA, "Greeter", "Security", "Doorman"),
                  Hours_df1 = c(7.0, 7.0, 6.0, 7.0, 6.0, NA, NA, NA),
                  Hours_df2 = c(7.0, 8.5, 6.0, NA, NA, 7.0, 7.0, 6.5),
                  Wage_df2 = c(77.0, 80, 127.5, NA, NA, 90.0, 145.0, 100))

Could anyone help me with this?

Upvotes: 1

Views: 873

Answers (3)

moodymudskipper
moodymudskipper

Reputation: 47320

I think you want a full join with keep = TRUE :

library(dplyr)
full_join(df1, df2, by = c(Name_df1 = "Name_df2", Date_df1 = "Date_df2"), keep = TRUE)
#>   Name_df1   Date_df1  Job_df1 Hours_df1 Name_df2   Date_df2  Job_df2 Hours_df2
#> 1     Alan 02/18/2008     Cook       8.5     Alan 02/18/2008     Chef       7.0
#> 2    Steve 02/18/2008 Security       7.0    Steve 02/18/2008   Police       8.5
#> 3  Melanie 03/14/2009  Greeter       6.0  Melanie 03/14/2009  Greeter       6.0
#> 4    Steve 04/19/2009 Security       7.0     <NA>       <NA>     <NA>        NA
#> 5  Melanie 03/16/2009  Greeter       6.0     <NA>       <NA>     <NA>        NA
#> 6     <NA>       <NA>     <NA>        NA  Melanie 03/17/2009  Greeter       7.0
#> 7     <NA>       <NA>     <NA>        NA    Steve 04/25/2009 Security       7.0
#> 8     <NA>       <NA>     <NA>        NA   Carter 08/15/2011  Doorman       6.5
#>   Wage_df2
#> 1     77.0
#> 2     80.0
#> 3    127.5
#> 4       NA
#> 5       NA
#> 6     90.0
#> 7    145.0
#> 8    100.0

Created on 2022-01-17 by the reprex package (v2.0.1)

Upvotes: 0

Ronak Shah
Ronak Shah

Reputation: 388982

It seems you are looking for full_join but to get the exact desired output (df3) you need to perform some manipulation.

library(dplyr)

full_join(df1, df2, by = c('Name_df1' = 'Name_df2', 'Date_df1' = 'Date_df2')) %>%
  mutate(Name_df2 = ifelse(is.na(Job_df2), NA, Name_df1),
         Name_df1 = ifelse(is.na(Job_df1), NA, Name_df1), 
         Date_df2 = ifelse(is.na(Job_df2), NA, Date_df1),
         Date_df1 = ifelse(is.na(Job_df1), NA, Date_df1)) %>%
  select(starts_with('Name'), starts_with('Date'), starts_with('Job'), everything())

#  Name_df1 Name_df2   Date_df1   Date_df2  Job_df1  Job_df2 Hours_df1 Hours_df2 Wage_df2
#1     Alan     Alan 02/18/2008 02/18/2008     Cook     Chef       8.5       7.0     77.0
#2    Steve    Steve 02/18/2008 02/18/2008 Security   Police       7.0       8.5     80.0
#3  Melanie  Melanie 03/14/2009 03/14/2009  Greeter  Greeter       6.0       6.0    127.5
#4    Steve     <NA> 04/19/2009       <NA> Security     <NA>       7.0        NA       NA
#5  Melanie     <NA> 03/16/2009       <NA>  Greeter     <NA>       6.0        NA       NA
#6     <NA>  Melanie       <NA> 03/17/2009     <NA>  Greeter        NA       7.0     90.0
#7     <NA>    Steve       <NA> 04/25/2009     <NA> Security        NA       7.0    145.0
#8     <NA>   Carter       <NA> 08/15/2011     <NA>  Doorman        NA       6.5    100.0

Upvotes: 1

Julian
Julian

Reputation: 11

Maybe this could be a way to join those two data frames:

library(dplyr) df3 <- df1 %>% full_join(df2, by="Name", suffix= c(".df1", ".df2"))

Result:

Table

Upvotes: 1

Related Questions