Reputation: 1

Finding Differences Between Two Dataframes

I have two dataframes: 1) an old dataframe (let's call it "df1") and 2) an updated dataframe ("df2"). I need to identify what has been added to or removed from df1 to create df2. So, I need a new dataframe with a new column identifying what rows should be added to or removed from df1 in order to get df2.

The two dataframes are of differing lengths, and Vessel_ID is the only unique identifier.

Here is a reproducible example:

df1 <- data.frame(Name=c('Vessel1', 'Vessel2', 'Vessel3', 'Vessel4', 'Vessel5'),
          Vessel_ID=c('1','2','3','4','5'), special_NO=c(10,20,30,40,50),
            stringsAsFactors=F)
df2 <- data.frame(Name=c('Vessel1', 'x', 'y', 'Vessel3', 'x', 'Vessel6'), Vessel_ID=c('1', '6', '7',   '3', '5', '10'), special_NO=NA, stringsAsFactors=F)

Ideally I would want an output like this:

df3
Name   Vessel_ID   special_NO   add_remove
Vessel2    2           20          remove
Vessel4    4           40          remove
Vessel6    10          NA          add
x          6           NA          add
y          7           NA          add

Also, if the Vessel_ID matches, I want to substitute the special_NO from df1 for NA in df2...but maybe that's for another question.

I tried add a new column to both df1 and df2 to identify which df they originally belonged to, then merging the dataframes and using the duplicated () function. This seemed to work, but I still wasn't sure which rows to remove or to add, and got different results depending on if I specified fromLast=T or fromLast=F.

Upvotes: 0

Answers (2)

birdnerd

Reputation: 1

Thanks for the comment! That looks like it would work too. Here's another solution a friend gave me using all base R:

df1$old_new <- "old"
df2$old_new <- "new"

#' Use the full_join function in the dplyr package to join both data.frames based on Name and Vessel_ID
df.comb <- dplyr::full_join(df1, df2, by = c("Name", "Vessel_ID"))
#' If you want to go fully base, you can use the merge() function to get the same result.
# df.comb <- merge(df1, df2, by = c("Name", "Vessel_ID"), all = TRUE, sort = FALSE)

#' Create a new column that sets the 'status' of a row
#' If old_new.x is NA, that row came from df2, so it is "new"
df.comb$status[is.na(df.comb$old_new.x)] <- "new"
# If old_new.x is not NA and old_new.y is NA then that row was in df1, but isn't in df2, so it has been "deleted"
df.comb$status[!is.na(df.comb$old_new.x) & is.na(df.comb$old_new.y)] <- "deleted"
# If old_new.x is not NA and old_new.y is not NA then that row was in both df1 and df2 = "same"
df.comb$status[!is.na(df.comb$old_new.x) & !is.na(df.comb$old_new.y)] <- "same"

# only keep the columns you need
df.comb <- df.comb[, c("Name", "Vessel_ID", "special_NO", "status")]

Upvotes: 0

Andre Wildberg

Reputation: 19088

An approach using bind_rows

library(dplyr)

bind_rows(df1 %>% mutate(add_remove="remove"), 
          df2 %>% mutate(add_remove="add")) %>% 
  group_by(Vessel_ID) %>% 
  filter(n() == 1) %>% 
  ungroup()
# A tibble: 5 × 4
  Name    Vessel_ID special_NO add_remove
  <chr>   <chr>          <dbl> <chr>     
1 Vessel2 2                 20 remove    
2 Vessel4 4                 40 remove    
3 x       6                 NA add       
4 y       7                 NA add       
5 Vessel6 10                NA add

Upvotes: 0

Finding Differences Between Two Dataframes

Answers (2)

Related Questions