Brendan Collins
Brendan Collins

Reputation: 11

Compare two data frames with same columns in R

I'm looking to compare two data frames in R and basically see where differences exist. Most of the answers I've found don't offer the solution in a format I'm looking for (e.g., found many answers comparing 1 column only, comparing only numeric value, showing only the total number of changes).

I've tried waldo::compare(tbl_1,tbl_2), and that got me close, but the output isn't formatted in a manner to share with non-R users. The output format also isn't ideal for large data frames.

I've also tried using anti_join from dplyr, which got me the closest to the ideal output; however, the anti_join solution doesn't show which variables changed, it only lets me know if there was some sort of the change in any given row. My tables look like the below, but with significantly more columns and rows. Each table has the same columns and same number of rows. I'm ideally trying to get an output that matches the last table below (tbl_diff), but this is proving to be less straightforward than expected.

join1 <- anti_join(tbl_1, tbl_2)
join2 <- anti_join(tbl_2, tbl_1)
tbl_diff <- left_join(tbl_1, tbl_2)

tbl_1:

id state widget count color
_bxhd7 IL widget_1 5 green
_ex9un MA widget_1 2 brown
_aolm0 CA widget_3 3 yellow
_m017e FL widget_7 8 orange

tbl_2:

id state widget count color
_bxhd7 TX widget_1 5 green
_ex9un MA widget_1 20 brown
_aolm0 CA widget_3 3 yellow
_m017e FL widget_2 8 blue

tbl_diff (ideal output):

id state widget count color
_bxhd7 FALSE TRUE TRUE TRUE
_ex9un TRUE TRUE FALSE TRUE
_aolm0 TRUE TRUE TRUE TRUE
_m017e TRUE FALSE TRUE FALSE

Upvotes: 1

Views: 2080

Answers (1)

Wencheng Lau-Medrano
Wencheng Lau-Medrano

Reputation: 753

You only have to change the way you are focusing the solution. I will share an example using some random datasets (if you edit your question sharing your tbl_1 and tbl_2, it would be better).

set.seed(666)

df1 <- mtcars[1:10, 1:4]
df1$name <- rownames(df1)
rownames(df1) <- NULL

df2 <- df1
df2$name <- sample(x = df2$name)
df2$mpg <- sample(x = df2$mpg)

require(dplyr)

bind_rows(df1, df2) %>% 
  
  group_by(name) %>% 
  
  summarise_all(function(x) length(unique(x)) == 1)

Upvotes: 1

Related Questions