Reputation: 11
I'm looking to compare two data frames in R and basically see where differences exist. Most of the answers I've found don't offer the solution in a format I'm looking for (e.g., found many answers comparing 1 column only, comparing only numeric value, showing only the total number of changes).
I've tried waldo::compare(tbl_1,tbl_2)
, and that got me close, but the output isn't formatted in a manner to share with non-R users. The output format also isn't ideal for large data frames.
I've also tried using anti_join from dplyr, which got me the closest to the ideal output; however, the anti_join solution doesn't show which variables changed, it only lets me know if there was some sort of the change in any given row. My tables look like the below, but with significantly more columns and rows. Each table has the same columns and same number of rows. I'm ideally trying to get an output that matches the last table below (tbl_diff), but this is proving to be less straightforward than expected.
join1 <- anti_join(tbl_1, tbl_2)
join2 <- anti_join(tbl_2, tbl_1)
tbl_diff <- left_join(tbl_1, tbl_2)
tbl_1:
id | state | widget | count | color |
---|---|---|---|---|
_bxhd7 | IL | widget_1 | 5 | green |
_ex9un | MA | widget_1 | 2 | brown |
_aolm0 | CA | widget_3 | 3 | yellow |
_m017e | FL | widget_7 | 8 | orange |
tbl_2:
id | state | widget | count | color |
---|---|---|---|---|
_bxhd7 | TX | widget_1 | 5 | green |
_ex9un | MA | widget_1 | 20 | brown |
_aolm0 | CA | widget_3 | 3 | yellow |
_m017e | FL | widget_2 | 8 | blue |
tbl_diff (ideal output):
id | state | widget | count | color |
---|---|---|---|---|
_bxhd7 | FALSE | TRUE | TRUE | TRUE |
_ex9un | TRUE | TRUE | FALSE | TRUE |
_aolm0 | TRUE | TRUE | TRUE | TRUE |
_m017e | TRUE | FALSE | TRUE | FALSE |
Upvotes: 1
Views: 2080
Reputation: 753
You only have to change the way you are focusing the solution. I will share an example using some random datasets (if you edit your question sharing your tbl_1
and tbl_2
, it would be better).
set.seed(666)
df1 <- mtcars[1:10, 1:4]
df1$name <- rownames(df1)
rownames(df1) <- NULL
df2 <- df1
df2$name <- sample(x = df2$name)
df2$mpg <- sample(x = df2$mpg)
require(dplyr)
bind_rows(df1, df2) %>%
group_by(name) %>%
summarise_all(function(x) length(unique(x)) == 1)
Upvotes: 1