Reputation: 178
Is there an R function to get the rows that are in one data.frame but not in another, if the data.frames contain list-columns? I know dplyr::setdiff will work on regular data.frames, but if I apply it to a data.frame with a list-column, I get an error.
list_df1 <- data.frame(x = c(1, 2, 2))
list_df1$y <- list(c("A", "B"), c("C"), c("B", "C"))
list_df2 <- data.frame(x = c(2, 3))
list_df2$y <- list(c("C"), c("D", "E"))
dplyr::setdiff(list_df1, list_df2)
#> Error: Can't join on 'y' x 'y' because of incompatible types (list / list)
Currently I've been using a loop over the rows in both data.frames and directly comparing if the rows are equal:
in_df2 <- rep(FALSE, nrow(list_df1))
for (row_ind1 in seq_len(nrow(list_df1))) {
for (row_ind2 in seq_len(nrow(list_df2))) {
rows_equal <- all.equal(list_df1[row_ind1, ],
list_df2[row_ind2, ],
check.attributes = FALSE)
if (isTRUE(rows_equal)) {
in_df2[row_ind1] <- TRUE
break
}
}
}
list_df1[!in_df2, ]
#> x y
#> 1 1 A, B
#> 3 2 B, C
And while gives the result I'm looking for, I'm sure there must be a better or more efficient solution.
dplyr::anti_join is also a possible solution, if the non-list columns uniquely identify the results. But in this case, I want to remove rows only if all entries are identical between the two data.frames. If we apply anti_join on just column x we don't get the results I'm looking for:
dplyr::anti_join(list_df1, list_df2, by = "x")
#> x y
#> 1 1 A, B
And applying it to all columns gives an error, just list set_diff
dplr::anti_join(list_df1, list_df2)
#> Error: Can't join on 'y' x 'y' because of incompatible types (list / list)
Upvotes: 3
Views: 1543
Reputation: 178
The short answer is that the ability to apply setdiff to data.frames with list-columns is not built into R or dplyr at the moment. However, it is possible to implement a solution that's significantly faster than the naive approach of looping over both data.frames.
Option 1: If the list-columns aren't needed to identify the different rows, then dplyr::anti_join() is a great solution. To join by all non-list columns you could use (assuming list_df1 and list_df2 have the same column names):
nonlist_setdiff <- function(list_df1, list_df2) {
nonlist_vars <- list_df1 %>%
select_if(function(x) !is.list(x)) %>%
colnames(.)
dplyr::anti_join(list_df1, list_df2, by = nonlist_vars)
}
Option 2: If the list-columns are need to identify the different rows, then anti_join() is no longer an option. Instead, you can create a hash of each row of both data.frames, turning them into character vectors. Then you can apply setdiff to the character vectors. This approach uses the digest, dplyr, and purrr packages:
digest_setdiff <- function(list_df1, list_df2) {
hash1 <- list_df1 %>%
purrr::transpose() %>% # Transposing helps speed up the calculations
purrr::map_chr(digest::digest, algo = "xxhash64") #use xxhash64 for speed
hash2 <- list_df2 %>%
purrr::transpose() %>%
purrr::map_chr(digest::digest, algo = "xxhash64")
only_in_df1 <- setdiff(hash1, hash2)
list_df1[hash1 %in% only_in_df1, ]
}
This approach should work for all types of data.frames, no matter how complicated, and it's significantly faster than the naive approach. However, it is still way slower than the dplyr::setdiff and dplyr::anti_join commands, so don't use it if you don't have to.
Running digest_setdiff on your test input data gives the desired result:
digest_setdiff(list_df1, list_df2)
#> x y
#> 1 1 A, B
#> 3 2 B, C
Upvotes: 3