Paul PR
Paul PR

Reputation: 178

Using setdiff on dataframes with list columns

Is there an R function to get the rows that are in one data.frame but not in another, if the data.frames contain list-columns? I know dplyr::setdiff will work on regular data.frames, but if I apply it to a data.frame with a list-column, I get an error.

list_df1 <- data.frame(x = c(1, 2, 2))
list_df1$y <- list(c("A", "B"), c("C"), c("B", "C"))
list_df2 <- data.frame(x = c(2, 3))
list_df2$y <- list(c("C"), c("D", "E"))
dplyr::setdiff(list_df1, list_df2)
#> Error: Can't join on 'y' x 'y' because of incompatible types (list / list)

Currently I've been using a loop over the rows in both data.frames and directly comparing if the rows are equal:

in_df2 <- rep(FALSE, nrow(list_df1))
for (row_ind1 in seq_len(nrow(list_df1))) {
  for (row_ind2 in seq_len(nrow(list_df2))) {
    rows_equal <- all.equal(list_df1[row_ind1, ], 
                            list_df2[row_ind2, ], 
                            check.attributes = FALSE)
    if (isTRUE(rows_equal)) {
      in_df2[row_ind1] <- TRUE
      break
    }
  }
}
list_df1[!in_df2, ]
#>   x    y
#> 1 1 A, B
#> 3 2 B, C

And while gives the result I'm looking for, I'm sure there must be a better or more efficient solution.

dplyr::anti_join is also a possible solution, if the non-list columns uniquely identify the results. But in this case, I want to remove rows only if all entries are identical between the two data.frames. If we apply anti_join on just column x we don't get the results I'm looking for:

dplyr::anti_join(list_df1, list_df2, by = "x")
#>   x    y
#> 1 1 A, B

And applying it to all columns gives an error, just list set_diff

dplr::anti_join(list_df1, list_df2)
#> Error: Can't join on 'y' x 'y' because of incompatible types (list / list)

Upvotes: 3

Views: 1543

Answers (1)

Paul PR
Paul PR

Reputation: 178

The short answer is that the ability to apply setdiff to data.frames with list-columns is not built into R or dplyr at the moment. However, it is possible to implement a solution that's significantly faster than the naive approach of looping over both data.frames.

Option 1: If the list-columns aren't needed to identify the different rows, then dplyr::anti_join() is a great solution. To join by all non-list columns you could use (assuming list_df1 and list_df2 have the same column names):

nonlist_setdiff <- function(list_df1, list_df2) {
  nonlist_vars <- list_df1 %>%
    select_if(function(x) !is.list(x)) %>%
    colnames(.)
  dplyr::anti_join(list_df1, list_df2, by = nonlist_vars)
}

Option 2: If the list-columns are need to identify the different rows, then anti_join() is no longer an option. Instead, you can create a hash of each row of both data.frames, turning them into character vectors. Then you can apply setdiff to the character vectors. This approach uses the digest, dplyr, and purrr packages:

digest_setdiff <- function(list_df1, list_df2) {
  hash1 <- list_df1 %>%
    purrr::transpose() %>% # Transposing helps speed up the calculations
    purrr::map_chr(digest::digest, algo = "xxhash64") #use xxhash64 for speed
  hash2 <- list_df2 %>%
    purrr::transpose() %>%
    purrr::map_chr(digest::digest, algo = "xxhash64")
  only_in_df1 <- setdiff(hash1, hash2)
  list_df1[hash1 %in% only_in_df1, ]
}

This approach should work for all types of data.frames, no matter how complicated, and it's significantly faster than the naive approach. However, it is still way slower than the dplyr::setdiff and dplyr::anti_join commands, so don't use it if you don't have to.

Running digest_setdiff on your test input data gives the desired result:

digest_setdiff(list_df1, list_df2)
#>   x    y
#> 1 1 A, B
#> 3 2 B, C

Upvotes: 3

Related Questions