Removing duplicate records in a dataframe based on the values of a list column

Question

I have a dataframe that contains duplicate values in a list column, and I want to keep only the first appearance of each unique value.

Let's say I have the following tibble:

df <- tribble(
  ~x, ~y,
  1,  tibble(a = 1:2, b = 2:3),
  2,  tibble(a = 1:2, b = 2:3),
  3,  tibble(a = 0:1, b = 0:1)
)

df
#> # A tibble: 3 x 2
#>       x y               
#>   <dbl> <list>          
#> 1     1 <tibble [2 x 2]>
#> 2     2 <tibble [2 x 2]>
#> 3     3 <tibble [2 x 2]>

The desired outcome is:

desired_df
#> # A tibble: 2 x 2
#>       x y               
#>   <dbl> <list>          
#> 1     1 <tibble [2 x 2]>
#> 2     3 <tibble [2 x 2]>

If y weren't a list column, I could use distinct(df, y, .keep_all = TRUE), but the function doesn't fully support list columns, as shown:

distinct(df, y, .keep_all = TRUE)
#> Warning: distinct() does not fully support columns of type `list`.
#> List elements are compared by reference, see ?distinct for details.
#> This affects the following columns:
#> - `y`
#> # A tibble: 3 x 2
#>       x y               
#>   <dbl> <list>          
#> 1     1 <tibble [2 x 2]>
#> 2     2 <tibble [2 x 2]>
#> 3     3 <tibble [2 x 2]>

Is there any "clean" way to achieve what I want?

akrun · Accepted Answer

One option is to use filter with duplicated, which compares the list elements by value rather than by reference:

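library(dplyr)

# duplicated() flags every repeat of an earlier list element,
# so negating it keeps only the first occurrence of each value
df %>%
    filter(!duplicated(y))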
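
To see why this works, it can help to inspect the logical vector that duplicated returns before filtering. The following is a minimal sketch that rebuilds the df from the question; the names dup and result are introduced here for illustration only and are not part of the original answer.

```r
library(dplyr)
library(tibble)

# Rebuild the example data from the question
df <- tribble(
  ~x, ~y,
  1,  tibble(a = 1:2, b = 2:3),
  2,  tibble(a = 1:2, b = 2:3),
  3,  tibble(a = 0:1, b = 0:1)
)

# Base R's duplicated() accepts a list and compares its elements by value;
# the second tibble has the same contents as the first, so it is flagged
dup <- duplicated(df$y)
print(dup)
#> [1] FALSE  TRUE FALSE

# Keep only the rows whose list element has not been seen before
result <- df %>%
  filter(!dup)

print(result$x)
#> [1] 1 3
```

If the last occurrence should be kept instead of the first, duplicated also takes a fromLast = TRUE argument.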
