dhersz

Reputation: 593

Removing duplicate records in a dataframe based on the values of a list column

I have a dataframe that contains duplicate values in a list column, and I want to keep only the first appearance of each unique value.

Let's say I have the following tibble:

library(dplyr)
library(tibble)

df <- tribble(
  ~x, ~y,
  1,  tibble(a = 1:2, b = 2:3),
  2,  tibble(a = 1:2, b = 2:3),
  3,  tibble(a = 0:1, b = 0:1)
)

df
#> # A tibble: 3 x 2
#>       x y               
#>   <dbl> <list>          
#> 1     1 <tibble [2 x 2]>
#> 2     2 <tibble [2 x 2]>
#> 3     3 <tibble [2 x 2]>

The desired outcome is:

desired_df
#> # A tibble: 2 x 2
#>       x y               
#>   <dbl> <list>          
#> 1     1 <tibble [2 x 2]>
#> 2     3 <tibble [2 x 2]>

If y weren't a list column, I could use distinct(df, y, .keep_all = TRUE), but the function doesn't properly support list columns, as shown:

distinct(df, y, .keep_all = TRUE)
#> Warning: distinct() does not fully support columns of type `list`.
#> List elements are compared by reference, see ?distinct for details.
#> This affects the following columns:
#> - `y`
#> # A tibble: 3 x 2
#>       x y               
#>   <dbl> <list>          
#> 1     1 <tibble [2 x 2]>
#> 2     2 <tibble [2 x 2]>
#> 3     3 <tibble [2 x 2]>

Is there any "clean" way to achieve what I want?

Upvotes: 1

Views: 167

Answers (2)

akrun

Reputation: 887028

One option is to use filter() with duplicated(), which compares the list elements by value and flags every repeat after the first:

library(dplyr)

df %>%
  filter(!duplicated(y))
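
A base R sketch of the same idea, assuming plain data frames in place of the tibbles from the question (the names y, df, dup and result are only illustrative). duplicated() marks each element that equals an earlier one, so negating it keeps the first occurrence:

```r
# List column holding three small data frames; the first two are equal by value.
y <- list(
  data.frame(a = 1:2, b = 2:3),
  data.frame(a = 1:2, b = 2:3),
  data.frame(a = 0:1, b = 0:1)
)

df <- data.frame(x = c(1, 2, 3))
df$y <- y  # attach the list as a list column

# TRUE for every element that repeats an earlier one.
dup <- duplicated(df$y)

# Keep only the rows whose y has not been seen before.
result <- df[!dup, ]
result$x
```

Here dup should come out as FALSE, TRUE, FALSE, so the rows with x equal to 1 and 3 are kept, which matches the desired output in the question.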

Upvotes: 1

dhersz

Reputation: 593

I have come up with an answer, but I think it's quite "wordy" (and I suspect it might be slow as well):

df <- df %>% 
  mutate(unique_list_id = match(y, unique(y))) %>% 
  group_by(unique_list_id) %>% 
  slice(1) %>% 
  ungroup() %>% 
  select(-unique_list_id)

df
#> # A tibble: 2 x 2
#>       x y               
#>   <dbl> <list>          
#> 1     1 <tibble [2 x 2]>
#> 2     3 <tibble [2 x 2]>
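
To see what the grouping key looks like, here is a small base R sketch, again assuming plain data frames instead of tibbles (y, id and keep are illustrative names). One caveat: ?match documents that lists are converted to character vectors before matching, so this route compares a text rendering of each element rather than the objects themselves.

```r
# Same three elements as in the question; the first two are equal by value.
y <- list(
  data.frame(a = 1:2, b = 2:3),
  data.frame(a = 1:2, b = 2:3),
  data.frame(a = 0:1, b = 0:1)
)

# unique() drops repeated elements; match() then gives every element
# the position of its first equal entry in that de-duplicated list.
id <- match(y, unique(y))

# Keeping the first row of each id group is the same as dropping repeated ids.
keep <- !duplicated(id)
```

With this data id should be 1, 1, 2, so the group_by() and slice(1) steps keep the first and third rows.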

Upvotes: 1
