Emman
Emman

Reputation: 4201

How to remove duplicated rows (e.g. dplyr::distinct()) while deleting some instances entirely (based on similarity in a specific column)?

Typically I use dplyr::distinct() to remove duplicated rows from the data. This function selects one copy of the duplicated rows and keeps it.
However, sometimes I wish to remove all copies if suspect the row is not valid.

Example

Let's say that I survey people and ask them about height, weight, and country they're from.

library(dplyr)
library(tibble)

set.seed(2021)
df_1 <- data.frame(id = 1:10,
           height = sample(c(150:210), size = 10),
           weight = sample(c(80: 200), size = 10))

df_2 <- df_1

df_final <- rbind(df_1, df_2)
df_final <- dplyr::arrange(df_final, id)

df_final <-
  df_final %>%
  add_column("country" = c("uk", "uk", 
                           "france", "usa", 
                           "germany", "germany",
                           "denmark", "norway",
                           "india", "india",
                           "chine", "china",
                           "mozambique", "argentina",
                           "morroco", "morroco",
                           "sweden", "japan",
                           "italy", "italy"))


df_final
#>    id height weight    country
#> 1   1    156    189         uk
#> 2   1    156    189         uk
#> 3   2    187    148     france
#> 4   2    187    148        usa
#> 5   3    195    190    germany
#> 6   3    195    190    germany
#> 7   4    207    182    denmark
#> 8   4    207    182     norway
#> 9   5    188    184      india
#> 10  5    188    184      india
#> 11  6    161    102      chine
#> 12  6    161    102      china
#> 13  7    201    155 mozambique
#> 14  7    201    155  argentina
#> 15  8    155    130    morroco
#> 16  8    155    130    morroco
#> 17  9    209    139     sweden
#> 18  9    209    139      japan
#> 19 10    202     97      italy
#> 20 10    202     97      italy

Created on 2021-07-19 by the reprex package (v2.0.0)

In df_final, each id means one person. In this example data we have duplicates for all 10 people. Everyone took the survey twice. However, if we look closely we see that some people reported they're from a different country. For example, id == 2 reported both usa in one case and france in another. In my data cleaning I wish to remove those people.

My primary goal is to remove duplicates. My secondary goal is to filter out those people who answered a different country.

If I simply go with dplyr::distinct(), I remain with all 10 ids.

df_final %>%
  distinct(id, .keep_all = TRUE)
#>    id height weight    country
#> 1   1    156    189         uk
#> 2   2    187    148     france
#> 3   3    195    190    germany
#> 4   4    207    182    denmark
#> 5   5    188    184      india
#> 6   6    161    102      chine
#> 7   7    201    155 mozambique
#> 8   8    155    130    morroco
#> 9   9    209    139     sweden
#> 10 10    202     97      italy

What should I do in order to run distinct() but only on those who have the same value for country in all duplicated copies (per id)?

Thanks

Upvotes: 0

Views: 57

Answers (2)

akrun
akrun

Reputation: 887531

We may also do

library(dplyr)
df_final %>%
    distinct(id, country, .keep_all = TRUE) %>%
    filter(id %in% names(which(table(id) == 1)))

Upvotes: 1

Andrew Gustar
Andrew Gustar

Reputation: 18425

Here is one option...

df_final %>% 
   group_by(id) %>% 
   filter(length(unique(country)) == 1) %>% 
   distinct()

# A tibble: 5 x 4
# Groups:   id [5]
     id height weight country
  <int>  <int>  <int> <chr>  
1     1    177     83 uk     
2     3    191    151 germany
3     5    186    175 india  
4     8    164    178 morroco
5    10    201    141 italy  

Upvotes: 1

Related Questions