Reputation: 4201
Typically I use dplyr::distinct()
to remove duplicated rows from the data. This function selects one copy of the duplicated rows and keeps it.
However, sometimes I wish to remove all copies if suspect the row is not valid.
Let's say that I survey people and ask them about height, weight, and country they're from.
library(dplyr)
library(tibble)
set.seed(2021)
df_1 <- data.frame(id = 1:10,
height = sample(c(150:210), size = 10),
weight = sample(c(80: 200), size = 10))
df_2 <- df_1
df_final <- rbind(df_1, df_2)
df_final <- dplyr::arrange(df_final, id)
df_final <-
df_final %>%
add_column("country" = c("uk", "uk",
"france", "usa",
"germany", "germany",
"denmark", "norway",
"india", "india",
"chine", "china",
"mozambique", "argentina",
"morroco", "morroco",
"sweden", "japan",
"italy", "italy"))
df_final
#> id height weight country
#> 1 1 156 189 uk
#> 2 1 156 189 uk
#> 3 2 187 148 france
#> 4 2 187 148 usa
#> 5 3 195 190 germany
#> 6 3 195 190 germany
#> 7 4 207 182 denmark
#> 8 4 207 182 norway
#> 9 5 188 184 india
#> 10 5 188 184 india
#> 11 6 161 102 chine
#> 12 6 161 102 china
#> 13 7 201 155 mozambique
#> 14 7 201 155 argentina
#> 15 8 155 130 morroco
#> 16 8 155 130 morroco
#> 17 9 209 139 sweden
#> 18 9 209 139 japan
#> 19 10 202 97 italy
#> 20 10 202 97 italy
Created on 2021-07-19 by the reprex package (v2.0.0)
In df_final
, each id
means one person. In this example data we have duplicates for all 10 people. Everyone took the survey twice. However, if we look closely we see that some people reported they're from a different country. For example, id == 2
reported both usa
in one case and france
in another. In my data cleaning I wish to remove those people.
My primary goal is to remove duplicates. My secondary goal is to filter out those people who answered a different country
.
If I simply go with dplyr::distinct()
, I remain with all 10 id
s.
df_final %>%
distinct(id, .keep_all = TRUE)
#> id height weight country
#> 1 1 156 189 uk
#> 2 2 187 148 france
#> 3 3 195 190 germany
#> 4 4 207 182 denmark
#> 5 5 188 184 india
#> 6 6 161 102 chine
#> 7 7 201 155 mozambique
#> 8 8 155 130 morroco
#> 9 9 209 139 sweden
#> 10 10 202 97 italy
What should I do in order to run distinct()
but only on those who have the same value for country
in all duplicated copies (per id
)?
Thanks
Upvotes: 0
Views: 57
Reputation: 887531
We may also do
library(dplyr)
df_final %>%
distinct(id, country, .keep_all = TRUE) %>%
filter(id %in% names(which(table(id) == 1)))
Upvotes: 1
Reputation: 18425
Here is one option...
df_final %>%
group_by(id) %>%
filter(length(unique(country)) == 1) %>%
distinct()
# A tibble: 5 x 4
# Groups: id [5]
id height weight country
<int> <int> <int> <chr>
1 1 177 83 uk
2 3 191 151 germany
3 5 186 175 india
4 8 164 178 morroco
5 10 201 141 italy
Upvotes: 1