Reputation: 15
I have a dataset with 3 columns: ID, value a, and value b. I want to group the dataset by the values in the ID column and then identify rows whose value a and value b entries are identical between the different groups.
I know that I can use the dplyr package and data %>% group_by(ID) to group my dataset based on the ID column. I also know that I can use data[duplicated(data[,2:3]),] to return all rows with duplicate data in columns 2 (value a) and 3 (value b).
However, I would like a function that only finds duplicates between different ID groups, rather than duplicates anywhere in the whole dataset. I've tried combining group_by and duplicated, but it doesn't return the correct results, as sketched below. Which function would do this?
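Roughly, the kind of combination I tried looks like this (column names simplified to a and b):

library(dplyr)

data %>%
  group_by(ID) %>%
  # duplicated() here only flags repeats inside each ID group,
  # not rows that match across different ID groups
  filter(duplicated(cbind(a, b)))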
Upvotes: 0
Views: 1297
Reputation: 8861
It was a little unclear exactly what you wanted returned, so here are some options:
library(dplyr)
library(readr)

# example data
"ID,a,b
1, 1, 1
1, 1, 1
1, 1, 2
2, 1, 1
2, 1, 2" %>%
  read_csv() -> exp_dat
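# for reference, the parsed exp_dat looks like this:
# # A tibble: 5 x 3
#      ID     a     b
#   <dbl> <dbl> <dbl>
# 1     1     1     1
# 2     1     1     1
# 3     1     1     2
# 4     2     1     1
# 5     2     1     2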
# return only distinct rows
exp_dat %>%
  distinct(ID, a, b)
# # A tibble: 4 x 3
#      ID     a     b
#   <dbl> <dbl> <dbl>
# 1     1     1     1
# 2     1     1     2
# 3     2     1     1
# 4     2     1     2
# return single examples of duplicated rows
exp_dat %>%
  group_by(ID, a, b) %>%
  count() %>%
  filter(n > 1) %>%
  ungroup() %>%
  select(-n)
# # A tibble: 1 x 3
#      ID     a     b
#   <dbl> <dbl> <dbl>
# 1     1     1     1
# return all duplicated rows
exp_dat %>%
  group_by(ID, a, b) %>%
  add_count() %>%
  filter(n > 1) %>%
  ungroup() %>%
  select(-n)
# # A tibble: 2 x 3
#      ID     a     b
#   <dbl> <dbl> <dbl>
# 1     1     1     1
# 2     1     1     1
Upvotes: 1