tamcle

Reputation: 15

How to find duplicates based on values in 2 columns while accounting for grouping by another column in R?

I have a dataset with 3 columns: ID, value a, and value b. I want to group the dataset based on the values in the ID column and then identify duplicates that have identical data in the value a and b columns between the different groupings.

I know that I can use the dplyr package and data %>% group_by(ID) to group my dataset based on the ID column. I also know that I can use data[duplicated(data[,2:3]),] to return all rows with duplicate data in columns 2 (value a) and 3 (value b).
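
For example, on a small made-up dataset (a sketch; data and the short column names a and b stand in for my real dataset):

data <- data.frame(ID = c(1, 1, 2, 2),
                   a  = c(1, 1, 1, 1),
                   b  = c(1, 2, 1, 2))

# flags every row whose (a, b) pair appeared in an earlier row, regardless of ID
data[duplicated(data[, 2:3]), ]
#   ID a b
# 3  2 1 1
# 4  2 1 2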

However, I would like a function that finds only duplicates between different ID groups, instead of all duplicates within the whole dataset. I've tried combining group_by and duplicated, but it doesn't return the correct results. Which function would do this?
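
For example, this attempt (a sketch of the kind of thing I tried) returns zero rows on the sample data above, because duplicated() then runs separately inside each ID group:

library(dplyr)

data %>% 
  group_by(ID) %>% 
  filter(duplicated(cbind(a, b)))
# 0 rows: no (a, b) pair repeats within a single ID group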

Upvotes: 0

Views: 1297

Answers (1)

the-mad-statter

Reputation: 8861

It was a little unclear if you wanted to return:

  1. only the distinct rows
  2. single examples of duplicated rows
  3. all duplicated rows

So here are some options:

library(dplyr)
library(readr)

"ID,a,b
 1, 1, 1
 1, 1, 1
 1, 1, 2
 2, 1, 1
 2, 1, 2" %>% 
  read_csv() -> exp_dat

# return only distinct rows
exp_dat %>% 
  distinct(ID, a, b)

# # A tibble: 4 x 3
#      ID     a     b
#   <dbl> <dbl> <dbl>
# 1     1     1     1
# 2     1     1     2
# 3     2     1     1
# 4     2     1     2

# return single examples of duplicated rows
exp_dat %>% 
  group_by(ID, a, b) %>% 
  count() %>% 
  filter(n > 1) %>% 
  ungroup() %>% 
  select(-n)

# # A tibble: 1 x 3
#      ID     a     b
#   <dbl> <dbl> <dbl>
# 1     1     1     1

# return all duplicated rows
exp_dat %>% 
  group_by(ID, a, b) %>% 
  add_count() %>% 
  filter(n > 1) %>% 
  ungroup() %>% 
  select(-n)

# # A tibble: 2 x 3
#      ID     a     b
#   <dbl> <dbl> <dbl>
# 1     1     1     1
# 2     1     1     1
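
A note on the last option: if you don't need the helper count column at all, filter(n() > 1) gives the same result in one step:

# same result: keep rows whose (ID, a, b) combination occurs more than once
exp_dat %>% 
  group_by(ID, a, b) %>% 
  filter(n() > 1) %>% 
  ungroup()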

Upvotes: 1
