R: remove rows that are duplicate in two columns and different in a third

Question

I would like to find rows that match in two columns but differ in a third and retain only one of these lines. So for example:

animal_couples <- data.frame(ID=c(1,2,3,4,5,6,7,8,9,10,11,12),species=c("Cat","Cat","Cat","Cat","Cat","Dog","Dog","Dog","Fish","Fish","Fish","Fish"),partner=c("Cat","Cat","Cat","Cat","Cat","Cat","Dog","Dog","Dog","Dog","Badger","Badger"),location=c("Germany","Germany","Iceland","France","France","Iceland","Greece","Greece","Germany","Germany","France","Spain"))

A row can match in 'species' and 'partner' so long as it also matches in 'location'. So the first two rows in this df are fine as Germany and Germany are the same. The next three rows are then removed. So the final df should be:

animal_couples_after <- data.frame(ID=c(1,2,6,7,8,9,10,11),species=c("Cat","Cat","Dog","Dog","Dog","Fish","Fish","Fish"),partner=c("Cat","Cat","Cat","Dog","Dog","Dog","Dog","Badger"),location=c("Germany","Germany","Iceland","Greece","Greece","Germany","Germany","France"))

The real dataset is quite large so I don't think looping through each row would be an option.

Thanks a lot for your help.

arg0naut91 · Accepted Answer

Could try:

library(data.table)

setDT(animal_couples)[, idx := rleid(location), by = .(species, partner)][idx == 1, ][, idx := NULL]

Output:

   ID species partner location
1:  1     Cat     Cat  Germany
2:  2     Cat     Cat  Germany
3:  6     Dog     Cat  Iceland
4:  7     Dog     Dog   Greece
5:  8     Dog     Dog   Greece
6:  9    Fish     Dog  Germany
7: 10    Fish     Dog  Germany
8: 11    Fish  Badger   France

Or also shortened:

setDT(animal_couples)[, .SD[rleid(location) == 1], by = .(species, partner)]

R: remove rows that are duplicate in two columns and different in a third

Answers (1)

Related Questions