person
person

Reputation: 96

R: remove rows that are duplicate in two columns and different in a third

I would like to find rows that match in two columns but differ in a third and retain only one of these lines. So for example:

animal_couples <- data.frame(ID=c(1,2,3,4,5,6,7,8,9,10,11,12),species=c("Cat","Cat","Cat","Cat","Cat","Dog","Dog","Dog","Fish","Fish","Fish","Fish"),partner=c("Cat","Cat","Cat","Cat","Cat","Cat","Dog","Dog","Dog","Dog","Badger","Badger"),location=c("Germany","Germany","Iceland","France","France","Iceland","Greece","Greece","Germany","Germany","France","Spain"))

A row can match in 'species' and 'partner' so long as it also matches in 'location'. So the first two rows in this df are fine as Germany and Germany are the same. The next three rows are then removed. So the final df should be:

animal_couples_after <- data.frame(ID=c(1,2,6,7,8,9,10,11),species=c("Cat","Cat","Dog","Dog","Dog","Fish","Fish","Fish"),partner=c("Cat","Cat","Cat","Dog","Dog","Dog","Dog","Badger"),location=c("Germany","Germany","Iceland","Greece","Greece","Germany","Germany","France"))

The real dataset is quite large so I don't think looping through each row would be an option.

Thanks a lot for your help.

Upvotes: 0

Views: 62

Answers (1)

arg0naut91
arg0naut91

Reputation: 14774

Could try:

library(data.table)

setDT(animal_couples)[, idx := rleid(location), by = .(species, partner)][idx == 1, ][, idx := NULL]

Output:

   ID species partner location
1:  1     Cat     Cat  Germany
2:  2     Cat     Cat  Germany
3:  6     Dog     Cat  Iceland
4:  7     Dog     Dog   Greece
5:  8     Dog     Dog   Greece
6:  9    Fish     Dog  Germany
7: 10    Fish     Dog  Germany
8: 11    Fish  Badger   France

Or also shortened:

setDT(animal_couples)[, .SD[rleid(location) == 1], by = .(species, partner)]

Upvotes: 1

Related Questions