Reputation: 96
I would like to find rows that match in two columns but differ in a third and retain only one of these lines. So for example:
animal_couples <- data.frame(ID=c(1,2,3,4,5,6,7,8,9,10,11,12),species=c("Cat","Cat","Cat","Cat","Cat","Dog","Dog","Dog","Fish","Fish","Fish","Fish"),partner=c("Cat","Cat","Cat","Cat","Cat","Cat","Dog","Dog","Dog","Dog","Badger","Badger"),location=c("Germany","Germany","Iceland","France","France","Iceland","Greece","Greece","Germany","Germany","France","Spain"))
A row can match in 'species' and 'partner' so long as it also matches in 'location'. So the first two rows in this df are fine as Germany and Germany are the same. The next three rows are then removed. So the final df should be:
animal_couples_after <- data.frame(ID=c(1,2,6,7,8,9,10,11),species=c("Cat","Cat","Dog","Dog","Dog","Fish","Fish","Fish"),partner=c("Cat","Cat","Cat","Dog","Dog","Dog","Dog","Badger"),location=c("Germany","Germany","Iceland","Greece","Greece","Germany","Germany","France"))
The real dataset is quite large so I don't think looping through each row would be an option.
Thanks a lot for your help.
Upvotes: 0
Views: 62
Reputation: 14774
Could try:
library(data.table)
setDT(animal_couples)[, idx := rleid(location), by = .(species, partner)][idx == 1, ][, idx := NULL]
Output:
ID species partner location
1: 1 Cat Cat Germany
2: 2 Cat Cat Germany
3: 6 Dog Cat Iceland
4: 7 Dog Dog Greece
5: 8 Dog Dog Greece
6: 9 Fish Dog Germany
7: 10 Fish Dog Germany
8: 11 Fish Badger France
Or also shortened:
setDT(animal_couples)[, .SD[rleid(location) == 1], by = .(species, partner)]
Upvotes: 1