Emily

Reputation: 500

Delete duplicates within groups?

Code for example data:

    mydf<-data.frame(Group_ID=c("337", "337", "201", "201", "470", "470", "999", "999"), 
             Timestamp=c("A", "A", "B", "B", "C", "D", "E", "F"), 
             MU=c("1", "1", "2", "3", "4", "4", "5", "6"))

Example dataframe:

  Group_ID Timestamp MU
1      337         A  1
2      337         A  1
3      201         B  2
4      201         B  3
5      470         C  4
6      470         D  4
7      999         E  5
8      999         F  6

Within "Group_ID", I want to retain only entries where neither "Timestamp" nor "MU" is duplicated. Thus, in this example, only rows 7 & 8 would be retained ("Group_ID" 999 has unique entries for both "Timestamp" and "MU").

Some of my attempts:

    mydf<-mydf %>%
      group_by(Group_ID) %>%
      filter(unique(Timestamp))

Returns error:

"Argument 2 filter condition does not evaluate to a logical vector"

If this did work, I'd then run it again using unique(MU)

    mydf<-mydf %>%
      group_by(Group_ID) %>%
      mydf[!duplicated(mydf$Timestamp, fromLast = TRUE),]

Returns error:

"Length of logical index vector must be 1 or 3 (the number of columns), not 8"

(Again, I'd run the code again using MU)

I've looked through similar questions but haven't found one with the same scenario. Many thanks in advance.

Upvotes: 3

Views: 340

Answers (3)

moodymudskipper

Reputation: 47300

Here's a base solution:

is.unique <- function(x) !(duplicated(x) | duplicated(x, fromLast = TRUE))
mydf[is.unique(mydf[1:2]) & is.unique(mydf[c(1,3)]),]
#>   Group_ID Timestamp MU
#> 7      999         E  5
#> 8      999         F  6
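A note on why this works (my explanation, not part of the answer): `duplicated()` accepts a data frame and flags duplicated *rows*, so `is.unique(mydf[1:2])` marks rows whose (Group_ID, Timestamp) pair occurs exactly once. A minimal illustration with a made-up two-column frame:

```r
is.unique <- function(x) !(duplicated(x) | duplicated(x, fromLast = TRUE))

# Rows 1 and 2 are identical pairs; row 3 is unique.
df <- data.frame(g = c("337", "337", "999"), t = c("A", "A", "E"))
is.unique(df)  # FALSE FALSE TRUE
```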

Upvotes: 0

akrun

Reputation: 886938

If we are using filter, it expects a logical vector. The output of unique is just the unique elements of that column (character class), so it is not going to work. We can instead use duplicated to get a logical vector flagging duplicate elements, then negate it (!) so that TRUE becomes FALSE and vice versa, keeping only the non-duplicated elements.

library(dplyr)
mydf %>% 
   group_by(Group_ID) %>% 
   filter(!(duplicated(Timestamp, fromLast = TRUE)| duplicated(Timestamp))) 
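To make the double `duplicated()` call concrete, here is a small base-R illustration on an invented vector (not part of the answer):

```r
x <- c("A", "A", "B")
duplicated(x)                   # FALSE  TRUE FALSE : first "A" is not flagged
duplicated(x, fromLast = TRUE)  #  TRUE FALSE FALSE : last "A" is not flagged
# OR-ing the two and negating keeps only values with no duplicate at all:
!(duplicated(x) | duplicated(x, fromLast = TRUE))  # FALSE FALSE TRUE
```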

Or group by both 'Group_ID', 'Timestamp' and filter based on the number of rows

mydf %>%
   group_by(Group_ID, Timestamp) %>%
   filter(n() == 1)
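The same idea — keep rows whose (Group_ID, Timestamp) combination occurs exactly once — can also be sketched in base R with `ave` (this base-R translation is my own, not from the answer):

```r
mydf <- data.frame(Group_ID = c("337", "337", "201", "201", "470", "470", "999", "999"),
                   Timestamp = c("A", "A", "B", "B", "C", "D", "E", "F"),
                   MU = c("1", "1", "2", "3", "4", "4", "5", "6"))

# Count the rows in each (Group_ID, Timestamp) combination, then keep singletons.
n_per_combo <- ave(seq_len(nrow(mydf)), mydf$Group_ID, mydf$Timestamp, FUN = length)
mydf[n_per_combo == 1, ]
```

Note this checks only Timestamp; like the dplyr version above, MU would need the same treatment.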

If we need only the '999' 'Group_ID'

mydf %>% 
  group_by(Group_ID) %>%
  filter_at(vars(Timestamp,  MU),  all_vars(n_distinct(.) == n()))
# A tibble: 2 x 3
# Groups:   Group_ID [1]
#  Group_ID Timestamp MU   
#  <fct>    <fct>     <fct>
#1 999      E         5    
#2 999      F         6    

Or use distinct directly

distinct(mydf, Group_ID, Timestamp, .keep_all = TRUE)

Upvotes: 5

d.b

Reputation: 32538

foo = function(x, f){
    ave(as.numeric(as.factor(x)),
        f,
        FUN = function(y) length(unique(y)) == length(y))
}

inds = Reduce("&", lapply(mydf[c("Timestamp", "MU")],
                          function(x) foo(x, mydf$Group_ID) == 1))

mydf[inds,]
#  Group_ID Timestamp MU
#7      999         E  5
#8      999         F  6
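The `Reduce("&", ...)` step AND-combines the per-column logical vectors, so a row survives only if *every* checked column passed the within-group uniqueness test. A compact illustration of that combining step (the two vectors here are invented for the example):

```r
# Each vector says, per row, whether one column was duplicate-free in its group.
timestamp_ok <- c(FALSE, FALSE, TRUE, TRUE)
mu_ok        <- c(FALSE, TRUE, FALSE, TRUE)
Reduce("&", list(timestamp_ok, mu_ok))  # FALSE FALSE FALSE TRUE
```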

Upvotes: 2
