R - Group by dplyr, and remove duplicates only if ALL members in group are duplicated

Question

I have a large data frame with many duplicates in a single column. I am trying to filter the data frame so that only one entry per duplicate remains, UNLESS all entries are duplicates. (Couldn't find any Stack Overflow answers that helped with the second part...)

Example df code:

mydf <- data.frame(accession=c("A", "A", "A", "A", "B", "B", "C", "C", "D"), gene=c("unknown", "red1", "red2", "blue", "green1", "green2", "unknown", "unknown2", "violet"), ident=c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86))

df looks like this:

   accession   gene      ident
1  A           unknown   100.0   
2  A           red1      95.3
3  A           red2      80.2
4  A           blue      65.1
5  B           green1    94.2
6  B           green2    100.0
7  C           unknown   97.1
8  C           unknown2  90.0
9  D           violet    86.0

And my desired output table is this:

   accession   gene      ident   
2  A           red1      95.3
6  B           green2    100.0
7  C           unknown   97.1
8  C           unknown2  90.0

Here only one row per duplicated accession is kept, namely the one with a "known" gene and the highest ident, UNLESS all duplicated entries for a particular accession contain the string unknown*, in which case all of its rows are kept. Accessions that occur only once (D here) are dropped.

I'm getting stuck at the last part -- keeping all rows for a duplicated accession if gene contains unknown*. This is what I have so far:

library(dplyr)
mydf$dup <- duplicated(mydf$accession, fromLast = FALSE)|duplicated(mydf$accession, fromLast = TRUE)
mydf <- mydf %>% group_by(accession) %>% mutate(count=n())
mydf <- subset.data.frame(mydf, mydf$dup == TRUE)
mydf <- mydf %>% group_by(accession) %>% filter(!grepl("unknown", gene)) %>% top_n(1,ident)

which gives:

   accession   gene      ident   dup    count   
2  A           red1      95.3    TRUE   4
6  B           green2    100.0   TRUE   2

My instinct is to do an if statement:

mydf <- mydf %>% group_by(accession) %>% 
if(count(grepl("unknown", mydf$gene))!= mydf$count)
      {filter(!grepl("unknown", gene))} 
%>% top_n(1, ident)

but I'm running into an error:

Error in if (.) count(grepl("unknown", mydf$gene)) != mydf$count else { : 
  argument is not interpretable as logical
In addition: Warning message:
In if (.) count(grepl("unknown", mydf$gene)) != mydf$count else { : 
  the condition has length > 1 and only the first element will be used

What's the correct solution? I'm not married to dplyr if there's a better way! Thanks!

joran · Accepted Answer

You could try this:

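mydf %>%
  group_by(accession) %>%
  mutate(n = n()) %>%                  # group size
  filter(n > 1) %>%                    # keep only duplicated accessions
  mutate(ident_rnk = min_rank(ident),  # rank ident within each accession
         # push "unknown" genes below every known gene
         ident_rnk = if_else(grepl("unknown",gene),-1L,ident_rnk)) %>%
  top_n(n = 1,wt = ident_rnk) %>%      # ties are kept, so all-unknown groups keep every row
  select(accession,gene,ident)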
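
If you would rather spell the condition out, closer to the if statement in the question, here is a minimal sketch of an alternative. It is not the code above, just one way it could be written. It assumes a reasonably recent dplyr, re-creates the example mydf from the question, and uses a helper column is_unknown and a variable result, both names chosen only for illustration. The if in the question probably failed because %>% passed the piped data frame to if as its first argument, which is what the "if (.)" in the error message suggests; an if/else placed inside filter() is instead evaluated once per group.

```r
library(dplyr)

# Example data from the question
mydf <- data.frame(
  accession = c("A", "A", "A", "A", "B", "B", "C", "C", "D"),
  gene = c("unknown", "red1", "red2", "blue", "green1", "green2",
           "unknown", "unknown2", "violet"),
  ident = c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86)
)

result <- mydf %>%
  group_by(accession) %>%
  filter(n() > 1) %>%                              # drop accessions that occur only once
  mutate(is_unknown = grepl("unknown", gene)) %>%  # illustrative helper column
  filter(
    if (all(is_unknown)) {
      TRUE                                         # every gene is unknown: keep the whole group
    } else {
      # otherwise keep the known gene(s) with the highest ident
      !is_unknown & ident == max(ident[!is_unknown])
    }
  ) %>%
  ungroup() %>%
  select(accession, gene, ident)

print(result)
```

Like top_n(), this keeps every row that ties for the highest ident among the known genes, so add a tie-breaker if you need exactly one row per accession.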
