moxed

Reputation: 373

R - Group by dplyr, and remove duplicates only if ALL members in group are duplicated

I have a large data frame with many duplicates in a single column. I am trying to filter the data frame so that only one entry per duplicate remains, UNLESS all entries are duplicates. (Couldn't find any Stack Overflow answers that helped with the second part...)

Example df code:

mydf <- data.frame(accession=c("A", "A", "A", "A", "B", "B", "C", "C", "D"), gene=c("unknown", "red1", "red2", "blue", "green1", "green2", "unknown", "unknown2", "violet"), ident=c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86))

df looks like this:

   accession   gene      ident
1  A           unknown   100.0   
2  A           red1      95.3
3  A           red2      80.2
4  A           blue      65.1
5  B           green1    94.2
6  B           green2    100.0
7  C           unknown   97.1
8  C           unknown2  90.0
9  D           violet    86.0

And my desired output table is this:

   accession   gene      ident   
2  A           red1      95.3
6  B           green2    100.0
7  C           unknown   97.1
8  C           unknown2  90.0

Where only one unique value for accession is kept, based on having a "known" gene with the highest ident, UNLESS all duplicated entries for a particular accession contain the string unknown*.

I'm getting stuck at the last part -- keeping all rows for a duplicated accession if gene contains unknown*. This is what I have so far:

library(dplyr)
mydf$dup <- duplicated(mydf$accession, fromLast = FALSE)|duplicated(mydf$accession, fromLast = TRUE)
mydf <- mydf %>% group_by(accession) %>% mutate(count=n())
mydf <- subset.data.frame(mydf, mydf$dup == TRUE)
mydf <- mydf %>% group_by(accession) %>% filter(!grepl("unknown", gene)) %>% top_n(1,ident)

which gives:

   accession   gene      ident   dup    count   
2  A           red1      95.3    TRUE   4
6  B           green2    100.0   TRUE   2

My instinct is to do an if statement:

mydf <- mydf %>% group_by(accession) %>% 
if(count(grepl("unknown", mydf$gene))!= mydf$count)
      {filter(!grepl("unknown", gene))} 
%>% top_n(1, ident)

but I'm running into an error:

Error in if (.) count(grepl("unknown", mydf$gene)) != mydf$count else { : 
  argument is not interpretable as logical
In addition: Warning message:
In if (.) count(grepl("unknown", mydf$gene)) != mydf$count else { : 
  the condition has length > 1 and only the first element will be used

What's the correct solution? I'm not married to dplyr if there is a better way! Thanks!

Upvotes: 2

Views: 4062

Answers (2)

akuiper

Reputation: 214927

Another option:

1) First, arrange the data frame so that unknown genes are sorted to the end and, within that, ident is sorted in descending order;

2) Then filter per group: require that the group has more than one row, and keep either every row when the first gene starts with unknown (because unknown has been sorted to the end, this means the whole group is unknown), or only the first row otherwise:

mydf %>% 
    group_by(accession) %>% 
    arrange(startsWith(gene, 'unknown'), desc(ident)) %>% 
    filter(n() > 1 & (startsWith(first(gene), 'unknown') | row_number() == 1))

# A tibble: 4 x 3
# Groups:   accession [3]
#  accession     gene ident
#      <chr>    <chr> <dbl>
#1         B   green2 100.0
#2         A     red1  95.3
#3         C  unknown  97.1
#4         C unknown2  90.0

Upvotes: 3

joran

Reputation: 173517

You could try this:

mydf %>%
  group_by(accession) %>%
  mutate(n = n()) %>%
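  filter(n > 1) %>%                      # keep only duplicated accessions
  mutate(ident_rnk = min_rank(ident),    # rank ident within each accession
         # push "unknown" genes below every known gene
         ident_rnk = if_else(grepl("unknown", gene), -1L, ident_rnk)) %>%
  top_n(n = 1, wt = ident_rnk) %>%       # ties (all-unknown groups) are all kept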
  select(accession,gene,ident)
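
For comparison, here is a sketch of a variant that is not part of the answer above. It expresses the "if" condition from the question directly inside filter(), using all() per group. The helper column is_unknown and the name result are made up for illustration, and stringsAsFactors = FALSE is an assumption so that gene is a character column on older R versions too.

```r
library(dplyr)

# Example data from the question; stringsAsFactors = FALSE is an assumption
mydf <- data.frame(
  accession = c("A", "A", "A", "A", "B", "B", "C", "C", "D"),
  gene = c("unknown", "red1", "red2", "blue", "green1", "green2",
           "unknown", "unknown2", "violet"),
  ident = c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86),
  stringsAsFactors = FALSE
)

result <- mydf %>%
  group_by(accession) %>%
  filter(n() > 1) %>%                              # drop accessions that occur only once
  mutate(is_unknown = grepl("unknown", gene)) %>%  # hypothetical helper column
  # all-unknown groups keep every row; other groups keep only their known rows
  filter(all(is_unknown) | !is_unknown) %>%
  # all-unknown groups pass through again; other groups keep the highest ident
  filter(all(is_unknown) | ident == max(ident)) %>%
  ungroup() %>%
  select(accession, gene, ident)

print(result)
```

Because all() collapses the group to a single TRUE or FALSE, the condition has length 1 per group, which is what the if() attempt in the question was missing. Rows tied for the highest ident within a group would all be kept.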

Upvotes: 2
