Reputation: 373
I have a large data frame with many duplicates in a single column. I am trying to parse the data frame so that only one entry per duplicate remains, UNLESS all entries are duplicates. (Couldn't find any Stack Overflow answers that helped with the second part...)
Example df code:
mydf <- data.frame(
  accession = c("A", "A", "A", "A", "B", "B", "C", "C", "D"),
  gene = c("unknown", "red1", "red2", "blue", "green1", "green2", "unknown", "unknown2", "violet"),
  ident = c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86)
)
df looks like this:
  accession     gene ident
1         A  unknown 100.0
2         A     red1  95.3
3         A     red2  80.2
4         A     blue  65.1
5         B   green1  94.2
6         B   green2 100.0
7         C  unknown  97.1
8         C unknown2  90.0
9         D   violet  86.0
And my desired output table is this:
  accession     gene ident
2         A     red1  95.3
6         B   green2 100.0
7         C  unknown  97.1
8         C unknown2  90.0
Where only one unique value for accession is kept, based on having a "known" gene with the highest ident, UNLESS all duplicated entries for a particular accession contain the string unknown*.
I'm getting stuck at the last part -- keeping all rows for a duplicated accession if gene contains unknown*. This is what I have so far:
library(dplyr)
mydf$dup <- duplicated(mydf$accession, fromLast = FALSE)|duplicated(mydf$accession, fromLast = TRUE)
mydf <- mydf %>% group_by(accession) %>% mutate(count=n())
mydf <- subset.data.frame(mydf, mydf$dup == TRUE)
mydf <- mydf %>% group_by(accession) %>% filter(!grepl("unknown", gene)) %>% top_n(1,ident)
which gives:
  accession   gene ident  dup count
2         A   red1  95.3 TRUE     4
6         B green2 100.0 TRUE     2
My instinct is to do an if statement:
mydf <- mydf %>% group_by(accession) %>%
if(count(grepl("unknown", mydf$gene))!= mydf$count)
{filter(!grepl("unknown", gene))}
%>% top_n(1, ident)
but I'm running into an error:
Error in if (.) count(grepl("unknown", mydf$gene)) != mydf$count else { :
  argument is not interpretable as logical
In addition: Warning message:
In if (.) count(grepl("unknown", mydf$gene)) != mydf$count else { :
  the condition has length > 1 and only the first element will be used
What's the correct solution? I'm not married to dplyr if there's a better way! Thanks!
Upvotes: 2
Views: 4062
Reputation: 214927
Another option:
1) first, arrange the data frame so that unknown genes are sorted to the end of each group and, at the same time, ident is sorted in descending order;
2) filter per group: require that the group has more than one row, and then either keep every row when the first gene starts with unknown (which means the whole group consists of unknown genes, since unknown has been sorted to the end), or otherwise keep only the first row:
mydf %>%
  group_by(accession) %>%
  arrange(startsWith(gene, 'unknown'), desc(ident)) %>%
  filter(n() > 1 & (startsWith(first(gene), 'unknown') | row_number() == 1))
# A tibble: 4 x 3
# Groups: accession [3]
# accession gene ident
# <chr> <chr> <dbl>
#1 B green2 100.0
#2 A red1 95.3
#3 C unknown 97.1
#4 C unknown2 90.0
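For comparison, the same rule can probably be written without relying on the sort order, by testing each group directly with all(). This is only a sketch run against the example data from the question. The helper column is_unknown is a name introduced here, and stringsAsFactors = FALSE is an assumption so that startsWith() receives character input on older R versions.

```r
library(dplyr)

# Example data from the question; stringsAsFactors = FALSE is an assumption
# made here so that startsWith() gets character vectors on R < 4.0.
mydf <- data.frame(
  accession = c("A", "A", "A", "A", "B", "B", "C", "C", "D"),
  gene = c("unknown", "red1", "red2", "blue", "green1", "green2",
           "unknown", "unknown2", "violet"),
  ident = c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86),
  stringsAsFactors = FALSE
)

result <- mydf %>%
  group_by(accession) %>%
  filter(n() > 1) %>%                                   # drop accessions that occur only once
  mutate(is_unknown = startsWith(gene, "unknown")) %>%  # hypothetical helper column
  filter(
    all(is_unknown) |                                   # all-unknown group: keep every row
      (!is_unknown &
         # best known row; the extra -Inf avoids a warning when no gene is known
         ident == max(ident[!is_unknown], -Inf))
  ) %>%
  ungroup() %>%
  select(accession, gene, ident)

print(result)
```

Unlike the arrange() version, this keeps the original row order, and if two known genes in a group tie on the highest ident, both rows are kept.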
Upvotes: 3
Reputation: 173517
You could try this:
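mydf %>%
  group_by(accession) %>%
  mutate(n = n()) %>%
  filter(n > 1) %>%  # keep only duplicated accessions
  # rank ident within each group, then push unknown genes below every known one
  mutate(ident_rnk = min_rank(ident),
         ident_rnk = if_else(grepl("unknown", gene), -1L, ident_rnk)) %>%
  # top_n() keeps ties, so a group where every gene is unknown (all -1) is returned whole
  top_n(n = 1, wt = ident_rnk) %>%
  select(accession, gene, ident)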
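Since the question is open to solutions outside dplyr, here is a base R sketch of the same rule using split() and lapply(). It is an illustration under assumptions, not a tuned solution: keep_rows is a hypothetical helper name introduced here, and it matches "unknown" with grepl() in the same way as the code in the question.

```r
# Example data from the question (stringsAsFactors = FALSE is an assumption
# so that the columns are plain character vectors on any R version).
mydf <- data.frame(
  accession = c("A", "A", "A", "A", "B", "B", "C", "C", "D"),
  gene = c("unknown", "red1", "red2", "blue", "green1", "green2",
           "unknown", "unknown2", "violet"),
  ident = c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86),
  stringsAsFactors = FALSE
)

# Hypothetical helper: decide which rows of one accession group to keep.
keep_rows <- function(g) {
  if (nrow(g) < 2) return(g[0, ])      # accession occurs once: drop it
  unk <- grepl("unknown", g$gene)
  if (all(unk)) return(g)              # every gene unknown: keep the whole group
  known <- g[!unk, ]
  known[which.max(known$ident), ]      # best known gene (first one on ties)
}

# Apply the helper per accession and stack the pieces back together.
result <- do.call(rbind, lapply(split(mydf, mydf$accession), keep_rows))
rownames(result) <- NULL

print(result)
```

Note that which.max() returns only the first maximum, so ties on ident among known genes are resolved by row order here, whereas top_n() would keep all tied rows.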
Upvotes: 2