Reputation: 45
I have a data frame in .csv format with 34,500 rows. The file lists the results of an RNA-seq analysis. The problem is that some genes have multiple results; I need to keep one entry per gene, and that entry should be the one with the highest p value. I have edited my data so that it contains only the "Gene symbol" and "p value" columns.
How can I remove the rows that should be eliminated according to this rule? I will add a screenshot which shows my problem.
Thanks in advance.
Upvotes: 1
Views: 60
Reputation: 886988
Assuming that the blanks ("") correspond to repeat entries from the previous non-blank "Gene", change the blanks to NA (na_if), then use fill to replace each NA with the previous non-NA value. After that, group by 'Gene' and keep the row with the max value for 'pvalue':
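library(dplyr)
library(tidyr)

df1 %>%
  mutate(Gene = na_if(Gene, "")) %>%  # blank gene names become NA
  fill(Gene) %>%                      # carry the last non-NA gene name downward
  group_by(Gene) %>%
  slice(which.max(pvalue)) %>%        # keep the row with the largest p value per gene
  ungroup()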
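To see what the pipeline above does, here is a minimal sketch on made-up data. The gene names and p values are hypothetical, and the layout (a blank "Gene" cell meaning "same gene as the row above") is the same assumption stated in the answer, since the screenshot is not available here.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: a blank Gene is assumed to repeat the gene above it
df1 <- data.frame(
  Gene   = c("BRCA1", "", "TP53", "", ""),
  pvalue = c(0.01, 0.20, 0.50, 0.03, 0.70),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  mutate(Gene = na_if(Gene, "")) %>%  # "" -> NA
  fill(Gene) %>%                      # NA -> previous non-NA gene name
  group_by(Gene) %>%
  slice(which.max(pvalue)) %>%        # one row per gene: the largest p value
  ungroup()

print(result)
# Expected rows: BRCA1 with pvalue 0.20, TP53 with pvalue 0.70
```

If what you actually want is the most significant entry per gene (the smallest p value), swapping which.max for which.min should do it. In dplyr 1.0.0 or later, slice_max(pvalue, n = 1, with_ties = FALSE) is an equivalent way to write the slice step.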
Upvotes: 1