Melih O.

Reputation: 45

Eliminate specific rows in a dataset

I have a data frame stored as a .csv file with 34,500 rows. It holds the results of an RNA-seq analysis. The problem is that some genes have multiple entries, and I need to keep one entry per gene: the one with the highest p-value. I have already trimmed the data down to just the "Gene symbol" and "p value" columns.

How can I remove the rows for genes that should be eliminated according to this rule? I will add a screenshot that shows the problem.

Thanks in advance.

RNF144A, TTTY14, TAS2R8, KIAA0355, and GCNT2 are examples of the problem.

Upvotes: 1

Views: 60

Answers (1)

akrun

Reputation: 886988

Assuming that the blanks ("") correspond to repeat entries of the previous non-blank 'Gene', change the blanks to NA (na_if), use fill to replace each NA with the previous non-NA value, then group by 'Gene' and keep the row with the maximum 'pvalue':

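library(dplyr)
library(tidyr)
df1 %>%
    # turn blank gene symbols into NA so that fill() can see them
    mutate(Gene = na_if(Gene, "")) %>%
    # carry the last non-NA gene symbol downward
    fill(Gene) %>%
    group_by(Gene) %>%
    # keep the row with the largest p-value in each gene
    slice(which.max(pvalue)) %>%
    ungroup()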
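
To see what each step does, here is a minimal sketch on a small made-up data frame. The values are hypothetical, and the column names `Gene` and `pvalue` are assumed to match the real file; only the gene symbols come from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: a blank Gene cell is assumed to repeat the gene above it
df1 <- data.frame(
  Gene   = c("RNF144A", "", "TTTY14", "GCNT2", "", ""),
  pvalue = c(0.02, 0.40, 0.10, 0.90, 0.05, 0.30),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  mutate(Gene = na_if(Gene, "")) %>%  # "" becomes NA
  fill(Gene) %>%                      # NA takes the previous non-NA gene symbol
  group_by(Gene) %>%
  slice(which.max(pvalue)) %>%        # one row per gene: the largest p-value
  ungroup()

print(result)
```

With this toy input the result has one row per gene: GCNT2 with 0.90, RNF144A with 0.40, and TTTY14 with 0.10. If the goal were instead the most significant entry per gene (the smallest p-value), swapping `which.max` for `which.min` should do it.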

Upvotes: 1
