zoe

Reputation: 311

Extract a percentage of columns meeting filter R

I have a data frame like this:

df <- data.frame(genename = c("A","B","C","D"),
             sample1 = c(10,0,50,0), 
             sample2 = c(0,30,0,70), 
             sample3 = c(50,0,0,30), 
             sample4 = c(0,0,0,10))

I want to extract the rows in which at least 50% of the sample columns have a value > 0. For this df, genenames A and D meet the requirement.

I have worked this out for the case where all columns must be > 0:

df2<-as.data.frame(df[apply(df ,MARGIN=1, function(x) all(x>0)),])

but I can't work out how to do this when only a percentage of the columns has to meet the requirement.

Upvotes: 1

Views: 919

Answers (3)

ctbrown

Reputation: 2361

Try this:

df[
  apply(df[, -1], 1, function(x) sum(x > 0) / length(x) >= 0.5),
]

  genename sample1 sample2 sample3 sample4
1        A      10       0      50       0
4        D       0      70      30      10

Upvotes: 0

De Novo

Reputation: 7630

Here's a general solution:

df <- data.frame(genename = c("A", "B", "C", "D"),
                 sample1 = c(0, 10, 0, 0),
                 sample2 = c(10, 30, 50, 0),
                 sample3 = c(0, 40, 50, 10),
                 sample4 = c(0, 40, 0, 10))

df[rowSums(df[-1] > 0) >= ncol(df[-1]) / 2, ]
#   genename sample1 sample2 sample3 sample4
# 2        B      10      30      40      40
# 3        C       0      50      50       0
# 4        D       0       0      10      10

This will work for any data frame where the first column is your gene name and you want 50% or more of the other columns to have values greater than 0.

The logic of this is as follows:

Take the data frame from the second column onward, df[-1], and turn it into a logical matrix that is TRUE wherever a value is greater than 0: df[-1] > 0. Then count the TRUE values in each row: rowSums(df[-1] > 0). This returns a vector of length nrow(df) whose values are the number of entries greater than 0 in the corresponding row of df. Use that to build a logical vector marking the rows where at least half of the sample values are greater than 0: rowSums(df[-1] > 0) >= ncol(df[-1]) / 2, and subset df by rows to keep those for which the expression is TRUE.
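The steps above can be traced one intermediate at a time. This is only a sketch using the data frame from this answer; the variable names positive, n_positive, keep and result are made up for illustration.

```r
df <- data.frame(genename = c("A", "B", "C", "D"),
                 sample1 = c(0, 10, 0, 0),
                 sample2 = c(10, 30, 50, 0),
                 sample3 = c(0, 40, 50, 10),
                 sample4 = c(0, 40, 0, 10))

# Logical matrix: TRUE where a sample value is greater than 0
positive <- df[-1] > 0

# Number of TRUE entries per row: 1, 4, 2, 2 for genes A, B, C, D
n_positive <- rowSums(positive)

# TRUE for rows where at least half of the sample columns are > 0
keep <- n_positive >= ncol(df[-1]) / 2

# Subset by rows; genes B, C and D remain
result <- df[keep, ]
```

Splitting the one-liner like this makes it easy to inspect n_positive when a row is unexpectedly kept or dropped.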

Upvotes: 0

Maurits Evers

Reputation: 50728

Method 1

Solution using base R:

df[apply(df[, -1], 1, function(x) sum(x > 0) / length(x)) >= 0.5, ]
#  genename sample1 sample2 sample3 sample4
#1        A      10       0      50       0
#4        D       0      70      30      10

Explanation: Keep the rows in which the fraction of entries > 0, across all columns except the first, is at least 50%.
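One detail worth checking with the question's data is the boundary case: gene A has exactly 2 of 4 samples above 0, so a strict > 0.5 comparison drops it, while >= 0.5 keeps it. A small sketch of this, assuming the df from the question; rowMeans() on a logical matrix gives the fraction of TRUE values per row, and keep_rows_by_fraction is a hypothetical helper name, not a function from any package.

```r
df <- data.frame(genename = c("A", "B", "C", "D"),
                 sample1 = c(10, 0, 50, 0),
                 sample2 = c(0, 30, 0, 70),
                 sample3 = c(50, 0, 0, 30),
                 sample4 = c(0, 0, 0, 10))

# Hypothetical helper: keep rows where at least `min_frac` of the
# non-ID columns are greater than 0. `id_cols` holds the positions of
# the columns to ignore (here the gene name in column 1).
keep_rows_by_fraction <- function(data, min_frac = 0.5, id_cols = 1) {
  frac <- rowMeans(data[-id_cols] > 0)
  data[frac >= min_frac, , drop = FALSE]
}

at_least_half  <- keep_rows_by_fraction(df)          # genes A and D
more_than_half <- df[rowMeans(df[-1] > 0) > 0.5, ]   # gene D only
```

Making the threshold an argument also covers other cutoffs, e.g. keep_rows_by_fraction(df, min_frac = 0.75).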

Method 2

Solution using dplyr:

library(dplyr)
df %>% mutate(frac = rowSums(.[-1] > 0) / length(.[-1])) %>% filter(frac >= 0.5)
#  genename sample1 sample2 sample3 sample4 frac
#1        A      10       0      50       0 0.50
#2        D       0      70      30      10 0.75

Upvotes: 1
