Reputation: 311
I have a data frame like this:
df <- data.frame(genename = c("A","B","C","D"),
sample1 = c(10,0,50,0),
sample2 = c(0,30,0,70),
sample3 = c(50,0,0,30),
sample4 = c(0,0,0,10))
I want to extract the rows in which at least 50% of the sample columns have a value > 0. For example, in df the rows for genename A and D meet the requirement.
I have worked this out for the case where all sample columns must be > 0:
df2 <- df[apply(df[-1], MARGIN = 1, function(x) all(x > 0)), ]
but I can't work out how to do this when only a percentage of the columns has to meet the requirement.
Upvotes: 1
Views: 919
Reputation: 2361
Try this:
# >= 0.5 keeps rows where at least half of the sample columns are > 0
df[apply(df[, -1], 1, function(x) sum(x > 0) / length(x) >= 0.5), ]
  genename sample1 sample2 sample3 sample4
1        A      10       0      50       0
4        D       0      70      30      10
Upvotes: 0
Reputation: 7630
Here's a general solution:
df <- data.frame(genename = c("A","B","C","D"),
sample1 = c(0,10,0,0), sample2 = c(10,30,50,0), sample3=c(0,40,50,10), sample4=c(0,40,0,10))
df[(rowSums(df[-1]>0))>= (ncol(df[-1])/2),]
# genename sample1 sample2 sample3 sample4
# 2 B 10 30 40 40
# 3 C 0 50 50 0
# 4 D 0 0 10 10
This will work for any data frame where the first column is your gene name and you want 50% or more of the other columns to have values greater than 0.
The logic is as follows. Take the data frame from the second column onward, df[-1], and turn it into a logical matrix that is TRUE wherever a value is greater than 0: df[-1] > 0. Then count how many columns are TRUE in each row: rowSums(df[-1] > 0). This returns a vector of length nrow(df) whose values are the number of positive sample values in the corresponding row of df. Use that to build a logical vector marking the rows in which at least half of the sample values are greater than 0: rowSums(df[-1] > 0) >= ncol(df[-1]) / 2. Finally, subset df by rows to keep those for which the expression is TRUE.
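To make the steps above concrete, here is a small sketch that evaluates each intermediate expression on the example data frame from this answer. The variable names positive, n_positive and keep are only illustrative and are not part of the original one-liner.

```r
# Example data from this answer
df <- data.frame(genename = c("A", "B", "C", "D"),
                 sample1 = c(0, 10, 0, 0), sample2 = c(10, 30, 50, 0),
                 sample3 = c(0, 40, 50, 10), sample4 = c(0, 40, 0, 10))

# Step 1: logical matrix, TRUE where a sample value is greater than 0
positive <- df[-1] > 0

# Step 2: number of positive values in each row (A: 1, B: 4, C: 2, D: 2)
n_positive <- rowSums(positive)

# Step 3: TRUE for rows where at least half of the sample columns are positive
keep <- n_positive >= ncol(positive) / 2

# Step 4: subset the data frame by that logical vector (rows B, C and D)
df[keep, ]
```

Splitting the expression this way gives the same result as the one-liner, and it makes it easy to inspect n_positive when a row is unexpectedly kept or dropped.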
Upvotes: 0
Reputation: 50728
Solution using base R:
df[apply(df[, -1], 1, function(x) sum(x > 0) / length(x)) >= 0.5, ]
#  genename sample1 sample2 sample3 sample4
#1        A      10       0      50       0
#4        D       0      70      30      10
Explanation: keep the rows in which the fraction of entries > 0, taken across all columns except the first, is at least 50%.
Solution using dplyr:
library(dplyr)
df %>% mutate(frac = rowSums(.[-1] > 0) / length(.[-1])) %>% filter(frac >= 0.5)
#  genename sample1 sample2 sample3 sample4 frac
#1        A      10       0      50       0 0.50
#2        D       0      70      30      10 0.75
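If the cutoff needs to vary, the same idea can be wrapped in a small base R helper. The sketch below is only an illustration: the function name filter_by_fraction and its arguments min_frac and id_cols are made up here, and it assumes that every column other than the id column(s) is numeric and free of NA. It relies on rowMeans() of a logical matrix giving the fraction of TRUE values per row.

```r
# Hypothetical helper: keep rows where at least min_frac of the
# non-id columns hold a value greater than 0
filter_by_fraction <- function(data, min_frac = 0.5, id_cols = 1) {
  # Fraction of values > 0 in each row, ignoring the id column(s)
  frac <- rowMeans(data[-id_cols] > 0)
  data[frac >= min_frac, , drop = FALSE]
}

# Data frame from the question
df <- data.frame(genename = c("A", "B", "C", "D"),
                 sample1 = c(10, 0, 50, 0),
                 sample2 = c(0, 30, 0, 70),
                 sample3 = c(50, 0, 0, 30),
                 sample4 = c(0, 0, 0, 10))

filter_by_fraction(df)                   # keeps A (2 of 4) and D (3 of 4)
filter_by_fraction(df, min_frac = 0.75)  # keeps only D
```

Using >= rather than > matters for the question's data: row A has exactly half of its sample values above 0, so a strict comparison would drop it.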
Upvotes: 1