Adam Price

Reputation: 850

How to remove rows from a data frame when a % of columns have a value less than specified?

I have some data that I want to filter. I want to be able to say, "If a specified percentage of the values in a row are less than a given threshold, remove that row from the data frame."

Here is some sample data.

        Sample1  Sample2  Sample3  Sample4  Sample5  Sample6
Item1         0        0        0        0        0        0
Item2       478      440      522      578     1066     1045
Item3        16       14        9        6        6       20

Let's say I want to remove rows in which at least 50% of the columns have a value less than 10. In that scenario the Item1 and Item3 rows are removed, since Item3 has three of its six values (9, 6, 6) below 10.

If I change the criterion to 50% of columns with a value less than 7, then only Item1 goes, and Item2 and Item3 remain.

What's a neat way to accomplish this in R? This is a simple issue and I want to avoid writing messy code to accomplish it. From what I've read so far I should be doing this with lapply() maybe? I appreciate any insight.

Upvotes: 0

Views: 650

Answers (2)

Eric Watt

Reputation: 3230

library(data.table)

dat <- fread("Item Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
              Item1   0   0   0   0   0   0
              Item2   478 440 522 578 1066 1045
              Item3   16  14  9   6   6   20")    

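# Keep rows where more than half of the sample values exceed slice_val
# (so a row is dropped when at least 50% of its values are at or below it).
slice_val <- 10
dat[apply(dat[, !"Item"], 1, function(x) sum(x > slice_val)/length(x)) > 0.5]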

    Item Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
1: Item2     478     440     522     578    1066    1045

# Same filter with a lower cutoff: Item3 now has 4 of 6 values above it and is kept.
slice_val <- 7
dat[apply(dat[, !"Item"], 1, function(x) sum(x > slice_val)/length(x)) > 0.5]

    Item Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
1: Item2     478     440     522     578    1066    1045
2: Item3      16      14       9       6       6      20
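
To see why Item3 is dropped at a cutoff of 10 but kept at 7, it can help to look at the per-row share that the comparison above relies on. The following is a minimal base R sketch, not part of the data.table code above; the names vals, share_above_10, share_above_7, keep_10 and keep_7 are made up for illustration.

```r
# Rebuild the sample data as a plain data frame (base R only).
df <- read.table(text = "Item Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
Item1   0   0   0   0   0   0
Item2   478 440 522 578 1066 1045
Item3   16  14  9   6   6   20",
header = TRUE, stringsAsFactors = FALSE)

# Numeric matrix of the sample columns, labelled by item.
vals <- as.matrix(df[, -1])
rownames(vals) <- df$Item

# Share of values in each row that exceed the cutoff.
share_above_10 <- rowMeans(vals > 10)  # Item1 = 0, Item2 = 1, Item3 = 0.5
share_above_7  <- rowMeans(vals > 7)   # Item1 = 0, Item2 = 1, Item3 = 4/6

# A row is kept only when its share is strictly greater than 0.5,
# so Item3 (exactly 0.5 at a cutoff of 10) is dropped.
keep_10 <- share_above_10 > 0.5
keep_7  <- share_above_7 > 0.5
```

Note that a value exactly equal to the cutoff counts as "not above" here, which differs slightly from the "less than" wording in the question; with this sample data the result is the same.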

Upvotes: 1

G5W

Reputation: 37641

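You can do this just by indexing: count the values below the cutoff in each row with rowSums() and keep only the rows where that count is less than half the number of sample columns.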

## reproduce your data
df = read.table(text="ItemNum Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
Item1   0   0   0   0   0   0
Item2   478 440 522 578 1066 1045
Item3   16  14  9   6   6   20",
header=TRUE, stringsAsFactors=FALSE)

## keep rows where fewer than 3 of the 6 samples (under 50%) are below 10
df = df[which(rowSums(df[,2:7] < 10) < 3), ]
df
   ItemNum Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
2   Item2     478     440     522     578    1066    1045
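
The line above hard-codes both the cutoff (10) and the count (3 of 6 columns). If you want to vary them, one possible generalization is to use rowMeans() so the 50% is expressed as a fraction. This is only a sketch: the function name drop_low_rows and its arguments threshold, frac and cols are hypothetical, and it assumes the selected columns are numeric with no NA values.

```r
## reproduce the data
df <- read.table(text = "ItemNum Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
Item1   0   0   0   0   0   0
Item2   478 440 522 578 1066 1045
Item3   16  14  9   6   6   20",
header = TRUE, stringsAsFactors = FALSE)

## Hypothetical helper: drop rows where at least `frac` of the values
## in `cols` are below `threshold`. By default every column except the
## first (the item label) is checked.
drop_low_rows <- function(df, threshold, frac = 0.5, cols = -1) {
  ## fraction of values in each row that fall below the threshold
  share_below <- rowMeans(df[, cols, drop = FALSE] < threshold)
  ## keep only rows whose share is strictly under the allowed fraction
  df[share_below < frac, , drop = FALSE]
}

drop_low_rows(df, 10)  ## keeps Item2 only
drop_low_rows(df, 7)   ## keeps Item2 and Item3
```

Using a fraction rather than a count means the same call keeps working if the number of sample columns changes.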

Upvotes: 1
