lizz0427

Reputation: 31

R: Remove rows with fewer than a certain number of non-zero values

I would like to know how to remove rows from a data frame that have fewer than (let's say 5) non-zero entries.

The closest I've come is:

length(which(df[1,] > 0)) >= 5

but how do I apply this to the whole data frame and drop the rows for which it is FALSE? Is there a function similar to Excel's COUNTIF() that I can apply here?

Thank you for your help.

Upvotes: 3

Views: 4248

Answers (2)

milan

Reputation: 4970

You can also use a for-loop.

We first create a matrix of zeros and ones to test our code. Row 2 has to be excluded because it has fewer than 5 non-zero values.

In the loop we count the non-zero values in each row and record TRUE if the count is less than 5 (FALSE otherwise). The vector named 'drop' therefore holds one TRUE or FALSE per row, where TRUE marks a row to remove. In the final step, we exclude the rows for which drop is TRUE.

mat <- matrix(c(1,1,1,1,0,1,1,1,1,1,1,1,1,1,1), nrow=3, ncol=5)
mat

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    1
[2,]    1    0    1    1    1
[3,]    1    1    1    1    1

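# One TRUE/FALSE per row: TRUE marks a row with fewer than 5 non-zero values
drop <- logical(nrow(mat))
for(i in seq_len(nrow(mat))){
  count.non.zero <- sum(mat[i,]!=0, na.rm=TRUE)
  drop[i] <- count.non.zero<5
}

# Keep the rows not marked for removal; the drop=FALSE argument (unrelated to
# our 'drop' vector) keeps the result a matrix even if only one row remains
mat[!drop, , drop=FALSE]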

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    1
[2,]    1    1    1    1    1

NOTE: na.rm=TRUE allows this script to work when your data contains missing values. NAs are skipped in the count, so a missing value is not treated as non-zero.
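
To illustrate the note about missing values, here is a minimal sketch using a hypothetical row vector x (made up for illustration, not part of the matrix above). It shows that the count becomes NA without na.rm=TRUE:

```r
# Hypothetical row with one missing value (illustration only)
x <- c(1, 0, NA, 2, 3)

# Comparing NA with 0 yields NA, so the logical vector contains an NA
is.nonzero <- x != 0

# Without na.rm the sum is NA, so a later comparison with the threshold is NA too
count.with.na <- sum(is.nonzero)

# With na.rm=TRUE the NA is skipped: only the known non-zero values are counted
count.non.zero <- sum(is.nonzero, na.rm=TRUE)
```

Here count.non.zero is 3, so this row would be dropped with a threshold of 5.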

Upvotes: 0

bergant

Reputation: 7232

You can use logical values both in rowSums and in [:

 df[ rowSums(df > 0) >= 5, ]

There are 3 steps hidden in this expression:

  • The expression df > 0 produces a logical matrix that is TRUE wherever an element is greater than 0 (use df != 0 if negative values should also count as non-zero)
  • The function rowSums returns the number of TRUE values in each row (when summing, it treats TRUE as 1 and FALSE as 0)
  • Finally, [ selects only the rows where that count is >= 5
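
The three steps can be traced one at a time on a small example. This is only a sketch: the data frame below and its column names are made up for illustration, not the asker's data. It uses df != 0 rather than df > 0, on the assumption that negative entries should count as non-zero.

```r
# Made-up data: row 2 has only four non-zero entries,
# row 3 has five non-zero entries, one of them negative
df <- data.frame(a = c(1, 0, 2),
                 b = c(3, 0, -1),
                 c = c(1, 5, 1),
                 d = c(2, 1, 4),
                 e = c(1, 2, 1),
                 f = c(0, 7, 0))

# Step 1: logical matrix, TRUE where the entry is non-zero
nonzero <- df != 0

# Step 2: number of TRUE values per row (TRUE counts as 1, FALSE as 0)
counts <- rowSums(nonzero)

# Step 3: keep only the rows with at least 5 non-zero entries
result <- df[counts >= 5, ]
```

With this data, result contains rows 1 and 3. Using df > 0 instead would also drop row 3, because its negative entry would not be counted.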

Upvotes: 3
