micron

Reputation: 21

R filtering a dataframe for a proportion of columns meeting criteria

I'm sure the answer to this question is out there already, but I can't find it, since I'm a beginner at R and don't know what search terms to use.

I want to retrieve the rows in a data frame where a given proportion of the columns meet a criterion. For example, rows where at least 2/3 of the columns are > 1.3.

Here is what I have so far:

a<-c(1.1,1.2,1.3,1.4,1.5)
b<-c(1.3,1.4,1.5,1.6,1.7)
c<-c(1.5,1.6,1.7,1.8,1.9)
data<-data.frame(a,b,c)
data

   a   b   c
1 1.1 1.3 1.5
2 1.2 1.4 1.6
3 1.3 1.5 1.7
4 1.4 1.6 1.8
5 1.5 1.7 1.9


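# TRUE when at least 2/3 of the values in a row exceed 1.4
# (named `check` rather than `c` so that base::c() is not masked)
check <- function(x) length(x[x > 1.4]) >= 2/3 * ncol(data)
d <- apply(data, 1, check)
result <- data[d, ]
result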

   a   b   c
3 1.3 1.5 1.7
4 1.4 1.6 1.8
5 1.5 1.7 1.9

This works, but I feel like there must be a simpler way, or that the function could be written differently. I'm still trying to properly understand how functions work.

Of course, in reality my data frame would have a lot of columns.

/Grateful beginner

Upvotes: 2

Views: 148

Answers (2)

Jase_

Reputation: 1196

Just to give an alternative to David's answer: you can use the mean function on a vector of logical values in R to return the proportion of TRUE values in the vector.

Create the data

a<-c(1.1, 1.2, 1.3, 1.4, 1.5)
b<-c(1.3, 1.4, 1.5, 1.6, 1.7)
c<-c(1.5, 1.6, 1.7, 1.8, 1.9)
data<-data.frame(a, b, c)

A function to return a logical vector indicating if the values are above the threshold

gt <- function(x, threshold){
  tmp <- x > threshold
  return(tmp)
}

An example using the first row of the data.frame

gt(data[1,], 1.4)

If you take the sum of the logical vector it returns the number of TRUE instances:

sum(gt(data[1,], 1.4))
# [1] 1

and if you use the mean function it returns the proportion of TRUE instances:

mean(gt(data[1,], 1.4))
# [1] 0.3333333
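
As a minimal sketch of why this works, here is the same idea on a plain logical vector. The names `flags`, `n_true` and `prop_true` are made up for illustration. In arithmetic, R treats TRUE as 1 and FALSE as 0, so sum counts the TRUE values and mean gives their share:

```r
# Hypothetical logical vector: two TRUE values out of four
flags <- c(TRUE, FALSE, TRUE, FALSE)

n_true <- sum(flags)      # TRUE counts as 1, FALSE as 0
prop_true <- mean(flags)  # share of TRUE values, between 0 and 1

n_true
# [1] 2
prop_true
# [1] 0.5
```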

Using that, you can apply the same count-based criterion as in David's answer, row by row:

index <- apply(data,1, function(x) sum(gt(x, 1.4)) >= 2/3 * length(x))

Or you can use the proportion via the mean function (with three columns, > 0.6 selects the same rows as >= 2/3):

index <- apply(data,1, function(x) mean(gt(x, 1.4)) > 0.6)
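
To finish the example, the logical index can then be used to subset the rows. This is a minimal sketch that rebuilds the data and `gt` from above so it runs on its own; `result` is just an illustrative name:

```r
# Rebuild the example data
a <- c(1.1, 1.2, 1.3, 1.4, 1.5)
b <- c(1.3, 1.4, 1.5, 1.6, 1.7)
c <- c(1.5, 1.6, 1.7, 1.8, 1.9)
data <- data.frame(a, b, c)

# Logical vector: which values are above the threshold
gt <- function(x, threshold) x > threshold

# One TRUE/FALSE per row: is the share of values above 1.4 greater than 0.6?
index <- apply(data, 1, function(x) mean(gt(x, 1.4)) > 0.6)

# Keep only the rows where index is TRUE (rows 3, 4 and 5 here)
result <- data[index, ]
result
```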

Upvotes: 0

David Arenburg

Reputation: 92300

Maybe the following (it should be more efficient, since rowSums is vectorized and avoids the need for an apply loop):

data[rowSums(data > 1.4) >= 2/3*ncol(data),]

##     a   b   c
## 3 1.3 1.5 1.7
## 4 1.4 1.6 1.8
## 5 1.5 1.7 1.9
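
To see what the one-liner does, it may help to split it into steps. This is only a sketch of the same expression; the intermediate names `above`, `counts` and `keep` are made up for illustration:

```r
# Rebuild the example data
a <- c(1.1, 1.2, 1.3, 1.4, 1.5)
b <- c(1.3, 1.4, 1.5, 1.6, 1.7)
c <- c(1.5, 1.6, 1.7, 1.8, 1.9)
data <- data.frame(a, b, c)

# Step 1: compare every cell to the threshold (gives a logical matrix)
above <- data > 1.4

# Step 2: count the TRUE values in each row
# (the counts are 1, 1, 2, 2, 3 for rows 1 to 5)
counts <- rowSums(above)

# Step 3: keep rows where the count reaches 2/3 of the number of columns
keep <- counts >= 2/3 * ncol(data)
data[keep, ]
```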

Or, if you prefer a function, you could try:

myfunc <- function(x) x[rowSums(x > 1.4) >= 2/3*ncol(x), ]
myfunc(data)

##     a   b   c
## 3 1.3 1.5 1.7
## 4 1.4 1.6 1.8
## 5 1.5 1.7 1.9
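
If the threshold and the proportion need to vary, both could be turned into arguments. The following is a sketch under that assumption; `filter_rows`, `threshold` and `prop` are hypothetical names, not part of base R:

```r
# Rebuild the example data
a <- c(1.1, 1.2, 1.3, 1.4, 1.5)
b <- c(1.3, 1.4, 1.5, 1.6, 1.7)
c <- c(1.5, 1.6, 1.7, 1.8, 1.9)
data <- data.frame(a, b, c)

# Hypothetical helper: keep the rows of df where at least a share `prop`
# of the columns have a value greater than `threshold`
filter_rows <- function(df, threshold, prop) {
  keep <- rowSums(df > threshold) >= prop * ncol(df)
  df[keep, , drop = FALSE]  # drop = FALSE keeps a data frame even with one column
}

# At least 2/3 of the columns above 1.4: rows 3, 4 and 5
kept <- filter_rows(data, 1.4, 2/3)
kept

# All columns above 1.4: only row 5
kept_all <- filter_rows(data, 1.4, 1)
kept_all
```

Because the cutoff is computed from ncol(df), the same call works unchanged on a data frame with many more columns, provided all of them are numeric.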

Upvotes: 1
