Reputation: 11

Filter R data.frame by multiple columns

I have an R data frame that contains many observations and looks like:

df <- data.frame(obs1=c(7.1,8.3,9.8), 
                 obs2=c(5.2,8.8,4.1), 
                 obs3=c(9.6,8.1,7.7), 
                 obs4=c(7.2,8.1,9.4), 
                 obs5=c(NA,5.4,9.0), 
                 hi1=c(9.6,8.8,9.8), 
                 hi2=c(7.2,8.3,9.4))

I simplified, as obs goes out to obs25. hi1 and hi2 contain the highest and next highest values in each row. I need to get all the rows with obs* > x but less than hi1 or hi2. In other words, all the rows that have values above a threshold but were not the 2 highest values. Thanks!

Sorry for not being more clear. For example, if the threshold is set at 8 and the above dataframe is used the result would be rows 2 and 3:

in row 2, obs3 and obs4 are > 8, but less than 2 highest

in row 3, obs5 > 8, but less than 2 highest

Upvotes: 0

Answers (2)

Robert Casey

Reputation: 1581

I want to explore one possibility here, and that's to find columns that comparatively are less than the two highest columns, each by their respective row element, and less than a threshold which is also a vector to compare element by element. The below example adds a few observations so we can actually see some results come through:

df <- data.frame(obs1=c(7.1,8.3,9.8), 
               obs2=c(5.2,8.8,4.1), 
               obs3=c(9.6,8.1,7.7), 
               obs4=c(7.2,8.1,9.4), 
               obs5=c(NA,5.4,9.0),
               obs6=c(6.6,7.3,8.8),
               obs7=c(1.1,6.7,9.0),
               obs8=c(8.8,8.4,9.6),
               obs9=c(6.0,7.8,8.3),
               hi1=c(9.6,8.8,9.8), 
               hi2=c(7.2,8.3,9.4))

x is our threshold, which is a vector

x <- c(5.0,5.0,5.0)

now we apply on each column a comparison to the lowest of the two hi columns, combined with a comparison to the threshold. The logical vector is then run as a product, so that a 1 is only reported if all elements are TRUE. e is a logical vector of the columns we want to show.

e <- as.logical(sapply(df, function(y) prod(ifelse(y < df$hi2 & y > x,TRUE,FALSE))>0))

subset our df by column

dfnew <- df[,which(e)]

So if I look at the end result:

dfnew
  obs6 obs9
1  6.6  6.0
2  7.3  7.8
3  8.8  8.3

Upvotes: 0

AidanGawronski

Reputation: 2085

Note that there are no rows meeting the criteria you seem to describe (in this example):

df <- data.frame(obs1=c(7.1,8.3,9.8), 
                 obs2=c(5.2,8.8,4.1), 
                 obs3=c(9.6,8.1,7.7), 
                 obs4=c(7.2,8.1,9.4), 
                 obs5=c(NA,5.4,9.0), 
                 hi1=c(9.6,8.8,9.8), 
                 hi2=c(7.2,8.3,9.4))

x <- 5

#rows which have a min value greater than x
df[which(apply(df[,-c(6:7)], 1, min) > x,),]

#rows which have a max value less than h2
df[which(apply(df[,-c(6:7)], 1, max) < df$h12,),]

#rows which have both
df[intersect(which(apply(df[,-c(6:7)], 1, min) > x,), which(apply(df[,-c(6:7)], 1, max) < df$h12,)),]

Upvotes: 1

Filter R data.frame by multiple columns

Answers (2)

Related Questions