SorenK
SorenK

Reputation: 155

R - select columns where at least x rows are above a certain number

Sample dataframe:

library(tibble)
df <- tribble(
  ~x, ~y,  ~z,  ~w,  ~v, 
  "a", 2,  2,   4,    3,  
  "b", 1,  3,   4,    3,
  "c", 2,  3,   1,    2,
  "d", 3,  4,   4,    2,
  "e", 2,  2,   4,    1,
  "f", 4,  1,   1,    3,
  "g", 1,  1,   1,    3,
  "h", 2,  3,   4,    2,
  "i", 3,  3,   4,    2,
  "j", 3,  3,   4,    2
)

The issue:

The above sample dataframe is but a small version of a much larger dataset I'm working with.

Here I need to identify all the columns where at least 50% (and at a later date, at least 70%) of the rows are above a certain number.

For example, in the above, I would like to identify all the columns that have at least 50% of the rows being 3 or 4 (the answer here is columns "z" and "w"). Alternatively, I would like to identify all columns that have at least 70% of the rows being 4 (the answer is then column "w" only).

For the first part (looking at 50% or more) I can make use of the median result in summary (as shown below), but I am unable to use it for identifying the subsequent question (of columns with at least 70% of rows being 4)...and it really isn't that pretty or specific when considering 200 columns!

Any and all help much appreciated, thank you!

using summary() to identify 50% or more with 3 or 4 (in an inelegant way)

> summary(df)
      x                   y             z             w              v      
 Length:10          Min.   :1.0   Min.   :1.0   Min.   :1.00   Min.   :1.0  
 Class :character   1st Qu.:2.0   1st Qu.:2.0   1st Qu.:1.75   1st Qu.:2.0  
 Mode  :character   Median :2.0   Median :3.0   Median :4.00   Median :2.0  
                    Mean   :2.3   Mean   :2.5   Mean   :3.10   Mean   :2.3  
                    3rd Qu.:3.0   3rd Qu.:3.0   3rd Qu.:4.00   3rd Qu.:3.0  
                    Max.   :4.0   Max.   :4.0   Max.   :4.00   Max.   :3.0  

Upvotes: 0

Views: 356

Answers (1)

s_baldur
s_baldur

Reputation: 33488

To find if more than 50% of the column's elements are at least 3 you could use sapply() selecting the relevant columns to check with [:

sapply(df[c('y', 'z', 'w', 'v')], function(x) mean(x >= 3) > 0.5)

For this case you could also specify the columns you want to exclude:

sapply(df[-1], function(x) mean(x >= 3) > 0.5)

To drop the columns that don't meet the requirement you could reverse the comparison and extract the relevant names:

to_drop <- names(which(sapply(df[-1], function(x) mean(x >= 3) <= 0.5)))
df[to_drop] <- NULL

Upvotes: 1

Related Questions