Reputation: 155
Sample dataframe:
library(tibble)
df <- tribble(
~x, ~y, ~z, ~w, ~v,
"a", 2, 2, 4, 3,
"b", 1, 3, 4, 3,
"c", 2, 3, 1, 2,
"d", 3, 4, 4, 2,
"e", 2, 2, 4, 1,
"f", 4, 1, 1, 3,
"g", 1, 1, 1, 3,
"h", 2, 3, 4, 2,
"i", 3, 3, 4, 2,
"j", 3, 3, 4, 2
)
The issue:
The above sample dataframe is but a small version of a much larger dataset I'm working with.
Here I need to identify all the columns where at least 50% (and at a later date, at least 70%) of the rows are above a certain number.
For example, in the above, I would like to identify all the columns that have at least 50% of the rows being 3 or 4 (the answer here is columns "z"
and "w"
).
Alternatively, I would like to identify all columns that have at least 70% of the rows being 4 (the answer is then column "w"
only).
For the first part (looking at 50% or more) I can make use of the median result in summary (as shown below), but I am unable to use it for identifying the subsequent question (of columns with at least 70% of rows being 4)...and it really isn't that pretty or specific when considering 200 columns!
Any and all help much appreciated, thank you!
using summary() to identify 50% or more with 3 or 4 (in an inelegant way)
> summary(df)
x y z w v
Length:10 Min. :1.0 Min. :1.0 Min. :1.00 Min. :1.0
Class :character 1st Qu.:2.0 1st Qu.:2.0 1st Qu.:1.75 1st Qu.:2.0
Mode :character Median :2.0 Median :3.0 Median :4.00 Median :2.0
Mean :2.3 Mean :2.5 Mean :3.10 Mean :2.3
3rd Qu.:3.0 3rd Qu.:3.0 3rd Qu.:4.00 3rd Qu.:3.0
Max. :4.0 Max. :4.0 Max. :4.00 Max. :3.0
Upvotes: 0
Views: 356
Reputation: 33488
To find if more than 50% of the column's elements are at least 3 you could use sapply()
selecting the relevant columns to check with [
:
sapply(df[c('y', 'z', 'w', 'v')], function(x) mean(x >= 3) > 0.5)
For this case you could also specify the columns you want to exclude:
sapply(df[-1], function(x) mean(x >= 3) > 0.5)
To drop the columns that don't meet the requirement you could reverse the comparison and extract the relevant names:
to_drop <- names(which(sapply(df[-1], function(x) mean(x >= 3) <= 0.5)))
df[to_drop] <- NULL
Upvotes: 1