Andrei Bieger
Andrei Bieger

Reputation: 35

Filtering out columns from an R data.frame based on the sum of its values for all rows

I have a data.frame df where observations are cities and each column describes the amount of a certain pesticide used in that city (around 300 of them). The data.frame looks like this:

 head(df)
# A tibble: 6 x 11
  Benzovindiflupir Beta_ciflutrina Beta_Cipermetrina Bicarbonato_de_potássio Bifentrina Bispiribaque_sódi~ Bixafem Boscalida
             <dbl>           <dbl>             <dbl>                   <dbl>      <dbl>              <dbl>   <dbl>     <dbl>
1            0.354           1.10           0.0219                         0      4.99             0.0155  0.0686        0  
2           11.3             2.43           0.130                          0      8.84             0       2.08         22.2
3            0.329           0.267          0                              0      1.86             0.00746 0.00583       0  
4            4.56            2.71           0                              0      7.39             0       1.04          0  
5            0.925           0.512          0                              0      0.612            0       0.119         0  
6            0.388           0.149          0.000641                       0      1.61             0.143   0.0777        0  

I would like to remove all pesticides (columns) whose sum of that pesticide use in all cities is below a threshold. I called this threshold pestCutoff.

pestCutoff <- 10000

So I want to create a new data.frame called subDf with all rows and only the columns whose sum of all rows is >= cutoff, but I haven`t found a way of filtering/subsetting/selecting columns based on a logical rule.

Upvotes: 2

Views: 1615

Answers (3)

Yusuf Nezih
Yusuf Nezih

Reputation: 1

we may also use select_if with sum :

library(dplyr)

subDf <- df %>% select_if(~sum(.) >= pestCutoff)

Upvotes: 0

TarJae
TarJae

Reputation: 79204

We could also use which(colSums)

library(dplyr)
subDf <- df %>% select(which(colSums(.) >= pestCutoff)) 

data:

df <- tribble(
~Benzovindiflupir, ~Beta_ciflutrina, ~Beta_Cipermetrina, ~Bicarbonato_de_potássio, ~Bifentrina, ~Bispiribaque_sódi, ~Bixafem, ~Boscalida,
0.354, 1.10, 0.0219, 0, 4.99, 0.0155, 0.0686, 0, 
11.3, 2.43, 0.130, 0, 8.84, 0, 2.08, 22.2,
0.329, 0.267, 0, 0, 1.86, 0.00746, 0.00583, 0, 
4.56, 2.71, 0, 0, 7.39, 0, 1.04, 0, 
0.925, 0.512, 0, 0, 0.612, 0, 0.119, 0,
0.388, 0.149, 0.000641, 0, 1.61, 0.143, 0.0777, 0)

Upvotes: 1

akrun
akrun

Reputation: 887851

We can use select with a condition on the sum i.e. if the sum of that column greater than threshold, then select it

library(dplyr)
subDf <- df %>%
    select(where( ~ sum(.) >= pestCutoff))

NOTE: Here we assume that the condition should be applied to all columns and all of them are numeric. If there are other columns types and do not want to select, create a short circuit

subDf <- df %>%
          select(where(~ is.numeric(.) && sum(.) >= pestCutoff))

Or in base R with Filter

Filter(function(x) sum(x) >= pestCutoff, df)

Or with colSums

df[colSums(df) >= pestCutoff]

Upvotes: 1

Related Questions