Reputation: 611

How do I get the bottom 10% of values in a dataframe column?

I have a dataframe (allDat) that looks like the following (but with more rows), and I'm trying to subset it to get the individuals (samples) with the bottom 10% of Expression:

SampleID    Expression   Gene
HSB496      14.64295     ENSG00000118271
HSB261      14.3346      ENSG00000144820
HSB248      13.48286     ENSG00000167552

Here is what I have tried, but I feel like this is wrong or that there may be a better approach at least:

allDat_10 <- subset(allDat, Expression > quantile(Expression, prob = 10/100, na.rm = TRUE))

Upvotes: 1

Answers (2)

Jiaxiang

Reputation: 883

Use dplyr's top_n() function. Here is an example that does similar work on the diamonds dataset; a negative n selects the rows with the lowest values of depth.

library(tidyverse)
diamonds %>%
    top_n(n = -0.1 * nrow(.), wt = depth)
#> # A tibble: 5,625 x 10
#>    carat cut       color clarity depth table price     x     y     z
#>    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#>  2  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#>  3  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
#>  4  0.31 Very Good J     SI1      59.4    62   353  4.39  4.43  2.62
#>  5  0.31 Very Good J     SI1      58.1    62   353  4.44  4.47  2.59
#>  6  0.23 Very Good F     VS1      60      57   402  4     4.03  2.41
#>  7  0.23 Very Good F     VS1      59.8    57   402  4.04  4.06  2.42
#>  8  0.23 Very Good E     VS1      59.5    58   402  4.01  4.06  2.4 
#>  9  0.23 Good      F     VS1      58.2    59   402  4.06  4.08  2.37
#> 10  0.26 Good      D     VS1      58.4    63   403  4.19  4.24  2.46
#> # ... with 5,615 more rows

Created on 2018-11-02 by the reprex package (v0.2.0).

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.1 (2018-07-02)
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  zh_CN.UTF-8                 
#>  tz       Asia/Shanghai               
#>  date     2018-11-02
#> Packages -----------------------------------------------------------------
#>  package    * version date       source         
#>  assertthat   0.2.0   2017-04-11 CRAN (R 3.5.0) 
#>  backports    1.1.2   2017-12-13 CRAN (R 3.5.0) 
#>  base       * 3.5.1   2018-07-05 local          
#>  bindr        0.1.1   2018-03-13 CRAN (R 3.5.0) 
#>  bindrcpp   * 0.2.2   2018-03-29 CRAN (R 3.5.0) 
#>  broom        0.5.0   2018-07-17 CRAN (R 3.5.0) 
#>  cellranger   1.1.0   2016-07-27 CRAN (R 3.5.0) 
#>  cli          1.0.0   2017-11-05 CRAN (R 3.5.0) 
#>  colorspace   1.3-2   2016-12-14 CRAN (R 3.5.0) 
#>  compiler     3.5.1   2018-07-05 local          
#>  crayon       1.3.4   2017-09-16 CRAN (R 3.5.0) 
#>  datasets   * 3.5.1   2018-07-05 local          
#>  devtools     1.13.6  2018-06-27 CRAN (R 3.5.0) 
#>  digest       0.6.16  2018-08-22 cran (@0.6.16) 
#>  dplyr      * 0.7.6   2018-06-29 CRAN (R 3.5.1) 
#>  evaluate     0.11    2018-07-17 CRAN (R 3.5.0) 
#>  fansi        0.2.3   2018-05-06 CRAN (R 3.5.0) 
#>  forcats    * 0.3.0   2018-02-19 CRAN (R 3.5.0) 
#>  ggplot2    * 3.0.0   2018-07-03 CRAN (R 3.5.0) 
#>  glue         1.3.0   2018-07-17 CRAN (R 3.5.0) 
#>  graphics   * 3.5.1   2018-07-05 local          
#>  grDevices  * 3.5.1   2018-07-05 local          
#>  grid         3.5.1   2018-07-05 local          
#>  gtable       0.2.0   2016-02-26 CRAN (R 3.5.0) 
#>  haven        1.1.2   2018-06-27 CRAN (R 3.5.0) 
#>  hms          0.4.2   2018-03-10 CRAN (R 3.5.0) 
#>  htmltools    0.3.6   2017-04-28 CRAN (R 3.5.0) 
#>  httr         1.3.1   2017-08-20 CRAN (R 3.5.0) 
#>  jsonlite     1.5     2017-06-01 CRAN (R 3.5.0) 
#>  knitr        1.20    2018-02-20 CRAN (R 3.5.0) 
#>  lattice      0.20-35 2017-03-25 CRAN (R 3.5.1) 
#>  lazyeval     0.2.1   2017-10-29 CRAN (R 3.5.0) 
#>  lubridate    1.7.4   2018-04-11 CRAN (R 3.5.0) 
#>  magrittr     1.5     2014-11-22 CRAN (R 3.5.0) 
#>  memoise      1.1.0   2017-04-21 CRAN (R 3.5.0) 
#>  methods    * 3.5.1   2018-07-05 local          
#>  modelr       0.1.2   2018-05-11 CRAN (R 3.5.0) 
#>  munsell      0.5.0   2018-06-12 CRAN (R 3.5.0) 
#>  nlme         3.1-137 2018-04-07 CRAN (R 3.5.1) 
#>  pillar       1.3.0   2018-07-14 CRAN (R 3.5.0) 
#>  pkgconfig    2.0.1   2017-03-21 CRAN (R 3.5.0) 
#>  plyr         1.8.4   2016-06-08 CRAN (R 3.5.0) 
#>  purrr      * 0.2.5   2018-05-29 CRAN (R 3.5.0) 
#>  R6           2.3.0   2018-10-04 cran (@2.3.0)  
#>  Rcpp         0.12.19 2018-10-01 cran (@0.12.19)
#>  readr      * 1.1.1   2017-05-16 CRAN (R 3.5.0) 
#>  readxl       1.1.0   2018-04-20 CRAN (R 3.5.0) 
#>  rlang        0.2.2   2018-08-16 cran (@0.2.2)  
#>  rmarkdown    1.10    2018-06-11 CRAN (R 3.5.0) 
#>  rprojroot    1.3-2   2018-01-03 CRAN (R 3.5.0) 
#>  rvest        0.3.2   2016-06-17 CRAN (R 3.5.0) 
#>  scales       1.0.0   2018-08-09 CRAN (R 3.5.0) 
#>  stats      * 3.5.1   2018-07-05 local          
#>  stringi      1.2.4   2018-07-20 CRAN (R 3.5.0) 
#>  stringr    * 1.3.1   2018-05-10 CRAN (R 3.5.0) 
#>  tibble     * 1.4.2   2018-01-22 CRAN (R 3.5.0) 
#>  tidyr      * 0.8.1   2018-05-18 CRAN (R 3.5.0) 
#>  tidyselect   0.2.5   2018-10-11 cran (@0.2.5)  
#>  tidyverse  * 1.2.1   2017-11-14 CRAN (R 3.5.0) 
#>  tools        3.5.1   2018-07-05 local          
#>  utf8         1.1.4   2018-05-24 CRAN (R 3.5.0) 
#>  utils      * 3.5.1   2018-07-05 local          
#>  withr        2.1.2   2018-03-15 CRAN (R 3.5.0) 
#>  xml2         1.2.0   2018-01-24 CRAN (R 3.5.0) 
#>  yaml         2.2.0   2018-07-25 CRAN (R 3.5.0)
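
On dplyr 1.0.0 and later (newer than the 0.7.6 shown in the session info above), top_n() is superseded by slice_min() and slice_max(), which accept a proportion directly. The following is a minimal sketch under that assumption; the toy data frame and its values are hypothetical and only mimic the column names from the question.

```r
library(dplyr)

# Hypothetical toy data: 20 samples with distinct Expression values
toy <- data.frame(
  SampleID   = sprintf("S%02d", 1:20),
  Expression = c(12, 5, 19, 1, 8, 15, 3, 20, 10, 7,
                 14, 2, 17, 9, 4, 18, 6, 13, 11, 16)
)

# Keep the 10% of rows with the smallest Expression.
# with_ties = TRUE (the default) may return extra rows if values tie at the cutoff.
bottom <- toy %>% slice_min(Expression, prop = 0.1)
```

With 20 rows and no ties, this keeps the two rows with the lowest Expression. Setting with_ties = FALSE would cap the result at that count even when values are tied.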

Upvotes: 0

Julius Vainora

Reputation: 48211

Using your approach (I fixed the sign and replaced expr with Expression)

subset(allDat, Expression < quantile(Expression, prob = 0.1, na.rm = TRUE))

may be fine; it depends on what exactly you mean by 10% of values. If you had 100 rows, do you want the result to contain 10 rows? If so, then perhaps you actually want

subset(allDat, Expression %in% sort(Expression)[seq_len(round(0.1 * length(Expression)))])

Those two approaches are not the same. The latter returns roughly 10% of all the rows (possibly more when values are tied at the cutoff), while the former may even return an empty data frame! For instance,

allDat <- allDat[c(1, 2, rep(3, 10)), ]
subset(allDat, Expression < quantile(Expression, prob = 0.1, na.rm = TRUE))
# [1] SampleID   Expression Gene      
# <0 rows> (or 0-length row.names)

Now if you replaced < by <=, the result would contain 10 rows, while allDat itself has 12 rows.

So, use quantile if you are thinking about the theoretical distribution of Expression and have enough data (to approximate it properly), and use sort if you want a fixed number of rows.
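
To make the difference concrete, here is a small self-contained sketch in base R. The toy data frame is hypothetical (it is not the asker's data), and it has ties at the low end on purpose. Picking rows with order() and head() is one possible way to get an exact row count and is an assumption here, not the only option.

```r
# Hypothetical toy data: 20 samples, five of them tied at the minimum
toy <- data.frame(
  SampleID   = sprintf("S%02d", 1:20),
  Expression = c(rep(1, 5), 6:20)
)

# Quantile-based cutoff (default type 7 interpolation)
cutoff <- quantile(toy$Expression, probs = 0.1, na.rm = TRUE)

# Strict comparison: nothing lies below the cutoff when the minimum is tied
by_quantile_lt <- toy[!is.na(toy$Expression) & toy$Expression < cutoff, ]

# Non-strict comparison: every tied row is returned
by_quantile_le <- toy[!is.na(toy$Expression) & toy$Expression <= cutoff, ]

# Fixed row count: sort by Expression and keep the first 10% of rows
# (rows tied at the cutoff are kept in their original row order)
n_keep   <- round(0.1 * nrow(toy))
by_count <- head(toy[order(toy$Expression), ], n_keep)
```

Here the strict quantile filter returns no rows, the non-strict one returns all five tied rows, and the count-based version returns exactly two, which illustrates why the choice depends on what "bottom 10%" should mean for your data.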

Upvotes: 3
