Reputation: 611
I have a data frame (allDat) that looks like the following (but with more rows), and I'm trying to subset it to get the individuals (samples) with the bottom 10% of Expression:
SampleID Expression Gene
HSB496 14.64295 ENSG00000118271
HSB261 14.3346 ENSG00000144820
HSB248 13.48286 ENSG00000167552
Here is what I have tried, but I feel like this is wrong or that there may be a better approach at least:
allDat_10 <- subset(allDat, Expression > quantile(Expression, prob = 10/100, na.rm = TRUE))
Upvotes: 1
Views: 2555
Reputation: 883
Use a dplyr function. Here is an example that does similar work on the diamonds dataset.
library(tidyverse)
diamonds %>%
  top_n(depth, n = -0.1 * nrow(.))
#> # A tibble: 5,625 x 10
#> carat cut color clarity depth table price x y z
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#> 2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 3 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
#> 4 0.31 Very Good J SI1 59.4 62 353 4.39 4.43 2.62
#> 5 0.31 Very Good J SI1 58.1 62 353 4.44 4.47 2.59
#> 6 0.23 Very Good F VS1 60 57 402 4 4.03 2.41
#> 7 0.23 Very Good F VS1 59.8 57 402 4.04 4.06 2.42
#> 8 0.23 Very Good E VS1 59.5 58 402 4.01 4.06 2.4
#> 9 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
#> 10 0.26 Good D VS1 58.4 63 403 4.19 4.24 2.46
#> # ... with 5,615 more rows
Created on 2018-11-02 by the reprex package (v0.2.0).
devtools::session_info()
#> Session info -------------------------------------------------------------
#> setting value
#> version R version 3.5.1 (2018-07-02)
#> system x86_64, darwin15.6.0
#> ui X11
#> language (EN)
#> collate zh_CN.UTF-8
#> tz Asia/Shanghai
#> date 2018-11-02
#> Packages -----------------------------------------------------------------
#> package * version date source
#> assertthat 0.2.0 2017-04-11 CRAN (R 3.5.0)
#> backports 1.1.2 2017-12-13 CRAN (R 3.5.0)
#> base * 3.5.1 2018-07-05 local
#> bindr 0.1.1 2018-03-13 CRAN (R 3.5.0)
#> bindrcpp * 0.2.2 2018-03-29 CRAN (R 3.5.0)
#> broom 0.5.0 2018-07-17 CRAN (R 3.5.0)
#> cellranger 1.1.0 2016-07-27 CRAN (R 3.5.0)
#> cli 1.0.0 2017-11-05 CRAN (R 3.5.0)
#> colorspace 1.3-2 2016-12-14 CRAN (R 3.5.0)
#> compiler 3.5.1 2018-07-05 local
#> crayon 1.3.4 2017-09-16 CRAN (R 3.5.0)
#> datasets * 3.5.1 2018-07-05 local
#> devtools 1.13.6 2018-06-27 CRAN (R 3.5.0)
#> digest 0.6.16 2018-08-22 cran (@0.6.16)
#> dplyr * 0.7.6 2018-06-29 CRAN (R 3.5.1)
#> evaluate 0.11 2018-07-17 CRAN (R 3.5.0)
#> fansi 0.2.3 2018-05-06 CRAN (R 3.5.0)
#> forcats * 0.3.0 2018-02-19 CRAN (R 3.5.0)
#> ggplot2 * 3.0.0 2018-07-03 CRAN (R 3.5.0)
#> glue 1.3.0 2018-07-17 CRAN (R 3.5.0)
#> graphics * 3.5.1 2018-07-05 local
#> grDevices * 3.5.1 2018-07-05 local
#> grid 3.5.1 2018-07-05 local
#> gtable 0.2.0 2016-02-26 CRAN (R 3.5.0)
#> haven 1.1.2 2018-06-27 CRAN (R 3.5.0)
#> hms 0.4.2 2018-03-10 CRAN (R 3.5.0)
#> htmltools 0.3.6 2017-04-28 CRAN (R 3.5.0)
#> httr 1.3.1 2017-08-20 CRAN (R 3.5.0)
#> jsonlite 1.5 2017-06-01 CRAN (R 3.5.0)
#> knitr 1.20 2018-02-20 CRAN (R 3.5.0)
#> lattice 0.20-35 2017-03-25 CRAN (R 3.5.1)
#> lazyeval 0.2.1 2017-10-29 CRAN (R 3.5.0)
#> lubridate 1.7.4 2018-04-11 CRAN (R 3.5.0)
#> magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
#> memoise 1.1.0 2017-04-21 CRAN (R 3.5.0)
#> methods * 3.5.1 2018-07-05 local
#> modelr 0.1.2 2018-05-11 CRAN (R 3.5.0)
#> munsell 0.5.0 2018-06-12 CRAN (R 3.5.0)
#> nlme 3.1-137 2018-04-07 CRAN (R 3.5.1)
#> pillar 1.3.0 2018-07-14 CRAN (R 3.5.0)
#> pkgconfig 2.0.1 2017-03-21 CRAN (R 3.5.0)
#> plyr 1.8.4 2016-06-08 CRAN (R 3.5.0)
#> purrr * 0.2.5 2018-05-29 CRAN (R 3.5.0)
#> R6 2.3.0 2018-10-04 cran (@2.3.0)
#> Rcpp 0.12.19 2018-10-01 cran (@0.12.19)
#> readr * 1.1.1 2017-05-16 CRAN (R 3.5.0)
#> readxl 1.1.0 2018-04-20 CRAN (R 3.5.0)
#> rlang 0.2.2 2018-08-16 cran (@0.2.2)
#> rmarkdown 1.10 2018-06-11 CRAN (R 3.5.0)
#> rprojroot 1.3-2 2018-01-03 CRAN (R 3.5.0)
#> rvest 0.3.2 2016-06-17 CRAN (R 3.5.0)
#> scales 1.0.0 2018-08-09 CRAN (R 3.5.0)
#> stats * 3.5.1 2018-07-05 local
#> stringi 1.2.4 2018-07-20 CRAN (R 3.5.0)
#> stringr * 1.3.1 2018-05-10 CRAN (R 3.5.0)
#> tibble * 1.4.2 2018-01-22 CRAN (R 3.5.0)
#> tidyr * 0.8.1 2018-05-18 CRAN (R 3.5.0)
#> tidyselect 0.2.5 2018-10-11 cran (@0.2.5)
#> tidyverse * 1.2.1 2017-11-14 CRAN (R 3.5.0)
#> tools 3.5.1 2018-07-05 local
#> utf8 1.1.4 2018-05-24 CRAN (R 3.5.0)
#> utils * 3.5.1 2018-07-05 local
#> withr 2.1.2 2018-03-15 CRAN (R 3.5.0)
#> xml2 1.2.0 2018-01-24 CRAN (R 3.5.0)
#> yaml 2.2.0 2018-07-25 CRAN (R 3.5.0)
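The session info above shows dplyr 0.7.6. In dplyr 1.0.0 and later, top_n() is documented as superseded by slice_min() and slice_max(), so on a newer installation the same idea might look like the sketch below. The toy data frame and its values are invented for illustration; only the column names follow the question.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration): 20 rows,
# with a tie at the second-smallest Expression value.
toy <- data.frame(
  SampleID   = sprintf("S%02d", 1:20),
  Expression = c(5, 6, 6, seq(7, 23))
)

# Bottom 10% by Expression. prop = 0.1 asks for 10% of the rows;
# with_ties = TRUE (the default) also keeps rows tied with the last one kept.
bottom_ties <- toy %>% slice_min(Expression, prop = 0.1)

# with_ties = FALSE caps the result at the requested number of rows.
bottom_exact <- toy %>% slice_min(Expression, prop = 0.1, with_ties = FALSE)

nrow(bottom_ties)
nrow(bottom_exact)
```

Whether tied rows should be kept depends on what "bottom 10%" is supposed to mean for the analysis, so the with_ties setting is worth choosing deliberately.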
Upvotes: 0
Reputation: 48211
Using (I fixed the sign and replaced expr with Expression)
subset(allDat, Expression < quantile(Expression, prob = 0.1, na.rm = TRUE))
may be fine; it depends on what exactly you mean by 10% of values. If you had 100 rows, do you want the result to contain 10 rows? If so, then perhaps you actually want
subset(allDat, Expression %in% sort(Expression)[1:round(0.1 * length(Expression))])
Those two approaches are not the same. The latter returns approximately 10% of all the rows, while the first one may even return an empty data frame! For instance,
allDat <- allDat[c(1, 2, rep(3, 10)), ]
subset(allDat, Expression < quantile(Expression, prob = 0.1, na.rm = TRUE))
# [1] SampleID Expression Gene
# <0 rows> (or 0-length row.names)
Now if you replaced < by <=, the result would contain 10 rows, while allDat itself has 12 rows.
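To try the example above without the original data, here is a minimal base R sketch that rebuilds allDat from the three rows shown in the question and counts the rows each comparison keeps. The variable names cutoff, n_strict and n_loose are introduced here for illustration.

```r
# The three rows shown in the question
allDat <- data.frame(
  SampleID   = c("HSB496", "HSB261", "HSB248"),
  Expression = c(14.64295, 14.3346, 13.48286),
  Gene       = c("ENSG00000118271", "ENSG00000144820", "ENSG00000167552"),
  stringsAsFactors = FALSE
)

# Repeat the third row 10 times, as in the answer: 12 rows in total
allDat <- allDat[c(1, 2, rep(3, 10)), ]

# 10% quantile of Expression
cutoff <- quantile(allDat$Expression, probs = 0.1, na.rm = TRUE)

n_strict <- nrow(subset(allDat, Expression <  cutoff))  # strict comparison
n_loose  <- nrow(subset(allDat, Expression <= cutoff))  # non-strict comparison

c(strict = n_strict, loose = n_loose)
```

Because ten of the twelve values are tied at the minimum, the 10% quantile equals that minimum, which is why the strict comparison keeps nothing and the non-strict one keeps all ten tied rows.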
So, use quantile if you are thinking about the theoretical distribution of Expression and have enough data (to approximate it properly), and use sort if you want a fixed number of rows.
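If an exact row count matters, note that the %in% filter keeps every row tied with a selected value, so it can return more rows than requested when there are ties. One possible alternative, sketched here with made-up data, is to index by order(). Using seq_len(k) instead of 1:k also covers the case where the rounded count is 0, since 1:0 counts down and would still select a row.

```r
# Hypothetical data (invented for illustration): 12 rows, 10 tied at the minimum
toy <- data.frame(
  SampleID   = paste0("S", 1:12),
  Expression = c(3, 2, rep(1, 10))
)

k <- round(0.1 * nrow(toy))  # number of rows requested

# %in% keeps all rows whose value matches one of the k smallest values
by_value <- subset(toy, Expression %in% sort(Expression)[seq_len(k)])

# order() returns row positions, so exactly k rows are kept
by_position <- toy[order(toy$Expression)[seq_len(k)], ]

nrow(by_value)
nrow(by_position)
```

Which rows survive among the tied ones in by_position depends on their order in the data, so this is only appropriate when any of the tied samples will do.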
Upvotes: 3