I'm facing some challenges with a large number of data frames that have variable lengths and irregular column names. The goal is to remove unwanted rows (here, rows for samples sequenced by whole genome shotgun sequencing) that match any of several keywords; a single keyword would be too easy. To find those rows I'm using filter_all(any_vars(str_detect(., "WGS"))). However, negating that with negate = TRUE or !str_detect() returns the whole data frame, and nothing seems to work. Using all_vars() instead removes every row in the df.
I came up with a workaround, but I find it quite heavy and I'm fairly sure there is a better way to do this:
> tmp <- metadata[["PRJNA237362"]]
> no <- tmp %>% filter_all(any_vars(str_detect(., "WGS")))
> final <- tmp[tmp$Run %notin% no$Run,]
I'm not very familiar with the tidyverse and still have a lot to learn, so I might have missed something here.
I don't understand why filter() returns the whole df when the expression is negated.
A reproducible example of what I'm dealing with:
> library(tidyverse)
> `%notin%` <- Negate(`%in%`)
> data(msleep)
> msleep %>% filter_all(any_vars(str_detect(., "omni"))) %>% glimpse()
> msleep %>% filter_all(any_vars(str_detect(., "omni", negate = TRUE))) %>% glimpse()
> no <- msleep %>% filter_all(any_vars(str_detect(., "omni"))) %>% glimpse()
> yes <- msleep[msleep$vore %notin% no$vore,] %>% glimpse()
Here is part of the df I'm actually working on:
> df = structure(list(Run = c("ERR2804817", "ERR2804818", "ERR2804819",
"ERR2804820", "ERR2804821", "ERR2834367", "ERR2834371", "ERR2834373",
"ERR2834374", "ERR2834375", "ERR2834376", "ERR2834377", "ERR2834379",
"ERR2828323", "ERR2828326", "ERR2828327", "ERR2828328", "ERR2828330"
), LibraryLayout = c("PAIRED", "PAIRED", "PAIRED", "PAIRED",
), Library.Name = c("Bangladeshi_2yr", "Bangladeshi_2yr", "Bangladeshi_2yr",
"Bangladeshi_2yr", "Bangladeshi_2yr", "table S7A,B; WGS", "table S7A,B; WGS",
"table S7A,B; WGS", "table S7A,B; WGS", "table S7A,B; WGS", "table S7A,B; WGS",
"table S7A,B; WGS", "table S7A,B; WGS", "table S12", "table S12",
"table S12", "table S12", "table S12"), LibrarySource = c("METAGENOMIC",
"Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq",
"Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq",
"Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq",
"NextSeq 500", "NextSeq 500", "NextSeq 500", "NextSeq 500", "NextSeq 500"
)), row.names = c(1L, 2L, 3L, 4L, 5L, 73L, 74L, 75L, 76L, 77L,
78L, 79L, 80L, 806L, 807L, 808L, 809L, 810L), class = "data.frame")
> #Here is what I have for now
> `%notin%` = Negate(`%in%`)
> tmp = meta %>% filter_all(any_vars(str_detect(., "shotgun|WGS|whole genome|all_genome|WXS|WholeGenomeShotgun|Whole genome shotgun|Metatranscriptomic|WXS")))
> meta= meta[meta$Run%notin%tmp$Run,]
Ultimately, I would like to write something like this:
> tmp = meta %>% filter_all(any_vars(!str_detect(., "shotgun|WGS|whole genome|all_genome|WXS|WholeGenomeShotgun|Whole genome shotgun|Metatranscriptomic|WXS")))
> #OR this version
> tmp = meta %>% filter_all(any_vars(str_detect(., "shotgun|WGS|whole genome|all_genome|WXS|WholeGenomeShotgun|Whole genome shotgun|Metatranscriptomic|WXS", negate=T)))
The difficulty is that I can't predict either the column names or the dimensions of my data frames, so I wrote for() loops with a condition that detects the patterns, removes the matching rows and writes a new file with the cleaned df.
For now my code works, but I'm sure there is a better way to do it.
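One possible shape for this, sketched under assumptions rather than taken from the real data: a small helper that only looks at character columns (so the column names do not need to be known in advance) and is applied to every element of a list with lapply(). The names drop_keyword_rows and metadata_demo and the shortened pattern are hypothetical, and NA is deliberately treated as "no match".

```r
library(dplyr)
library(stringr)

# Hypothetical helper: drop every row in which any character column matches `pattern`.
# coalesce(..., FALSE) turns the NA that str_detect() returns for missing values into FALSE.
drop_keyword_rows <- function(df, pattern) {
  df %>%
    filter(!if_any(where(is.character),
                   ~ coalesce(str_detect(.x, pattern), FALSE)))
}

# Stand-in for the real `metadata` list: two data frames with different columns.
metadata_demo <- list(
  study_a = data.frame(Run = c("ERR1", "ERR2"),
                       Library.Name = c("table S12", "table S7A,B; WGS"),
                       stringsAsFactors = FALSE),
  study_b = data.frame(Run = c("SRR1", "SRR2", "SRR3"),
                       strategy = c("AMPLICON", "WXS", NA),
                       reads = c(10, 20, 30),
                       stringsAsFactors = FALSE)
)

# Shortened version of the keyword pattern, for illustration only.
pattern <- "shotgun|WGS|WXS"
cleaned <- lapply(metadata_demo, drop_keyword_rows, pattern = pattern)
```

Each element of cleaned keeps its own columns, so the result can be written out file by file without any per-data-frame special casing.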
> packageVersion("tidyverse")
[1] ‘1.3.0’
> packageVersion("dplyr")
[1] ‘1.0.5’
> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rmdformats_1.0.1 ggpubr_0.4.0 forcats_0.5.1 stringr_1.4.0
[5] dplyr_1.0.5 purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
[9] tibble_3.1.0 tidyverse_1.3.0 ade4_1.7-16 factoextra_1.0.7
[13] ggplot2_3.3.3 FactoMineR_2.4
loaded via a namespace (and not attached):
[1] httr_1.4.2 jsonlite_1.7.2 prettydoc_0.4.1
[4] carData_3.0-4 modelr_0.1.8 assertthat_0.2.1
[7] cellranger_1.1.0 yaml_2.2.1 progress_1.2.2
[10] ggrepel_0.9.1 pillar_1.5.1 backports_1.2.1
[13] lattice_0.20-41 glue_1.4.2 digest_0.6.27
[16] ggsignif_0.6.1 rvest_0.3.6 colorspace_2.0-0
[19] cowplot_1.1.1 htmltools_0.5.1.1 pkgconfig_2.0.3
[22] broom_0.7.5 haven_2.3.1 bookdown_0.21
[25] scales_1.1.1 openxlsx_4.2.3 rio_0.5.26
[28] farver_2.1.0 generics_0.1.0 car_3.0-10
[31] ellipsis_0.3.1 DT_0.17 withr_2.4.1
[34] cli_2.3.1 magrittr_2.0.1 crayon_1.4.1
[37] readxl_1.3.1 evaluate_0.14 fs_1.5.0
[40] fansi_0.4.2 MASS_7.3-53.1 rstatix_0.7.0
[43] xml2_1.3.2 foreign_0.8-81 tools_4.0.4
[46] data.table_1.14.0 prettyunits_1.1.1 hms_1.0.0
[49] lifecycle_1.0.0 munsell_0.5.0 reprex_1.0.0
[52] zip_2.1.1 cluster_2.1.1 flashClust_1.01-2
[55] compiler_4.0.4 rlang_0.4.10 grid_4.0.4
[58] rstudioapi_0.13 htmlwidgets_1.5.3 leaps_3.1
[61] labeling_0.4.2 rmarkdown_2.7 gtable_0.3.0
[64] abind_1.4-5 DBI_1.1.1 curl_4.3
[67] R6_2.5.0 lubridate_1.7.10 knitr_1.31
[70] utf8_1.2.1 stringi_1.5.3 Rcpp_1.0.6
[73] vctrs_0.3.6 scatterplot3d_0.3-41 dbplyr_2.1.0
[76] tidyselect_1.1.0 xfun_0.22
Second edit based on updated information:
Another way to approach this is to work row-wise and add a helper column that records whether any character column matches your chosen regex. You can then filter on that column.
If you want to keep the NA values for your final filter, this should work:
pattern <- "omni"
msleep %>%
  rowwise() %>%
  mutate(regex_match = any(str_detect(c_across(where(is.character)),
                                      regex(pattern)), na.rm = FALSE)) %>%
  ungroup()
If you want to exclude the NAs, add a replace_na() step so that undetermined rows count as matches:
msleep %>%
  rowwise() %>%
  mutate(regex_match = any(str_detect(c_across(where(is.character)), regex("omni")), na.rm = FALSE),
         regex_match = replace_na(regex_match, TRUE)) %>%
  ungroup()
So the first version with your metadata:
pattern <- "shotgun|WGS|whole genome|all_genome|WXS|WholeGenomeShotgun|Whole genome shotgun|Metatranscriptomic|WXS"
metadata %>%
  rowwise() %>%
  mutate(regex_match = any(str_detect(c_across(where(is.character)), regex(pattern)), na.rm = FALSE)) %>%
  ungroup()
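To make the final filtering step on such a helper column concrete, here is a minimal sketch on a made-up three-row tibble. The toy data and the column name hit are assumptions for illustration; the two filters are one reading of "keep the NA rows" versus "exclude the NA rows".

```r
library(dplyr)
library(stringr)

# Toy data (hypothetical): row "c" has a missing value and no match.
toy <- tibble(
  name = c("a", "b", "c"),
  vore = c("omni", "herbi", NA)
)

# hit is TRUE (a match), FALSE (no match, nothing missing) or NA (no match, something missing).
flagged <- toy %>%
  rowwise() %>%
  mutate(hit = any(str_detect(c_across(where(is.character)), "omni"))) %>%
  ungroup()

# Keep rows without a confirmed match, including the undetermined (NA) ones.
keep_na_rows <- flagged %>% filter(is.na(hit) | !hit)

# Treat undetermined rows as matches and drop them too.
drop_na_rows <- flagged %>%
  mutate(hit = tidyr::replace_na(hit, TRUE)) %>%
  filter(!hit)
```

Calling ungroup() before filtering is a design choice: it returns an ordinary tibble, so later steps are no longer evaluated one row at a time.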
Edit 1
I think the problem lies in combining a negation with any_vars(): the condition then reads "keep the row if any column does not match". Every row has at least one column that does not contain "omni" or "WGS", so the whole dataframe is returned.
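The difference between "any column does not match" and "not (any column matches)" can be seen on a tiny example. This is only a sketch on made-up data; it uses if_any(), which needs dplyr 1.0.4 or later.

```r
library(dplyr)
library(stringr)

# Toy data (hypothetical): row 2 is the only one mentioning "WGS".
toy <- tibble(
  Run  = c("r1", "r2", "r3"),
  Name = c("table S12", "table S7; WGS", "Bangladeshi_2yr")
)

# "Any column does NOT match": true for every row, because Run never contains "WGS".
any_negated <- toy %>%
  filter(if_any(everything(), ~ !str_detect(.x, "WGS")))

# "NOT (any column matches)": drops only the row that mentions "WGS".
negated_any <- toy %>%
  filter(!if_any(everything(), ~ str_detect(.x, "WGS")))

nrow(any_negated)
nrow(negated_any)
```

Moving the negation outside the "any" is what turns the filter from a no-op into the intended row removal.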
With the latest version of dplyr syntax, you could try the following:
msleep %>% filter(if_all(starts_with("vore"), ~!str_detect(.x, "omni")))
This focuses on just the one column, or
msleep %>% filter(if_all(everything(), ~!str_detect(.x, "omni")))
for the entire dataframe.
Does that get what you need?
The suggestions from @Marcelo Avila and @awaji98 work on my problem. However, I would like to point out a subtlety: rows containing NA seem to be removed by all of the suggestions above:
msleep %>% filter_all(all_vars(str_detect(., "omni", negate = TRUE)))
msleep %>% filter(if_all(everything(), ~!str_detect(.x, "omni")))
msleep %>%
  dplyr::filter(dplyr::if_all(
    .cols = everything(),
    .fns = ~ stringr::str_detect(.x, "omni", negate = TRUE)))
no <- msleep %>% filter(if_any(everything(), ~str_detect(., "omni")))
# A tibble: 20 x 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Owl mon… Aotus omni Prima… NA 17 1.8 NA
2 Greater… Blari… omni Soric… lc 14.9 2.3 0.133
3 Grivet Cerco… omni Prima… lc 10 0.7 NA
4 Star-no… Condy… omni Soric… lc 10.3 2.2 NA
5 African… Crice… omni Roden… NA 8.3 2 NA
6 Lesser … Crypt… omni Soric… lc 9.1 1.4 0.15
7 North A… Didel… omni Didel… lc 18 4.9 0.333
8 Europea… Erina… omni Erina… lc 10.1 3.5 0.283
9 Patas m… Eryth… omni Prima… lc 10.9 1.1 NA
10 Galago Galago omni Prima… NA 9.8 1.1 0.55
11 Human Homo omni Prima… NA 8 1.9 1.5
12 Macaque Macaca omni Prima… NA 10.1 1.2 0.75
13 Chimpan… Pan omni Prima… NA 9.7 1.4 1.42
14 Baboon Papio omni Prima… NA 9.4 1 0.667
15 Potto Perod… omni Prima… lc 11 NA NA
16 African… Rhabd… omni Roden… NA 8.7 NA NA
17 Squirre… Saimi… omni Prima… NA 9.6 1.4 NA
18 Pig Sus omni Artio… domesticated 9.1 2.4 0.5
19 Tenrec Tenrec omni Afros… NA 15.6 2.3 NA
20 Tree sh… Tupaia omni Scand… NA 8.9 2.6 0.233
# … with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>
We find 20 rows that contain the pattern "omni".
no <- msleep %>% filter(if_any(everything(), ~str_detect(., "omni")))
msleep[msleep$vore %notin% no$vore,]
# A tibble: 63 x 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Cheetah Acino… carni Carni… lc 12.1 NA NA
2 Mountai… Aplod… herbi Roden… nt 14.4 2.4 NA
3 Cow Bos herbi Artio… domesticated 4 0.7 0.667
4 Three-t… Brady… herbi Pilosa NA 14.4 2.2 0.767
5 Norther… Callo… carni Carni… vu 8.7 1.4 0.383
6 Vesper … Calom… NA Roden… NA 7 NA NA
7 Dog Canis carni Carni… domesticated 10.1 2.9 0.333
8 Roe deer Capre… herbi Artio… lc 3 NA NA
9 Goat Capri herbi Artio… lc 5.3 0.6 NA
10 Guinea … Cavis herbi Roden… domesticated 9.4 0.8 0.217
# … with 53 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
# bodywt <dbl>
This efficiently removes the 20 rows and returns a 63-row df. However, because of the NAs, the following code (and the other versions above) seems to return the wrong df:
msleep %>%
  filter(if_all(everything(), ~ stringr::str_detect(., "omni", negate = TRUE)))
# A tibble: 15 x 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Cow Bos herbi Artio… domesticated 4 0.7 0.667
2 Dog Canis carni Carni… domesticated 10.1 2.9 0.333
3 Guinea … Cavis herbi Roden… domesticated 9.4 0.8 0.217
4 Chinchi… Chinc… herbi Roden… domesticated 12.5 1.5 0.117
5 Long-no… Dasyp… carni Cingu… lc 17.4 3.1 0.383
6 Big bro… Eptes… inse… Chiro… lc 19.7 3.9 0.117
7 Horse Equus herbi Peris… domesticated 2.9 0.6 1
8 Domesti… Felis carni Carni… domesticated 12.5 3.2 0.417
9 Golden … Mesoc… herbi Roden… en 14.3 3.1 0.2
10 House m… Mus herbi Roden… nt 12.5 1.4 0.183
11 Rabbit Oryct… herbi Lagom… domesticated 8.4 0.9 0.417
12 Laborat… Rattus herbi Roden… lc 13 2.4 0.183
13 Eastern… Scalo… inse… Soric… lc 8.4 2.1 0.167
14 Thirtee… Sperm… herbi Roden… lc 13.8 3.4 0.217
15 Brazili… Tapir… herbi Peris… vu 4.4 1 0.9
# … with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>
There is something weird going on when negating str_detect().
If anyone has any insight on this it would be tremendous, as I fear I will have trouble sleeping tonight.
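A likely explanation, shown here as a sketch on made-up data: str_detect() returns NA for a missing value, negating NA still gives NA, and filter() only keeps rows where the condition is TRUE. One way around it (an assumption about the desired behaviour) is to treat NA as "no match" with coalesce() before negating.

```r
library(dplyr)
library(stringr)

# Toy data (hypothetical): row "c" has a missing value and no "omni" anywhere.
toy <- tibble(
  name = c("a", "b", "c"),
  vore = c("omni", "herbi", NA)
)

# Negation does not turn NA into TRUE: the result is FALSE, TRUE, NA.
negated <- str_detect(toy$vore, "omni", negate = TRUE)

# The NA makes the condition for row "c" undetermined, so filter() drops it.
strict <- toy %>%
  filter(if_all(everything(), ~ str_detect(.x, "omni", negate = TRUE)))

# Treat "cannot tell" (NA) as "no match" before negating, so row "c" survives.
na_safe <- toy %>%
  filter(!if_any(everything(), ~ coalesce(str_detect(.x, "omni"), FALSE)))
```

Here strict keeps only row "b", while na_safe keeps rows "b" and "c", which matches what the %notin% workaround does.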
Since you mentioned multiple keywords, you can pass them all to str_detect() at once by joining them with the regex | (or) operator.
The following lines filter out (via negate = TRUE) all rows where at least one variable matches at least one of the patterns ui|Br|lis|Ch|omni:
keywords_to_remove <- c("ui", "Br", "lis", "Ch", "omni")
keywords_regex <- paste0(keywords_to_remove, collapse = "|")
msleep %>%
  dplyr::filter(dplyr::if_all(
    .cols = everything(),
    .fns = ~ stringr::str_detect(.x, keywords_regex, negate = TRUE)))
#> # A tibble: 9 x 11
#> name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
#> 2 Dog Canis carni Carn… domesticated 10.1 2.9 0.333 13.9
#> 3 Long-… Dasyp… carni Cing… lc 17.4 3.1 0.383 6.6
#> 4 Horse Equus herbi Peri… domesticated 2.9 0.6 1 21.1
#> 5 Golde… Mesoc… herbi Rode… en 14.3 3.1 0.2 9.7
#> 6 House… Mus herbi Rode… nt 12.5 1.4 0.183 11.5
#> 7 Rabbit Oryct… herbi Lago… domesticated 8.4 0.9 0.417 15.6
#> 8 Labor… Rattus herbi Rode… lc 13 2.4 0.183 11
#> 9 Easte… Scalo… inse… Sori… lc 8.4 2.1 0.167 15.6
#> # … with 2 more variables: brainwt <dbl>, bodywt <dbl>
packageVersion("dplyr")
#> [1] '1.0.5'
Created on 2021-03-23 by the reprex package (v1.0.0)
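Since the keyword list in the question contains case variants and a duplicate, a possible refinement (a sketch, not part of the original answer) is to de-duplicate the keywords and match case-insensitively with regex(ignore_case = TRUE). The keyword vector and the test strings below are made up for illustration.

```r
library(stringr)

# Hypothetical keyword list with a duplicate and a case variant.
keywords <- c("shotgun", "WGS", "whole genome", "WXS", "Whole genome shotgun", "WXS")

# Join the unique keywords with "|" and ignore case when matching.
pattern <- regex(paste(unique(keywords), collapse = "|"), ignore_case = TRUE)

x <- c("WHOLE GENOME SHOTGUN", "16S amplicon", "wgs run", NA)
hits <- str_detect(x, pattern)
hits
```

The pattern object can be passed to str_detect() inside filter() in the same way as the plain string; missing values still come back as NA and need the same care as above.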
