Tim N

Reputation: 3

dplyr filter function needs to be run three times to remove all matching rows

I'm writing a qPCR analysis script in R and have run into a strange issue: when I try to filter out "bad genes", I have to run the filter three times before all of them are removed.

Part of my analysis is identifying genes that lack enough data to be analyzed properly, which I do by finding primers with poor output in their technical replicates. To do this, I take the input xlsx file and do the following:

dat.group$CT <- as.numeric(dat.group$CT)
dat.group$Ct.SD <- as.numeric(dat.group$Ct.SD)

This coerces the non-numeric data (which I consider "bad data") to NAs. I then do the following:
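A minimal sketch of that coercion, assuming the column arrives as a character vector and that failed wells are marked with a string such as "Undetermined" (both are assumptions for illustration; the actual spreadsheet is not shown here):

```r
# Hypothetical CT values as they might be read from the spreadsheet
ct_raw <- c("24.31", "Undetermined", "25.02")

# as.numeric() turns anything it cannot parse into NA (and warns about it);
# suppressWarnings() only hides that warning in this sketch
ct_num <- suppressWarnings(as.numeric(ct_raw))

# TRUE marks the entries that became NA
is.na(ct_num)
# [1] FALSE  TRUE FALSE
```

If the column were a factor rather than character, as.numeric() would return the integer level codes instead, so as.numeric(as.character(x)) would be the safer form.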

badgenes <- dat.avg$Target.Name[is.na(dat.avg$ct_sd)]
badgenes
[1] "Gad2"  "Pitx3"

With this, I try to remove these genes from my data set as follows (dat.avg has the same names as dat.group; it has just been processed further, and Target.Name hasn't changed. I can show the processing if need be):

sum(dat.avg$Target.Name == badgenes)
dat.filt <- filter(dat.avg, Target.Name != badgenes)
sum(dat.filt$Target.Name == badgenes)
dat.filt <- filter(dat.filt, Target.Name != badgenes)
sum(dat.filt$Target.Name == badgenes)
dat.filt <- filter(dat.filt, Target.Name != badgenes)
sum(dat.filt$Target.Name == badgenes)

However, the output is:

[1] 4
[1] 2
[1] 2
[1] 0

And using regular R subsetting the same thing happens:

sum(dat.avg$Target.Name == badgenes)
dat.filt<-dat.avg[!(dat.avg$Target.Name == badgenes),]
sum(dat.filt$Target.Name == badgenes)
dat.filt<-dat.filt[!(dat.filt$Target.Name == badgenes),]
sum(dat.filt$Target.Name == badgenes)
dat.filt<-dat.filt[!(dat.filt$Target.Name == badgenes),]
sum(dat.filt$Target.Name == badgenes)

Giving:

[1] 4
[1] 2
[1] 2
[1] 0

I know that filtering multiple times "fixes" the issue, but I want to know why it happens in the first place, as it doesn't make much sense to me.

Thanks in advance!

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2    xlsx_0.5.7      xlsxjars_0.6.1  rJava_0.9-9     forcats_0.2.0   stringr_1.2.0   dplyr_0.7.4     purrr_0.2.4     readr_1.1.1     tidyr_0.7.2     tibble_1.3.4   
[12] ggplot2_2.2.1   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] reshape2_1.4.3   haven_1.1.0      lattice_0.20-35  colorspace_1.3-2 htmltools_0.3.6  yaml_2.1.16      rlang_0.1.4      foreign_0.8-69   glue_1.2.0       modelr_0.1.1    
[11] readxl_1.0.0     bindr_0.1        plyr_1.8.4       munsell_0.4.3    gtable_0.2.0     cellranger_1.1.0 rvest_0.3.2      evaluate_0.10.1  psych_1.7.8      labeling_0.3    
[21] knitr_1.20       parallel_3.4.1   broom_0.4.3      Rcpp_0.12.14     backports_1.1.2  scales_0.5.0     jsonlite_1.5     mnormt_1.5-5     hms_0.4.0        digest_0.6.13   
[31] stringi_1.1.6    grid_3.4.1       rprojroot_1.2    cli_1.0.0        tools_3.4.1      magrittr_1.5     lazyeval_0.2.1   crayon_1.3.4     pkgconfig_2.0.1  xml2_1.1.1      
[41] lubridate_1.7.1  assertthat_0.2.0 rmarkdown_1.9    httr_1.3.1       rstudioapi_0.7   R6_2.2.2         nlme_3.1-131     compiler_3.4.1  

Upvotes: 0

Views: 62

Answers (2)

anon

Building on Seymour's answer, if you do this sort of thing a lot, you could create a custom %!in% operator and filter with it.

`%!in%` <- Negate(`%in%`)
dat.filt <- filter(dat.avg, Target.Name %!in% badgenes)
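To see what the negated operator returns on its own, here is a small sketch on a plain character vector. The gene names other than Gad2 and Pitx3 are made up for illustration:

```r
# Negate() wraps %in% so that it returns TRUE for non-members
`%!in%` <- Negate(`%in%`)

# Hypothetical gene names, for illustration only
genes <- c("Gad2", "Actb", "Pitx3", "Gapdh")
badgenes <- c("Gad2", "Pitx3")

# One logical per element of genes: is it absent from badgenes?
keep <- genes %!in% badgenes
keep
# [1] FALSE  TRUE FALSE  TRUE

genes[keep]
# [1] "Actb"  "Gapdh"
```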

Upvotes: 2

Seymour

Reputation: 3264

It would be nice if you shared a minimal reproducible example.

However, the trick is to use %in%:

dat.filt <- filter(dat.avg, !(Target.Name %in% badgenes))

Since you only want to keep the elements that are NOT in the vector badgenes, simply put ! in front of the parentheses: !(Target.Name %in% badgenes)
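As for why several passes were needed: == and != compare element by element and recycle the shorter vector, so each row is checked against only one of the two bad genes, depending on its position. A row is dropped only when it happens to line up with its own name. The sketch below uses made-up data (not the asker's) to show the difference from %in%:

```r
# Hypothetical Target.Name column: each bad gene appears twice
target <- c("Gad2", "Gad2", "Actb", "Pitx3", "Pitx3", "Gapdh")
badgenes <- c("Gad2", "Pitx3")

# == recycles badgenes to the length of target, so this compares against
# c("Gad2", "Pitx3", "Gad2", "Pitx3", "Gad2", "Pitx3") position by position
eq_hits <- target == badgenes
eq_hits
# [1]  TRUE FALSE FALSE  TRUE FALSE FALSE

# %in% asks, for each element of target, whether it occurs anywhere in badgenes
in_hits <- target %in% badgenes
in_hits
# [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE

# Only the %in% version removes every bad gene in one pass
target[!in_hits]
# [1] "Actb"  "Gapdh"
```

Each filtering pass shifts the row positions, so different rows line up with their own name on the next pass, which is why repeating the filter eventually removed everything. When the longer length is not a multiple of the shorter one, R also warns that the longer object length is not a multiple of the shorter object length.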

Upvotes: 0
