reubenmcg

Reputation: 371

How do I filter with dplyr for a list of character strings, ignoring digits after a decimal point?

I have the following column in a data.frame called "id":

   example_0test0|EMM3.71|NTERM| 
   example_0test1|_EMM92.2|CTERM| 
   example_0test2|_EMM92.2|NTERM| 
   example_0test0|EMM1|NTERM| 
   example_0test0|EMM100|NTERM| 
   example_0test0|EMM1.11|NTERM| 
   example_0test0|EMM1.123|NTERM| 

I would like to use the dplyr filter function to match a list of exact terms. Keeping it simple, filtering for EMM1 should give the following output:

> test_df2
                              id col1 col2
1     example_0test0|EMM1|NTERM| 10.4 exp4
2  example_0test0|EMM1.11|NTERM| 10.3 exp6
3 example_0test0|EMM1.123|NTERM| 10.3 exp7

I have a list of search terms (stored as a factor) saved like this, which I would like to use as the filtering terms:

"EMM1|EMM101|EMM103|EMM104|EMM108.1|EMM11|EMM113|EMM114|EMM116.1|EMM118|EMM12|EMM123|EMM19.4|EMM197|EMM2|"

I have tried a combination of "filter" and "str_detect", which partly worked. However, if my search list includes "EMM1", I would also like to match things like "EMM1.0" or "EMM1.1".

I suspect that, since each EMM term in the column is enclosed in "|", as in "text|EMM1.0|text", there might be a way to use this for the filtering?

Here is a mini example of the type of data.frame I am working with:

> dput(test_df)
structure(list(id = c("example_0test0|EMM3.71|NTERM|", "example_0test1|_EMM92.2|CTERM|", 
"example_0test2|_EMM92.2|NTERM|", "example_0test0|EMM1|NTERM|", 
"example_0test0|EMM100|NTERM|", "example_0test0|EMM1.11|NTERM|", 
"example_0test0|EMM1.123|NTERM|"), col1 = c(10.1, 10.2, 10.3, 
10.4, 10.3, 10.3, 10.3), col2 = c("exp1", "exp2", "exp3", "exp4", 
"exp5", "exp6", "exp7")), class = "data.frame", row.names = c(NA, 
-7L))

Upvotes: 1

Views: 918

Answers (3)

Mike V

Reputation: 1364

Alternatively, you can use a base R approach:

 test_df[grepl("EMM1(\\.\\d{1,})|EMM1\\|", test_df$id),]
 #                          id col1 col2
 # 4     example_0test0|EMM1|NTERM| 10.4 exp4
 # 6  example_0test0|EMM1.11|NTERM| 10.3 exp6
 # 7 example_0test0|EMM1.123|NTERM| 10.3 exp7
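  • EMM1(\\.\\d{1,}): matches EMM1 followed by a literal dot and one or more digits (the parentheses group the dot and digits)
  • EMM1\\|: matches EMM1 immediately followed by a literal "|"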
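
As a side note, the two alternatives can be folded into one by making the decimal part optional and always requiring the closing "|". This is only a sketch checked against the ids from the question; the name `ids` is mine, and the two patterns are not equivalent in general (the compact one insists on the "|" after the digits).

```r
# ids copied from the question's test_df
ids <- c("example_0test0|EMM3.71|NTERM|", "example_0test1|_EMM92.2|CTERM|",
         "example_0test2|_EMM92.2|NTERM|", "example_0test0|EMM1|NTERM|",
         "example_0test0|EMM100|NTERM|", "example_0test0|EMM1.11|NTERM|",
         "example_0test0|EMM1.123|NTERM|")

# Pattern from the answer: EMM1 + ".digits", or EMM1 + "|"
original <- grepl("EMM1(\\.\\d{1,})|EMM1\\|", ids)

# Compact form: EMM1, an optional ".digits" part, then a literal "|"
compact <- grepl("EMM1(\\.\\d+)?\\|", ids)

# Both select the same rows on this data
identical(original, compact)
which(compact)
```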

Upvotes: 1

Ronak Shah

Reputation: 389205

You can use:

pat <- "EMM1|EMM101|EMM103|EMM104|EMM108.1|EMM11|EMM113|EMM114|EMM116.1|EMM118|EMM12|EMM123|EMM19.4|EMM197|EMM2"
subset(test_df, grepl(sprintf('(%s)(\\.|\\|)', pat), id))


#                              id col1 col2
#4     example_0test0|EMM1|NTERM| 10.4 exp4
#6  example_0test0|EMM1.11|NTERM| 10.3 exp6
#7 example_0test0|EMM1.123|NTERM| 10.3 exp7

pat consists of all the "EMM" values that we want. We then build a pattern with sprintf that matches only when one of those values is immediately followed by a "." or a "|".


We can also use this with filter and str_detect similarly.

library(dplyr)
library(stringr)

test_df %>% filter(str_detect(id, sprintf('(%s)(\\.|\\|)', pat)))
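
If the terms are stored exactly as in the question (with a trailing "|" and unescaped dots), the pattern could be built from that string rather than typed by hand. The following is a minimal sketch under that assumption; `terms_raw`, `terms`, `terms_escaped` and `pattern` are names I made up for illustration. Wrapping each dot as `[.]` makes "EMM108.1" match a literal dot instead of any character.

```r
# Data from the question
test_df <- data.frame(
  id = c("example_0test0|EMM3.71|NTERM|", "example_0test1|_EMM92.2|CTERM|",
         "example_0test2|_EMM92.2|NTERM|", "example_0test0|EMM1|NTERM|",
         "example_0test0|EMM100|NTERM|", "example_0test0|EMM1.11|NTERM|",
         "example_0test0|EMM1.123|NTERM|"),
  col1 = c(10.1, 10.2, 10.3, 10.4, 10.3, 10.3, 10.3),
  col2 = c("exp1", "exp2", "exp3", "exp4", "exp5", "exp6", "exp7"),
  stringsAsFactors = FALSE
)

# Search terms as stored in the question (note the trailing "|")
terms_raw <- "EMM1|EMM101|EMM103|EMM104|EMM108.1|EMM11|EMM113|EMM114|EMM116.1|EMM118|EMM12|EMM123|EMM19.4|EMM197|EMM2|"

# Split on the literal "|" and drop any empty pieces
terms <- strsplit(as.character(terms_raw), "|", fixed = TRUE)[[1]]
terms <- terms[nzchar(terms)]

# Turn every "." into "[.]" so it is matched literally in the regex
terms_escaped <- gsub(".", "[.]", terms, fixed = TRUE)

# Same shape as the pattern above: (term1|term2|...) followed by "." or "|"
pattern <- sprintf("(%s)(\\.|\\|)", paste(terms_escaped, collapse = "|"))

result <- test_df[grepl(pattern, test_df$id), ]
result
```

The `as.character()` call is there because the question mentions the terms are held in a factor.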

Upvotes: 1

akrun

Reputation: 887711

We can use str_detect

library(dplyr)
library(stringr)
test_df %>% 
    filter(str_detect(id, "EMM1\\||(EMM1\\.\\d+)"))
#                              id col1 col2
#1     example_0test0|EMM1|NTERM| 10.4 exp4
#2  example_0test0|EMM1.11|NTERM| 10.3 exp6
#3 example_0test0|EMM1.123|NTERM| 10.3 exp7

If we are filtering based on a column from another table, we could remove the "." and the digits that follow it, then build one pattern per term:

# df2 is assumed to be the other table, with the search terms in its id column
patvec <- sub("\\.\\d+$", "", df2$id)
i1 <- Reduce(`|`, lapply(paste0(patvec, "\\||(", patvec, "\\.\\d+)"),
      function(pat) grepl(pat, test_df$id)))
subset(test_df, i1)
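
Another way to use the "|" delimiters, sketched here without a combined regex, is to pull out the EMM field, strip the decimal part on both sides, and compare with `%in%`. This assumes the EMM value is always the second "|"-delimited field and that a leading "_" should be ignored; the contents of `df2` and the helper `strip_suffix` are hypothetical stand-ins.

```r
# Data from the question
test_df <- data.frame(
  id = c("example_0test0|EMM3.71|NTERM|", "example_0test1|_EMM92.2|CTERM|",
         "example_0test2|_EMM92.2|NTERM|", "example_0test0|EMM1|NTERM|",
         "example_0test0|EMM100|NTERM|", "example_0test0|EMM1.11|NTERM|",
         "example_0test0|EMM1.123|NTERM|"),
  col1 = c(10.1, 10.2, 10.3, 10.4, 10.3, 10.3, 10.3),
  col2 = c("exp1", "exp2", "exp3", "exp4", "exp5", "exp6", "exp7"),
  stringsAsFactors = FALSE
)

# Hypothetical second table holding the search terms
df2 <- data.frame(id = c("EMM1", "EMM92.2", "EMM2"), stringsAsFactors = FALSE)

# Second "|"-delimited field of each id, e.g. "EMM1.11" or "_EMM92.2"
field <- vapply(strsplit(test_df$id, "|", fixed = TRUE), `[`, character(1), 2)

# Drop an optional leading "_" and a trailing ".digits" part
strip_suffix <- function(x) sub("\\.\\d+$", "", sub("^_", "", x))

# Exact comparison of the stripped values, so "EMM100" never matches "EMM1"
keep <- strip_suffix(field) %in% strip_suffix(df2$id)
result <- test_df[keep, ]
result
```

Because the comparison is exact after stripping, no term can match as a prefix of a longer one.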

Upvotes: 1
