torakxkz

Reputation: 493

Special symbols pattern search with str_detect

Suppose I have the following df:

library(dplyr)
library(stringr)

input <- data.frame(
Id = c(1:6),
text = c("(714.4) (714) (714*)", "(714.33)", "(189) (1938.23)", "(714.93+) (714*)", "(719)", "(718.4)"))

And I would like to obtain the following output:

Output <- data.frame(
Id = c(1:6),
text = c("(714.4) (714) (714*)", "(714.33)", "(189) (1938.23)",
 "(714.93+) (714*)", "(719) (299)", "(718.4)"),
first_match = c(1,0,0,0,1,0),
second_match = c(1,1,0,1,1,0))

That is, for the first column I want a 1 if any of (714), (719) or (718) appears. For the second column I want a 1 if any of (714.33), (714*) or (719) appears.

When I want to check whether a pattern occurs in a string, I normally use the str_detect function from the stringr package. However, in this case the patterns contain symbols such as ( ) . + * and I am not getting the expected output.

I have tried the following code, which obviously failed:

attempt_1 <- input %>%
  mutate(first_match = ifelse(str_detect(text, "(714)|(719)|(718)"), 1, 0), 
         second_match = ifelse(str_detect(text, "(714\\.33)|(714\\*)|(719)"), 1, 0))

attempt_2 <- input %>%
 mutate(first_match = ifelse(str_detect(text, fixed("(714)|(719)")), 1, 0), 
        second_match = ifelse(str_detect(text, "(714\\.33)|(714\\*)"), 1, 0))

I tried escaping the special symbols, and I also tried an exact match with the fixed() modifier (I suppose that one fails because the | is not interpreted as an OR).

Any ideas?

Upvotes: 2

Views: 1696

Answers (1)

akrun

Reputation: 886968

We can escape the ( and ) with \\ so that they are matched literally:

library(dplyr)
library(stringr)
input %>%
    mutate(first_match = +(str_detect(text, "\\(714\\)|\\(719\\)|\\(718\\)")),
           second_match = +(str_detect(text, "\\(714\\.33\\)|\\(714\\*\\)|\\(719\\)")))
#   Id                 text first_match second_match
#1  1 (714.4) (714) (714*)           1            1
#2  2             (714.33)           0            1
#3  3      (189) (1938.23)           0            0
#4  4     (714.93+) (714*)           0            1
#5  5                (719)           1            1
#6  6              (718.4)           0            0

Comparing with OP's expected output

Output
#  Id                 text first_match second_match
#1  1 (714.4) (714) (714*)           1            1
#2  2             (714.33)           0            1
#3  3      (189) (1938.23)           0            0
#4  4     (714.93+) (714*)           0            1
#5  5          (719) (299)           1            1
#6  6              (718.4)           0            0
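The comparison can also be done programmatically instead of by eye. The following is a minimal sketch, assuming only stringr is loaded. It rebuilds the text vector from the question, applies escaped versions of the three patterns the question lists for each column, and checks the results against the values in Output. The variable names (expected_first, first_ok, and so on) are made up for illustration.

```r
library(stringr)

text <- c("(714.4) (714) (714*)", "(714.33)", "(189) (1938.23)",
          "(714.93+) (714*)", "(719)", "(718.4)")

# Match columns computed with escaped patterns; unary + turns TRUE/FALSE into 1/0.
first_match  <- +(str_detect(text, "\\(714\\)|\\(719\\)|\\(718\\)"))
second_match <- +(str_detect(text, "\\(714\\.33\\)|\\(714\\*\\)|\\(719\\)"))

# Expected values copied from the question's Output data frame.
expected_first  <- c(1, 0, 0, 0, 1, 0)
expected_second <- c(1, 1, 0, 1, 1, 0)

# all.equal() tolerates the integer/double difference that identical() would flag.
first_ok  <- isTRUE(all.equal(first_match, expected_first))
second_ok <- isTRUE(all.equal(second_match, expected_second))
```

Only the match columns are compared here, because the text column differs in row 5 between the question's input and its Output.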

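In the OP's code, the first attempt didn't work because ( and ) are regex metacharacters: unescaped, they form a capture group, so "(714)" matches the digits 714 anywhere, including inside "(714.33)". In the second attempt, fixed() treats the whole pattern literally, so the | is matched as a plain character rather than as an OR.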
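Both failure modes can be reproduced on single strings, and fixed() can still be used if each alternative is tested separately and the results are combined with a logical OR. The sketch below assumes only stringr. The helper detect_any_literal is a hypothetical name introduced here for illustration; it is not part of stringr.

```r
library(stringr)

# Unescaped parentheses form a capture group, so the pattern matches the bare digits 714.
unescaped_hit <- str_detect("(714.33)", "(714)")      # TRUE, although "(714)" is absent
escaped_hit   <- str_detect("(714.33)", "\\(714\\)")  # FALSE once the parentheses are escaped

# fixed() matches the whole string literally, including the "|".
fixed_hit <- str_detect("(714)", fixed("(714)|(719)"))  # FALSE

# Hypothetical helper: TRUE where any of the literal strings occurs in x.
detect_any_literal <- function(x, literals) {
  Reduce(`|`, lapply(literals, function(p) str_detect(x, fixed(p))))
}

text <- c("(714.4) (714) (714*)", "(714.33)", "(189) (1938.23)",
          "(714.93+) (714*)", "(719)", "(718.4)")

first_literal  <- +detect_any_literal(text, c("(714)", "(719)", "(718)"))
second_literal <- +detect_any_literal(text, c("(714.33)", "(714*)", "(719)"))
```

With this approach nothing has to be escaped by hand, at the cost of one str_detect call per alternative.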

Upvotes: 3
