How to properly use a regex statement in R's stringr

Question

How can I extract specific substrings with stringr, based on a pattern?

For example, if I have the following coefficient in a tidy model table:

I(pmax(0, hp - 100))

I want to create two additional columns containing hp and 100.

Example code:

library(tidyverse)
library(broom)
library(stringr)

# Pull in and prepare the data

mtcars1 <- as_tibble(mtcars)
mtcars1$cyl <- as.factor(mtcars$cyl)

# Run the model and produce the model-summary table
model <- glm(mpg ~ cyl + hp + I(pmax(0, hp - 100)), data = mtcars1)

model_summary <- tidy(model)

I tried the following. The regex works on regex101.com, but not in R.

model_summary_hp <- model_summary %>%
  mutate(term1 = str_extract(term, regex("\I\(pmax\(0, ([a-z]+)\ - 100\)\)")),
         knot  = str_extract(term, regex("\I\(pmax\(0, [a-z]+ - ([0-9]+)\)\)")))

I get the following error:

Error: '\I' is an unrecognized escape in character string starting ""\I"

I'm not sure why it doesn't recognize the regex statement.

Wiktor Stribiżew · Accepted Answer

One very important thing is to understand how to use an online regex tester: a pattern that works there will not necessarily work the same way in your target environment. Since you are using stringr functions, your patterns must be compatible with the ICU regex engine, while regex101 only supports the PCRE, JS, Python re and Go regex engines. Mind that if you use sub or gsub instead, the regex must be compatible with the TRE regex engine, or with PCRE when you pass perl=TRUE.
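The error in the question comes from R's string parser, before any regex engine is involved: inside an R string literal a backslash starts an escape sequence, and \I is not a valid one. A minimal sketch of the usual fix, doubling every backslash, follows. The variable names are illustrative, and the raw-string form assumes R 4.0 or newer.

```r
library(stringr)

term <- "I(pmax(0, hp - 100))"

# The regex engine must receive \( to match a literal parenthesis.
# In an R string literal that is written "\\(", because "\" starts an escape.
pattern <- "I\\(pmax\\(0, ([a-z]+) - 100\\)\\)"
writeLines(pattern)  # shows the pattern as the regex engine sees it

# Assumption: R >= 4.0, where raw strings avoid the doubling.
pattern_raw <- r"(I\(pmax\(0, ([a-z]+) - 100\)\))"
same_pattern <- identical(pattern, pattern_raw)

# str_extract() returns the whole match, not the capture group.
whole_match <- str_extract(term, pattern)
```

Note that even with the escapes fixed, str_extract returns the full matched text rather than the parenthesized group, so this pattern alone would give back the entire term.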

Now, you need to extract two values, and that means you need two str_extract or sub calls.

A stringr approach:

1) "(?<=I\\(pmax\\(0, )[a-z]+"          # or
   "(?<=I\\(pmax\\(0,\\s{0,10})[a-z]+"

2) "\\d+(?=\\)\\))"

Here, the main points are the lookarounds (shown as plain regex; in an R string literal each backslash is doubled). The positive lookbehind (?<=I\(pmax\(0, ) requires I(pmax(0, followed by a space immediately to the left of the current location, but does not put the matched text into the match value. The (?=\)\)) pattern is a positive lookahead that requires the presence of )) immediately to the right of the current location.
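As a rough illustration of the two lookaround patterns, the sketch below applies them to a hand-written character vector that mimics the term column of the question's model_summary (the vector itself is an assumption, typed in here so the example runs without fitting the model).

```r
library(stringr)

# Hypothetical term column, mimicking tidy(model)$term from the question.
term <- c("(Intercept)", "cyl6", "cyl8", "hp", "I(pmax(0, hp - 100))")

# Variable name: lowercase letters preceded by the literal text "I(pmax(0, ".
term1 <- str_extract(term, "(?<=I\\(pmax\\(0, )[a-z]+")

# Knot: digits followed by the literal text "))".
knot <- str_extract(term, "\\d+(?=\\)\\))")
```

Terms that do not contain the pmax expression come back as NA, so the same two calls can be used directly inside mutate() on the full table.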

Note that the second version of the first regex will not work at regex101.com: the \s{0,10} quantifier makes the lookbehind pattern constrained-width rather than fixed-width, which the ICU engine accepts.
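A small sketch of why the constrained-width variant can be useful: it tolerates a varying amount of whitespace after the comma. The input strings are made up for illustration.

```r
library(stringr)

# Hypothetical terms with zero, one, and several spaces after the comma.
spaced <- c("I(pmax(0,hp - 100))",
            "I(pmax(0, hp - 100))",
            "I(pmax(0,    wt - 3))")

# ICU accepts a bounded quantifier such as {0,10} inside a lookbehind.
vars <- str_extract(spaced, "(?<=I\\(pmax\\(0,\\s{0,10})[a-z]+")
```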

A sub approach (TRE regex):

1) sub("I\\(pmax\\(\\d+,\\s*([a-z]+)\\s*-\\s*\\d+\\)\\)", "\\1", term)

2) sub("I\\(pmax\\(\\d+,\\s*[a-z]+\\s*-\\s*(\\d+)\\)\\)", "\\1", term)

Here, the point is to match the whole string, capture the part you need, and replace the match with a backreference to that group, \1 (written "\\1" in an R string literal).
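A minimal base R sketch of the sub approach follows. One difference from str_extract is worth keeping in mind: sub returns the input unchanged when the pattern does not match, so the example also shows one possible guard with grepl. The term vector and variable names are illustrative assumptions.

```r
# Hypothetical terms: one that does not match and one that does.
term <- c("cyl6", "I(pmax(0, hp - 100))")

pattern_var  <- "I\\(pmax\\(\\d+,\\s*([a-z]+)\\s*-\\s*\\d+\\)\\)"
pattern_knot <- "I\\(pmax\\(\\d+,\\s*[a-z]+\\s*-\\s*(\\d+)\\)\\)"

# Replace the whole match with the captured group.
var_name <- sub(pattern_var, "\\1", term)
knot_raw <- sub(pattern_knot, "\\1", term)

# Guard: keep NA for terms that do not match, instead of the unchanged input.
knot <- ifelse(grepl(pattern_knot, term),
               sub(pattern_knot, "\\1", term),
               NA_character_)
```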
