Discrepancy between stringr and base R regular expressions

Question

I have this rather hard-to-read regular expression:

pattern = "(?<=(?<=[0-9])[dD](?=[0-9]))[0-9]+"

It was generated automatically, so human readability and efficiency matter less than validity. It is meant to parse RPG dice notation such as 10d20; specifically, it is supposed to match the 20.

If I use the old method of string matching in R

text = '10d20'
regmatches(text,regexpr(pattern,text,perl = TRUE))

I get what I want, which is 20. However, using the more modern method of string matching

stringr::str_match(text,  pattern)

I get nothing. What causes this difference between the two methods, and how can I avoid issues like this in the future?

hrbrmstr · Accepted Answer

Unless you need the extras that come with ICU (via stringi, for which stringr is merely a helper wrapper), there's no need for woe.

In fact, there's a package with less marketing power than the tidyverse-based ones, called stringb, which puts "data first" (like string[ir]) and relieves you of base regexp inanity. For example:

library(stringb)

pattern <- "(?<=(?<=[0-9])[dD](?=[0-9]))[0-9]+"

text <- '10d20'

text_extract(text, pattern, perl = TRUE)
## [1] "20"

You get saner syntax without relying on massive compiled-code dependencies or the one-step-removed* stringr abstraction. Bellissimo!

* To be fair: the stringb package is also a one-step-removed abstraction over base R functions, but the saner syntax makes up for it IMO (unlike stringr).
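
On avoiding this kind of issue in future: one possible approach, offered only as a sketch and not as a diagnosis of the discrepancy, is to stop relying on look-arounds and use a plain capture group, which needs no PCRE-specific features. The simplified pattern and the extra test strings below are my own assumptions, not part of the original generated regex.

```r
# Base R only; no packages required.
# The extra inputs ("3D6", "no dice here") are illustrative assumptions.
text <- c("10d20", "3D6", "no dice here")

# Capture group instead of look-arounds: a digit, then d or D,
# then the run of digits we actually want.
pattern <- "[0-9][dD]([0-9]+)"

# regexec() reports the full match plus each capture group;
# regmatches() turns that into one character vector per input string.
matches <- regmatches(text, regexec(pattern, text))

# Keep the first capture group (element 2), or NA where nothing matched.
sides <- vapply(
  matches,
  function(m) if (length(m) >= 2) m[2] else NA_character_,
  character(1)
)

print(sides)
```

With stringr, the same capture-group pattern should be usable via str_match(text, pattern)[, 2], since str_match returns the full match in the first column and the capture groups in the following ones.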
