user2716722

Reputation: 93

Running regex in R using str_extract_all has regexp not yet implemented

I am trying to parse a file with a regex. Most of the solutions for regex in R that I have found use the stringr package, and I have not found another approach or package that would work. If you have another way of going about this, that would also be acceptable.

What I am trying to accomplish is to grab a few values that are separated by spaces, where the last value is a comma-separated list of variable length. The result should go into a matrix or data frame, in the same table-like layout the data has now:

foo     foo_123bar      foo,bar,bazz
foo2    foo_456bar      foo2,bar2

I have the working example of my regex here.

There are a couple of issues I could be running into. The first is that the regex I am writing may not be supported by R's regex engine, although I have the feeling from this that it would be. I have seen that R uses a POSIX-like format, which could make things interesting. The second is simply what the error message below says: this is a feature that has not been implemented yet. That would be the most troubling case, because I don't know another way to solve my problem without this package.

Below is the R code that I am using to replicate this error

library("stringr")

string = " foo  foo_123bar      foo,bar,bazz\n  foo2    foo_456bar      foo2,bar2,bazz2"

pattern = "
  (?(DEFINE)
    (?<blanks>[[:blank:]]+)
    (?<var>\"?[[:alnum:]_]+\"?)
    (?<csvar>(\"?[[:alnum:]_]+\"?,?)+)
  )
  ^
    (?&blanks)((?&var))
    (?&blanks)((?&var))
    (?&blanks)((?&csvar))"

# Both of these are throwing the error
str_extract_all(string, pattern)
str_extract_all(string, regex(pattern, multiline=TRUE, comments=TRUE))

> Error in stri_extract_all_regex(string, pattern, simplify = simplify,  : 
> Use of regexp feature that is not yet implemented. (U_REGEX_UNIMPLEMENTED)


# Using the example from ?str_extract_all runs without error
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
str_extract_all(shopping_list, "\\b[a-z]+\\b", simplify = TRUE)

I am looking for a solution, not necessarily a stringr one, but stringr is the only option I have found that fits my needs. The simpler base R regex functions only accept the pattern, not the extra parameters for the multiline and comments modes that I am using.
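As a regex-free point of comparison: the sample above is plain whitespace-delimited text, so base R's read.table can already split it into columns. This is only a minimal sketch, assuming no field contains embedded whitespace; the column names name, id and values are made up for illustration.

```r
string <- " foo  foo_123bar      foo,bar,bazz\n  foo2    foo_456bar      foo2,bar2,bazz2"

# The default sep = "" splits on runs of whitespace and skips leading blanks.
# Column names are hypothetical, chosen only for this example.
df <- read.table(
  text = string,
  header = FALSE,
  col.names = c("name", "id", "values"),
  colClasses = "character",
  stringsAsFactors = FALSE
)

# Split the comma-separated third column into one character vector per row.
values <- strsplit(df$values, ",", fixed = TRUE)

print(df)
print(values)
```

The third column stays a single string per row in df, while values holds the variable-length pieces if they are needed separately.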

Upvotes: 2

Views: 907

Answers (1)

Wiktor Stribiżew

Reputation: 626926

Your pattern is a PCRE regex: it can only be used in functions that parse the regex with the PCRE library (or with Boost, whose Perl-style syntax supports the same constructs). stringr's str_extract parses the regex with the ICU regex library, and ICU does not support recursion (subroutine calls such as (?&blanks)) or the (?(DEFINE)...) block. With ICU you simply cannot define subpatterns inside the pattern and then reuse them.
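To illustrate the PCRE side of that distinction: base R's regex functions use PCRE when called with perl = TRUE, and the comments mode can be switched on inline with (?x) instead of through a function argument. The following is a sketch under those assumptions, not a tested drop-in. It splits the input into lines first because regexec only returns the first match per string, and it takes the last three capture groups because the groups inside the DEFINE block are numbered too.

```r
string <- " foo  foo_123bar      foo,bar,bazz\n  foo2    foo_456bar      foo2,bar2,bazz2"

# (?x) enables free-spacing mode, so the layout whitespace in the pattern is ignored.
pattern <- "(?x)
  (?(DEFINE)
    (?<blanks>[[:blank:]]+)
    (?<var>\"?[[:alnum:]_]+\"?)
    (?<csvar>(?:\"?[[:alnum:]_]+\"?,?)+)
  )
  ^
    (?&blanks)((?&var))
    (?&blanks)((?&var))
    (?&blanks)((?&csvar))"

# One string per input line, so ^ can anchor each record without multiline mode.
lines <- strsplit(string, "\n", fixed = TRUE)[[1]]

# perl = TRUE hands the pattern to PCRE; each result holds the full match
# followed by every capture group, including the three DEFINE groups.
m <- regmatches(lines, regexec(pattern, lines, perl = TRUE))

# Keep only the last three groups (the real captures), one row per line.
fields <- t(sapply(m, tail, 3))
print(fields)
```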

With stringr, declare the regex parts you need to reuse as variables and build the pattern dynamically:

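library("stringr")
string <- " foo  foo_123bar      foo,bar,bazz\n  foo2    foo_456bar      foo2,bar2,bazz2"
# Reusable building blocks, kept as plain R strings
blanks <- "[[:blank:]]+"
vars <- "\"?[[:alnum:]_]+\"?"
csvar <- "(?:\"?[[:alnum:]_]+\"?,?)+"
# Assemble the pattern, wrapping each field in a capturing group
pattern <- paste0("^", blanks, "(", vars, ")", blanks, "(", vars, ")", blanks, "(", csvar, ")")
# Without multiline mode, ^ matches only at the start of the string,
# so only the first line is returned
str_match_all(string, pattern)
# [[1]]
#     [,1]                                 [,2]  [,3]         [,4]          
#[1,] " foo  foo_123bar      foo,bar,bazz" "foo" "foo_123bar" "foo,bar,bazz"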

Note: you need to use str_match (or str_match_all) to extract the capturing group values, since str_extract and str_extract_all only return the whole match.
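To pick up every line rather than just the first, the same dynamically built pattern can be wrapped in regex(..., multiline = TRUE) so that ^ also matches after each newline. The sketch below then turns the capture columns into a data frame; the column names name, id and values are assumptions chosen for illustration.

```r
library(stringr)

string <- " foo  foo_123bar      foo,bar,bazz\n  foo2    foo_456bar      foo2,bar2,bazz2"
blanks <- "[[:blank:]]+"
vars   <- "\"?[[:alnum:]_]+\"?"
csvar  <- "(?:\"?[[:alnum:]_]+\"?,?)+"
pattern <- paste0("^", blanks, "(", vars, ")", blanks, "(", vars, ")", blanks, "(", csvar, ")")

# multiline = TRUE lets ^ match at the start of every line, not only the string.
m <- str_match_all(string, regex(pattern, multiline = TRUE))[[1]]

# Column 1 is the whole match; columns 2 to 4 are the capturing groups.
df <- data.frame(
  name   = m[, 2],
  id     = m[, 3],
  values = m[, 4],
  stringsAsFactors = FALSE
)
print(df)
```

If the comma-separated pieces are needed individually, strsplit(df$values, ",", fixed = TRUE) returns one character vector per row.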

Upvotes: 3
