Reputation: 93
I am trying to use regex to parse a file using regex. Most of the solutions to using regex in R use the stringr package. I have not found another way, or another package to use that would work. If you have another way of going about this that would also be acceptable.
What I am trying to accomplish is to grab a couple of values that are seperated by spaces with the last value being some comma seperated values of variable length. This should go into a matrix or df in table like format is it is currently.
foo foo_123bar foo,bar,bazz
foo2 foo_456bar foo2,bar2
I have the working example of my regex here.
There could be a couple of issues I could be running into. The first could be that the regex I am writing is not supported by R's regex engine. Although I have the feeling from this that would be supported. I have seen that R uses a POSIX like format which could make things interesting. The second simply could be exactly what the error message bellow is showing. This is not a feature that has been coded in yet. This however would be the most troubling because I don't know another way to solve my problem without this package.
Below is the R code that I am using to replicate this error
library("stringr")
string = " foo foo_123bar foo,bar,bazz\n foo2 foo_456bar foo2,bar2,bazz2"
pattern = "
(?(DEFINE)
(?<blanks>[[:blank:]]+)
(?<var>\"?[[:alnum:]_]+\"?)
(?<csvar>(\"?[[:alnum:]_]+\"?,?)+)
)
^
(?&blanks)((?&var))
(?&blanks)((?&var))
(?&blanks)((?&csvar))"
# Both of these are throwing the error
str_extract_all(string, pattern)
str_extract_all(string, regex(pattern, multiline=TRUE, comments=TRUE))
> Error in stri_extract_all_regex(string, pattern, simplify = simplify, :
> Use of regexp feature that is not yet implemented. (U_REGEX_UNIMPLEMENTED)
# Using the example from ?str_extract_all runs without error
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
str_extract_all(shopping_list, "\\b[a-z]+\\b", simplify = TRUE)
I am looking for a solution, not necessarily a stringr solution, but this is the only way I found that fits my needs. The other simpler R regex functions only accept the pattern and not the extra parameters that include the multi line and comment functionality that I am using.
Upvotes: 2
Views: 907
Reputation: 626926
You have a PCRE regex that can only be used in methods/functions that parse the regex with the PCRE regex library (or Boost, it is based on PCRE). stringr str_extract
parses the regex with the ICU regex library. ICU regex does not support recursion and DEFINE
block. You just can't use the in-pattern approach to define subpatterns and then re-use them.
Instead, just declare the regex parts you need to re-use as variables and build the pattern dynamically:
library("stringr")
string = " foo foo_123bar foo,bar,bazz\n foo2 foo_456bar foo2,bar2,bazz2"
blanks <- "[[:blank:]]+"
vars <- "\"?[[:alnum:]_]+\"?"
csvar <- "(?:\"?[[:alnum:]_]+\"?,?)+"
pattern <- paste0("^",blanks,"(", vars, ")",blanks,"(", vars,")",blanks,"(",csvar, ")")
str_match_all(string, pattern)
# [[1]]
# [,1] [,2] [,3] [,4]
#[1,] " foo foo_123bar foo,bar,bazz" "foo" "foo_123bar" "foo,bar,bazz"
Note: you need to use str_match
(or str_match_all
) to extract the capturing group values as str_extract
or str_extract_all
only allows access to the whole match values.
Upvotes: 3