exl

Reputation: 1853

Lookaround (lookahead and lookbehind) regex in R

I am trying to extract some text with regular expressions using the stringr package. For some reason, I'm getting an 'Invalid regexp' error. I have tried the expression in some online regex testers, and it works there. Is there something unique about how regex works in R, and in the stringr package in particular?

Here is an example:

library(stringr)

string <- c("MARKETING:  Vice President", "FINANCE:  Accountant I",
            "OPERATIONS: Plant Manager")

pattern <- "[A-Z]+(?=:)"
test <- gsub(" ","",string)
results <- str_extract(test, pattern)

This doesn't seem to work. I would like to get "MARKETING", "FINANCE", and "OPERATIONS" without the ":" in them, which is why I'm using the lookahead syntax. I realize that I can work around this using:

pattern <- "[A-Z]+(:)"
test <- gsub(" ","",string)
results <- gsub(":","",str_extract(test, pattern))

But I anticipate that I might need to use lookarounds for more complex situations than this in the near future.

Do I need to amend the regex with some escapes or something to make this work?

Upvotes: 6

Views: 2513

Answers (2)

Matthew Plourde

Reputation: 44614

To use lookahead assertions in R, you need to mark the pattern as a Perl regular expression.

str_extract(string, perl(pattern))
# [1] "MARKETING"  "FINANCE"    "OPERATIONS"

You can also do this easily in base R:

regmatches(string, regexpr(pattern, string, perl=TRUE))
# [1] "MARKETING"  "FINANCE"    "OPERATIONS"

regexpr finds the matches, and regmatches uses the match data to extract the substrings.
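Since the question mentions needing lookarounds for more complex cases later, here is a minimal base R sketch that pairs the lookahead above with a lookbehind. The variable names dept and title are my own, and trimming the leading spaces with trimws() is just one way to handle the variable whitespace after the colon. Both patterns rely on perl = TRUE.

```r
string <- c("MARKETING:  Vice President", "FINANCE:  Accountant I",
            "OPERATIONS: Plant Manager")

# Lookahead: capital letters immediately followed by a colon.
# The colon is asserted but not included in the match.
dept <- regmatches(string, regexpr("[A-Z]+(?=:)", string, perl = TRUE))

# Lookbehind: everything that comes after the colon.
# The colon is asserted but not included; trimws() drops the leading spaces.
title <- trimws(regmatches(string, regexpr("(?<=:).+", string, perl = TRUE)))

print(dept)
print(title)
```

Note that a PCRE lookbehind has to match a fixed number of characters, which is why the spaces are removed afterwards rather than inside the lookbehind itself.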

Upvotes: 6

Justin

Reputation: 43255

You can do this directly with sub and grouping.

sub('^([A-Z]+):.*$', '\\1', string)

# [1] "MARKETING"  "FINANCE"    "OPERATIONS"

Here the group is anchored to the start of the string and captures one or more capital letters. They must be followed by a colon and then zero or more additional characters. The whole match is replaced with the captured group, \\1.
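One difference between the two approaches that may matter in practice is what happens to elements that do not match. The sketch below uses a made-up second element with no colon to show it; the names x, from_sub, and from_regmatches are only for illustration.

```r
x <- c("MARKETING:  Vice President", "no department here")

# sub() returns non-matching elements unchanged,
# so the result has the same length as the input.
from_sub <- sub("^([A-Z]+):.*$", "\\1", x)

# regmatches() with regexpr() match data drops non-matching elements,
# so the result can be shorter than the input.
from_regmatches <- regmatches(x, regexpr("[A-Z]+(?=:)", x, perl = TRUE))

print(from_sub)
print(from_regmatches)
```

If the result needs to stay aligned with the input, for example as a new data frame column, the sub version keeps the lengths equal, but you would have to check for unmatched values yourself.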

Upvotes: 2
