Reputation: 1853
I am trying to extract some text using regular expressions with the stringr package. For some reason, I'm getting an 'Invalid regexp' error. I have tried the regex in some website test tools, and it seems to work there. I was wondering if there is something unique about how regex works in R, and particularly in the stringr package.
Here is an example:
library(stringr)
string <- c("MARKETING: Vice President", "FINANCE: Accountant I",
            "OPERATIONS: Plant Manager")
pattern <- "[A-Z]+(?=:)"
test <- gsub(" ","",string)
results <- str_extract(test, pattern)
This doesn't seem to be working. I would like to get "MARKETING", "FINANCE", and "OPERATIONS" without the ":" in them. That is why I'm using the lookahead syntax. I realize that I can just work around this using:
pattern <- "[A-Z]+(:)"
test <- gsub(" ","",string)
results <- gsub(":","",str_extract(test, pattern))
But I anticipate that I might need to use lookarounds for more complex situations than this in the near future.
Do I need to amend the regex with some escapes or something to make this work?
Upvotes: 6
Views: 2513
Reputation: 44614
Lookahead assertions are a Perl-style regex feature, so you need to tell R to treat the pattern as a Perl regular expression.
str_extract(string, perl(pattern))
# [1] "MARKETING" "FINANCE" "OPERATIONS"
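For readers on a newer stringr: as far as I know, stringr 1.0.0 and later is built on stringi and the ICU regex engine, which supports lookarounds directly, and the perl() wrapper was deprecated there. Under that assumption (please check your installed version), a minimal sketch would pass the pattern as-is:

```r
library(stringr)  # assumes stringr >= 1.0.0 (ICU-based engine)

string <- c("MARKETING: Vice President", "FINANCE: Accountant I",
            "OPERATIONS: Plant Manager")

# Lookahead: match capital letters only when a colon follows,
# without including the colon in the match.
pattern <- "[A-Z]+(?=:)"

results <- str_extract(string, pattern)
results
```

If this raises an error about the pattern, you are probably on an older stringr, and the perl(pattern) form above applies instead.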
You can also do this easily in base R:
regmatches(string, regexpr(pattern, string, perl=TRUE))
# [1] "MARKETING" "FINANCE" "OPERATIONS"
regexpr finds the matches, and regmatches uses the match data to extract the substrings.
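To make the two steps visible, here is a small base R sketch on the question's data. The variable name m is only for illustration.

```r
string <- c("MARKETING: Vice President", "FINANCE: Accountant I",
            "OPERATIONS: Plant Manager")
pattern <- "[A-Z]+(?=:)"

# Step 1: regexpr() returns the start position of the first match in each
# element; the match lengths are stored in the "match.length" attribute.
m <- regexpr(pattern, string, perl = TRUE)
starts  <- as.integer(m)
lengths <- as.integer(attr(m, "match.length"))

# Step 2: regmatches() uses those positions and lengths to cut out
# the matched substrings.
extracted <- regmatches(string, m)
extracted
```

One thing to keep in mind: regmatches drops elements that have no match, so the result can be shorter than the input vector.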
Upvotes: 6
Reputation: 43255
You can do this directly with sub and grouping.
sub('^([A-Z]+):.*$', '\\1', string)
# [1] "MARKETING" "FINANCE" "OPERATIONS"
Here the group is anchored to the start of the string, matches one or more capital letters, and saves them. They must be followed by a colon (:) and then zero or more additional characters.
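The pieces of that pattern can be annotated as follows. This is just a sketch in base R; the no-colon input is a made-up example to show what happens when nothing matches.

```r
string <- c("MARKETING: Vice President", "FINANCE: Accountant I",
            "OPERATIONS: Plant Manager")

# ^([A-Z]+)  capture one or more capital letters at the start of the string
# :.*$       then a colon and everything after it (matched, not captured)
# "\\1"      replace the whole match with the captured group
departments <- sub("^([A-Z]+):.*$", "\\1", string)

# Hypothetical input without a colon: the pattern does not match,
# so sub() returns the string unchanged.
no_match <- sub("^([A-Z]+):.*$", "\\1", "no department here")
```

Because sub returns non-matching input untouched rather than NA, it may be worth checking with grepl first if some rows might not follow the "NAME: title" shape.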
Upvotes: 2