done_merson

Reputation: 2998

Can't figure out why regex group is not working in str_match

I have the following code with a regex:

CHARACTER <- "^([A-Z0-9 .])+(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
str_match("WILL (V.O.)",CHARACTER)[1,2]

I thought group 1 would capture "WILL ", but it comes back blank. I tried the regex in a different language and the group comes back blank there too.

What do I have to add to this regex to pull back just the value "WILL"?

Upvotes: 1

Views: 47

Answers (1)

Wiktor Stribiżew

Reputation: 627077

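You formed a repeated capturing group by placing + outside the group. A repeated capturing group keeps only the text from its last iteration, which here is the single space before (V.O.), so the result looks blank. Put the + back inside the group: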

CHARACTER <- "^([A-Z0-9 .]+)(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
                          ^
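
To make the difference concrete, here is a minimal sketch comparing the two quantifier placements on the string from the question. It assumes stringr is installed; the patterns are shortened to the (V.O.) suffix only, and the variable names are chosen for illustration.

```r
library(stringr)

x <- "WILL (V.O.)"

# + outside the group: the group is re-captured on every repetition,
# so only the last character it consumed (the space) is kept.
repeated_group <- "^([A-Z0-9 .])+(?:\\(V\\.O\\.\\))?$"
last_only <- str_match(x, repeated_group)[1, 2]

# + inside the group: the whole run of characters is captured at once.
quantified_class <- "^([A-Z0-9 .]+)(?:\\(V\\.O\\.\\))?$"
whole_run <- str_match(x, quantified_class)[1, 2]

print(last_only)  # " "
print(whole_run)  # "WILL "
```

The first result is a single space rather than an empty string, which is why it looks blank when printed.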

Note that you can also trim the trailing space from WILL by making the quantifier inside the group lazy and adding \s* after the group:

CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"

See this regex demo.

> library(stringr)
> CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
> str_match("WILL (V.O.)",CHARACTER)[1,2]
[1] "WILL"

Alternatively, you may extract WILL directly with a lazy match followed by a positive lookahead:

> str_extract("WILL (V.O.)", "^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)")
[1] "WILL"

Or the same with base R:

> x <- "WILL (V.O.)"
> regmatches(x, regexpr("^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)", x, perl=TRUE))
[1] "WILL"

Here,

  • ^ - matches the start of a string
  • .*? - any 0+ chars other than line break chars, as few as possible
  • (?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$) - a positive lookahead that makes sure that, immediately to the right of the current location, there is
    • \\s* - 0+ whitespaces
    • (?:\\(V\\.O\\.\\))? - an optional (V.O.) substring
    • (?:\\(O\\.S\\.\\))? - an optional (O.S.) substring
    • (?:\\(CONT'D\\))? - an optional (CONT'D) substring
    • $ - end of string.
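
As a quick check of the breakdown above, the sketch below applies the same lookahead pattern to a few character cues. It assumes stringr is installed; the cue strings other than "WILL (V.O.)" are made-up samples, not data from the question.

```r
library(stringr)

# Lazy match from the start, stopped by a lookahead for the optional suffixes.
name_only <- "^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)"

# Hypothetical sample cues covering each optional suffix and no suffix at all.
cues <- c("WILL (V.O.)", "SEAN (O.S.)", "WILL (CONT'D)", "SKYLAR")

names_found <- str_extract(cues, name_only)
print(names_found)  # "WILL" "SEAN" "WILL" "SKYLAR"
```

Because every suffix in the lookahead is optional, a cue without any suffix is returned unchanged.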

Upvotes: 1
