Regular expression: matching multiple words

Question

I am using regular expressions in R to extract strings from a variable. The variable contains distinct values that look like:

MEDIUM /REGULAR INSEAM

XX LARGE /SHORT INSEAM

SMALL /32" INSM

X LARGE /30" INSM

I have to capture two things: the value before the / as a whole (SMALL, XX LARGE) and the string (alphabetic or numeric) after it. I don't want the " INSM or the INSEAM part.

The regular expression I am using for the first two is ([A-Z]\w+) \/([A-Z]\w+) INSEAM and for the last two I am using ([A-Z]\w+) \/([0-9][0-9])[" INSM]. The part ([A-Z]\w+) only captures one word, so it works fine for MEDIUM and SMALL, but fails for X LARGE, XX LARGE, etc. Is there a way I can modify it to capture two occurrences of a word before the / character? Or is there a better way to do it?

Thanks in advance!

Wiktor Stribiżew · Accepted Answer

It seems you can use

(\w+(?: \w+)?) */ *(\w+)

See the regex demo

Pattern details:

(\w+(?: \w+)?) - Group 1 capturing one or more word chars followed by an optional sequence of a space + one or more word chars
*/ * - a / surrounded by zero or more spaces on either side
(\w+) - Group 2 capturing 1 or more word chars

R code with stringr:

> library(stringr)
> v <- c("MEDIUM /REGULAR INSEAM", "XX LARGE /SHORT INSEAM", "SMALL /32\" INSM", "X LARGE /30\" INSM")
> str_match(v, "(\\w+(?: \\w+)?) */ *(\\w+)")
     [,1]              [,2]       [,3]     
[1,] "MEDIUM /REGULAR" "MEDIUM"   "REGULAR"
[2,] "XX LARGE /SHORT" "XX LARGE" "SHORT"  
[3,] "SMALL /32"       "SMALL"    "32"     
[4,] "X LARGE /30"     "X LARGE"  "30"
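
If stringr is not available, roughly the same result can probably be obtained in base R with regexec() and regmatches(). This is only a sketch: the names pattern, res, size and inseam are made up for illustration, and it assumes every element of v matches the pattern (a non-matching element would yield an empty vector and be silently dropped by rbind).

```r
# Sample values from the question; the inner double quotes are escaped.
v <- c("MEDIUM /REGULAR INSEAM", "XX LARGE /SHORT INSEAM",
       "SMALL /32\" INSM", "X LARGE /30\" INSM")

# Same pattern as above; backslashes are doubled inside an R string literal.
pattern <- "(\\w+(?: \\w+)?) */ *(\\w+)"

# regexec() locates the full match and each capture group;
# regmatches() extracts the corresponding substrings as a list.
m <- regmatches(v, regexec(pattern, v, perl = TRUE))

# Stack the per-string results: column 1 is the full match,
# columns 2 and 3 are capture groups 1 and 2.
res <- do.call(rbind, m)
size   <- res[, 2]
inseam <- res[, 3]
```

Here size should hold the value before the / and inseam the first word or number after it, mirroring columns 2 and 3 of the str_match() output.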
