gagandeep91

Reputation: 45

Regular expression: matching multiple words

I am using regular expressions in R to extract strings from a variable. The variable contains distinct values that look like:

MEDIUM /REGULAR INSEAM

XX LARGE /SHORT INSEAM

SMALL /32" INSM

X LARGE /30" INSM

I have to capture two things: the whole value before the / (e.g. SMALL, XX LARGE) and the string (alphabetic or numeric) after it. I don't want the " INSM or the INSEAM part.

The regular expression I am using for the first two is ([A-Z]\w+) \/([A-Z]\w+) INSEAM, and for the last two I am using ([A-Z]\w+) \/([0-9][0-9])[" INSM]. The part ([A-Z]\w+) captures only one word, so it works fine for MEDIUM and SMALL but fails for X LARGE, XX LARGE, etc. Is there a way to modify it to capture two occurrences of a word before the / character? Or is there a better way to do it?

Thanks in advance!

Upvotes: 1

Views: 1567

Answers (2)

Pierre L

Reputation: 28441

From your description, Wiktor's regex will fail on a value such as "XX  LARGE /SHORT" because of the extra space between the words. It is safer to capture everything up to the forward slash, plus the word characters that follow it, as a single group:

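# x is assumed here, inferred from the output below (note the double space in the second value)
x <- c("MEDIUM /REGULAR INSEAM", "XX  LARGE /SHORT INSEAM", "SMALL /32\" INSM", "X LARGE /30\" INSM")
sub("^(.*/\\w+).*", "\\1", x)
#[1] "MEDIUM /REGULAR"  "XX  LARGE /SHORT" "SMALL /32" "X LARGE /30"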
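
If the two parts are needed as separate values rather than one string, the same idea can be split into two sub() calls. This is only a sketch: the names size and inseam are made up for illustration, the input vector is assumed (with a double space in the second value), and trimws() needs R 3.2.0 or later.

```r
# Assumed input; the second value has a double space on purpose
x <- c("MEDIUM /REGULAR INSEAM", "XX  LARGE /SHORT INSEAM",
       "SMALL /32\" INSM", "X LARGE /30\" INSM")

# Part before the slash: drop the slash and everything after it,
# trim the ends, then collapse runs of spaces to a single space
size <- gsub(" +", " ", trimws(sub("/.*$", "", x)))

# Part after the slash: keep only the first run of word characters
inseam <- sub("^[^/]*/ *(\\w+).*$", "\\1", x)

size
inseam
```

Because nothing is assumed about how many words come before the slash, this should also cope with values like XXX LARGE, provided every value contains exactly one slash.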

Upvotes: 2

Wiktor Stribiżew

Reputation: 626794

It seems you can use

(\w+(?: \w+)?) */ *(\w+)

See the regex demo

Pattern details:

  • (\w+(?: \w+)?) - Group 1, capturing one or more word chars followed by an optional sequence of a space + one or more word chars
  • */ * - a / surrounded by zero or more spaces
  • (\w+) - Group 2, capturing one or more word chars

R code with stringr:

> library(stringr)
> v <- c("MEDIUM /REGULAR INSEAM", "XX LARGE /SHORT INSEAM", "SMALL /32\" INSM", "X LARGE /30\" INSM")
> str_match(v, "(\\w+(?: \\w+)?) */ *(\\w+)")
     [,1]              [,2]       [,3]     
[1,] "MEDIUM /REGULAR" "MEDIUM"   "REGULAR"
[2,] "XX LARGE /SHORT" "XX LARGE" "SHORT"  
[3,] "SMALL /32"       "SMALL"    "32"     
[4,] "X LARGE /30"     "X LARGE"  "30"     
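
If stringr is not available, or if the values may contain extra spaces between words, a base R variant along these lines might work. It is a sketch under assumptions: the pattern is a loosened version of the one above (any number of words, one or more spaces between them), the input vector and the names parts, size and inseam are illustrative, and the perl argument of regexec() requires R 3.3.0 or later.

```r
# Assumed input; the second value has a double space on purpose
v <- c("MEDIUM /REGULAR INSEAM", "XX  LARGE /SHORT INSEAM",
       "SMALL /32\" INSM", "X LARGE /30\" INSM")

# Group 1: one or more words separated by one or more spaces
# Group 2: the word characters right after the slash
pattern <- "(\\w+(?: +\\w+)*) */ *(\\w+)"

# regexec() + regmatches() give one vector per input: full match, group 1, group 2
m <- regmatches(v, regexec(pattern, v, perl = TRUE))
parts <- do.call(rbind, m)  # assumes every element of v matched

size <- gsub(" +", " ", parts[, 2])  # collapse repeated spaces
inseam <- parts[, 3]

size
inseam
```

Note that do.call(rbind, m) only gives a clean matrix when every value matches; a non-matching value yields character(0) and would need separate handling.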

Upvotes: 1
