Reputation: 45
I am using regular expressions in R to extract strings from a variable. The variable contains distinct values that look like:
MEDIUM /REGULAR INSEAM
XX LARGE /SHORT INSEAM
SMALL /32" INSM
X LARGE /30" INSM
I have to capture two things: the value before the / as a whole (SMALL, XX LARGE) and the string (alphabetic or numeric) after it. I don't want the " INSM or the INSEAM part.
The regular expression I am using for the first two is ([A-Z]\w+) \/([A-Z]\w+) INSEAM and for the last two I am using ([A-Z]\w+) \/([0-9][0-9])[" INSM].
The part ([A-Z]\w+) only captures one word, so it works fine for MEDIUM and SMALL, but fails for X LARGE, XX LARGE etc. Is there a way I can modify it to capture two occurrences of a word before the / character? Or is there a better way to do it?
Thanks in advance!
Upvotes: 1
Views: 1567
Reputation: 28441
From your description, Wiktor's regex will fail on "XX LARGE/SHORT" due to the extra space. It is safer to capture everything up to the forward slash, together with the word that follows it, as a single group:
sub("^(.*/\\w+).*", "\\1", x)
#[1] "MEDIUM /REGULAR" "XX LARGE /SHORT" "SMALL /32" "X LARGE /30"
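If the two pieces are needed separately rather than as one string, the same idea can be split into two sub() calls. This is only a sketch in base R: the names x, size and inseam are made up for illustration, and it assumes each value contains exactly one slash.

```r
# Sample values copied from the question
x <- c("MEDIUM /REGULAR INSEAM", "XX LARGE /SHORT INSEAM",
       "SMALL /32\" INSM", "X LARGE /30\" INSM")

# Everything before the slash, without the spaces in front of it
# (perl = TRUE is needed for the lazy quantifier .*?)
size <- sub("^(.*?) */.*$", "\\1", x, perl = TRUE)

# The first run of word characters after the slash
inseam <- sub("^.*/ *(\\w+).*$", "\\1", x)

size
inseam
```

Note that sub() returns the input unchanged when the pattern does not match, so values without a slash would pass through as they are.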
Upvotes: 2
Reputation: 626794
It seems you can use
(\w+(?: \w+)?) */ *(\w+)
See the regex demo
Pattern details:
(\w+(?: \w+)?) - Group 1 capturing one or more word chars followed by an optional sequence of a space + one or more word chars
 */ * - a / enclosed with 0+ spaces
(\w+) - Group 2 capturing 1 or more word chars
R code with stringr:
> library(stringr)
> v <- c("MEDIUM /REGULAR INSEAM", "XX LARGE /SHORT INSEAM", "SMALL /32\" INSM", "X LARGE /30\" INSM")
> str_match(v, "(\\w+(?: \\w+)?) */ *(\\w+)")
     [,1]              [,2]       [,3]
[1,] "MEDIUM /REGULAR" "MEDIUM"   "REGULAR"
[2,] "XX LARGE /SHORT" "XX LARGE" "SHORT"
[3,] "SMALL /32"       "SMALL"    "32"
[4,] "X LARGE /30"     "X LARGE"  "30"
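If you would rather not depend on stringr, roughly the same result can be had in base R with regexec() and regmatches(). This is a sketch, not a drop-in replacement: the names m, mat and parts are hypothetical, and perl = TRUE in regexec() assumes R 3.3.0 or later.

```r
v <- c("MEDIUM /REGULAR INSEAM", "XX LARGE /SHORT INSEAM",
       "SMALL /32\" INSM", "X LARGE /30\" INSM")

# One character vector per input: full match, group 1, group 2
m <- regmatches(v, regexec("(\\w+(?: \\w+)?) */ *(\\w+)", v, perl = TRUE))

# Stack the per-input vectors into a matrix; inputs with no match
# yield character(0) and would be dropped by rbind
mat <- do.call(rbind, m)

# Keep only the two capture groups, with readable column names
parts <- data.frame(size = mat[, 2], inseam = mat[, 3],
                    stringsAsFactors = FALSE)
parts
```

Columns 2 and 3 of mat correspond to [,2] and [,3] in the str_match() output above.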
Upvotes: 1