danbo_c

Reputation: 23

Parse last part of string with certain requirements in R

I want to subset a data frame based on the last part of a string, but my regex skills in R are lacking. Here is the problem I'm running into. I have a column that looks like this:

EM1234 > COMJ1234 > ADW1234
ADW1234 > COMJ1234 > EM1234
EM4321 > COMJ1234 > EM1234
COMJEM > ADW1234 > MSNK123
COMJ12 > ADW1234 > EMP1234

I only want to keep rows where the last part begins with EM but not EMP. I also run into a problem with the COMJEM at the start of the fourth row, since a regex that allows any character would include that row. Here is what I am currently using, which does not work:

sources <- data.frame(column = I(c('EM1234 > COMJ1234 > ADW1234',
                                   'ADW1234 > COMJ1234 > EM1234',
                                   'EM4321 > COMJ1234 > EM1234',
                                   'COMJEM > ADW1234 > MSNK123',
                                   'COMJ12 > ADW1234 > EMP1234')))
subset <- sources[grep("^'.+EM[[:alnum:]]{2,8}'$", sources$column),]

What is a better way to write this regex? The resulting subset should look like this:

ADW1234 > COMJ1234 > EM1234
EM4321 > COMJ1234 > EM1234

Upvotes: 2

Views: 54

Answers (2)

hwnd

Reputation: 70750

You can use a word boundary \b and match at the end of the string:

sources[grep('\\bEM[^P]\\S+$', sources$column),]
# [1] "ADW1234 > COMJ1234 > EM1234" "EM4321 > COMJ1234 > EM1234"
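To see which rows the pattern accepts and rejects, here is a small self-contained sketch on the sample strings from the question. The names `x`, `pat` and `hits` are only illustrative, and a plain character vector is assumed in place of the data frame column.

```r
# Sample strings from the question (plain vector; `x` is an illustrative name)
x <- c('EM1234 > COMJ1234 > ADW1234',
       'ADW1234 > COMJ1234 > EM1234',
       'EM4321 > COMJ1234 > EM1234',
       'COMJEM > ADW1234 > MSNK123',
       'COMJ12 > ADW1234 > EMP1234')

# \\b  : EM must start a word, so the EM inside COMJEM is skipped
# [^P] : the character right after EM must not be P, which rejects EMP1234
# \\S+$: the rest of the token must run to the end of the string
pat <- '\\bEM[^P]\\S+$'

hits <- grepl(pat, x)   # logical vector, one entry per string
x[hits]
```

Note that `[^P]` accepts any character other than P, so a last part such as EMX123 would also pass. If only digits should follow EM, a digit class is probably the safer choice.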

Upvotes: 1

Jthorpe

Reputation: 10203

You want to use \\d or [:digit:], since [:alnum:] matches all alphanumeric characters (i.e. [:alpha:] and [:digit:]), which lets the P in EMP through. Also, I think you want to drop the single quotes inside your pattern string, as in:

"^.+EM\\d{2,8}$"

or

"^.+EM[[:digit:]]{2,8}$"
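As a rough check of the difference between the two character classes, the sketch below runs both on the sample strings from the question. The names `x`, `with_alnum` and `with_digit` are illustrative, and a plain character vector is assumed in place of the data frame column.

```r
# Sample strings from the question (plain vector; `x` is an illustrative name)
x <- c('EM1234 > COMJ1234 > ADW1234',
       'ADW1234 > COMJ1234 > EM1234',
       'EM4321 > COMJ1234 > EM1234',
       'COMJEM > ADW1234 > MSNK123',
       'COMJ12 > ADW1234 > EMP1234')

# [[:alnum:]] accepts letters too, so the P in EMP1234 is allowed
with_alnum <- grepl("^.+EM[[:alnum:]]{2,8}$", x)

# [[:digit:]] requires digits straight after EM, so EMP1234 is rejected
with_digit <- grepl("^.+EM[[:digit:]]{2,8}$", x)

x[with_digit]
```

Neither pattern ties EM to the start of the last part, so a final token like COMJEM12 would still match. Putting \\b in front of EM is one way to guard against that.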

Upvotes: 0
