Parse last part of string with certain requirements in R

Question

I want to subset a data frame based on the last part of a string, but my regex skills in R are lacking. Here is the problem I'm running into. I have a column that looks like this:

EM1234 > COMJ1234 > ADW1234
ADW1234 > COMJ1234 > EM1234
EM4321 > COMJ1234 > EM1234
COMJEM > ADW1234 > MSNK123
COMJ12 > ADW1234 > EMP1234

I only want to keep rows where the string ENDS with a part that starts with EM, but not EMP. The COMJEM at the start of the fourth row is a further problem, since a regex that allows any character before EM would include that example. Here is what I am currently using, which does not work:

sources <- data.frame(column = I(c('EM1234 > COMJ1234 > ADW1234',
                                   'ADW1234 > COMJ1234 > EM1234',
                                   'EM4321 > COMJ1234 > EM1234',
                                   'COMJEM > ADW1234 > MSNK123',
                                   'COMJ12 > ADW1234 > EMP1234')))
subset <- sources[grep("^'.+EM[[:alnum:]]{2,8}'$", sources$column),]

What is a better way to write this regex? The resulting subset should look like this:

ADW1234 > COMJ1234 > EM1234
EM4321 > COMJ1234 > EM1234

hwnd · Accepted Answer

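You can use a word boundary (\b) so that EM has to start a word, exclude EMP with [^P], and anchor the match at the end of the string with $. Note that backslashes must be doubled inside an R string:

sources[grep('\\bEM[^P]\\S+$', sources$column),]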
# [1] "ADW1234 > COMJ1234 > EM1234" "EM4321 > COMJ1234 > EM1234"
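
As a cross-check, the same filter can be sketched in two explicit steps: isolate the last part of each string, then test only that part. This is a minimal sketch rather than the answer's own method. The names `last_part`, `keep`, and `result` are hypothetical, and it assumes the parts are always separated by `>`.

```r
# Rebuild the example data from the question.
sources <- data.frame(
  column = c("EM1234 > COMJ1234 > ADW1234",
             "ADW1234 > COMJ1234 > EM1234",
             "EM4321 > COMJ1234 > EM1234",
             "COMJEM > ADW1234 > MSNK123",
             "COMJ12 > ADW1234 > EMP1234"),
  stringsAsFactors = FALSE
)

# Greedy ".*" runs up to the last ">", so only the final part remains.
last_part <- sub("^.*> *", "", sources$column)

# Keep parts that start with "EM" followed by any character other than "P".
keep <- grepl("^EM[^P]", last_part)

# drop = FALSE keeps a data frame even though there is only one column.
result <- sources[keep, , drop = FALSE]
print(result$column)
```

Like the one-line pattern, `EM[^P]` requires at least one character after EM, so a last part that is exactly "EM" would not be kept. Whether that is desired depends on the data.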
