Reputation: 23
I want to subset a data frame based on the last part of a string, but my regex skills in R are lacking. Here is the problem I'm running into. I have a column that looks like this:
EM1234 > COMJ1234 > ADW1234
ADW1234 > COMJ1234 > EM1234
EM4321 > COMJ1234 > EM1234
COMJEM > ADW1234 > MSNK123
COMJ12 > ADW1234 > EMP1234
I only want to keep rows whose string ENDS with an EM entry, not an EMP entry. I also run into a problem with the COMJEM in the fourth row, since a regex that allows any characters before EM would include that example. Here is what I am currently using, but it does not work:
sources <- data.frame(column = I(c('EM1234 > COMJ1234 > ADW1234',
'ADW1234 > COMJ1234 > EM1234',
'EM4321 > COMJ1234 > EM1234',
'COMJEM > ADW1234 > MSNK123',
'COMJ12 > ADW1234 > EMP1234')))
subset <- sources[grep("^'.+EM[[:alnum:]]{2,8}'$", sources$column),]
What is a better way to write this regex? The resulting subset should look like this:
ADW1234 > COMJ1234 > EM1234
EM4321 > COMJ1234 > EM1234
Upvotes: 2
Views: 54
Reputation: 70750
You can use a word boundary \b and match at the end of the string:
sources[grep('\\bEM[^P]\\S+$', sources$column),]
# [1] "ADW1234 > COMJ1234 > EM1234" "EM4321 > COMJ1234 > EM1234"
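For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same pattern. It uses grepl() and a plain character column instead of I(...), which is only a convenience choice; the names keep and result are made up for the example.

```r
# Sample data from the question (plain character column for simplicity)
sources <- data.frame(
  column = c("EM1234 > COMJ1234 > ADW1234",
             "ADW1234 > COMJ1234 > EM1234",
             "EM4321 > COMJ1234 > EM1234",
             "COMJEM > ADW1234 > MSNK123",
             "COMJ12 > ADW1234 > EMP1234"),
  stringsAsFactors = FALSE
)

# \\b   : EM must start a token, so the EM inside COMJEM is skipped
# [^P]  : the character right after EM must not be P, which rules out EMP...
# \\S+$ : the rest of the token must run to the end of the string
keep <- grepl("\\bEM[^P]\\S+$", sources$column)

# drop = FALSE keeps the result a data frame even with a single column
result <- sources[keep, , drop = FALSE]
print(result)
```

grepl() returns one logical value per row, which makes it easy to check which rows matched before subsetting.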
Upvotes: 1
Reputation: 10203
You want to use \\d or [:digit:], since [:alnum:] matches all alphanumeric characters (i.e. [:alpha:] and [:digit:]). Also I think you want to drop the single quotes in your string, as in:
"^.+EM\\d{2,8}$"
or
"^.+EM[[:digit:]]{2,8}$"
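As a quick check, the sketch below applies both patterns to the sample data from the question. The edge_cases vector and the bounded_pattern variant (which borrows \\b from the other answer) are not from the question or this answer; they are assumptions added to show where the patterns differ.

```r
# Sample data from the question (plain character column for simplicity)
sources <- data.frame(
  column = c("EM1234 > COMJ1234 > ADW1234",
             "ADW1234 > COMJ1234 > EM1234",
             "EM4321 > COMJ1234 > EM1234",
             "COMJEM > ADW1234 > MSNK123",
             "COMJ12 > ADW1234 > EMP1234"),
  stringsAsFactors = FALSE
)

# The two equivalent patterns from the answer: EM followed only by digits
answer_pattern <- "^.+EM\\d{2,8}$"
posix_pattern  <- "^.+EM[[:digit:]]{2,8}$"

digit_hits <- grepl(answer_pattern, sources$column)
posix_hits <- grepl(posix_pattern, sources$column)
print(sources[digit_hits, , drop = FALSE])

# Hypothetical inputs (not in the question) that show two limits of "^.+":
# it needs at least one character before EM, and it accepts EM inside a token.
edge_cases <- c("EM1234", "ADW1234 > COMJEM1234")
print(grepl(answer_pattern, edge_cases))

# Assumed variant: replace "^.+" with a word boundary to handle both cases
bounded_pattern <- "\\bEM\\d{2,8}$"
print(grepl(bounded_pattern, edge_cases))
```

On the question's data both patterns select the second and third rows. The bounded variant only matters if the real data can contain a lone EM entry or a token that merely ends in EM plus digits.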
Upvotes: 0