Colonel Beauvel

Reputation: 31171

Regex look ahead assertion

I need a regex expert for this problem. It is linked to an SO question I've lost track of, where the data are as follows:

x = c("IID:WE:G12D/V/A", "GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")

I need to split each element of x to isolate the end part, which consists of letter, slash, letter, ..., slash, letter. What I want is to obtain these two vectors as output:

o1 = c("IID:WE:G12", "GH:SQ:p.R172", "HH:WG:p.S122")
o2 = c("D/V/A", "W/G", "F/H")

I have this solution for o1:

gsub('[A-Z]/.+','',x)
#[1] "IID:WE:G12"   "GH:SQ:p.R172" "HH:WG:p.S122"

Good. For o2, I tried to use an assertion, specifically a look-ahead assertion:

gsub('.+(?=[A-Z]/.+)','',x, perl=T)
#[1] "V/A" "W/G" "F/H"

But this is not the result I want: the first element should be "D/V/A", not "V/A".

Any idea what is going wrong with the second regex?

Upvotes: 2

Views: 101

Answers (3)

Shenglin Chen

Reputation: 4554

Try this:

gsub('\\w\\/.*(\\/.*)?','',x)

Regex look ahead:

gsub('\\w(?=\\/).*','',x,perl=T)

gsub('.*\\d(?=\\w\\/)','',x, perl=T)  # for o2

Upvotes: 1

Cath

Reputation: 24074

The following, very near to what you came up with, will work:

gsub('[^/]+(?=[A-Z]/.+)','',x, perl=T)

(Your line didn't work because you were asking for "any character", which includes "/": the greedy .+ runs past the first slash and only backtracks as far as the last letter-slash pair.)
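To see the difference, it can help to look at what each pattern actually matches (and therefore removes) instead of what is left over. The sketch below is only an illustration, assuming the x from the question; the variable names greedy and no_slash are made up for this example.

```r
x <- c("IID:WE:G12D/V/A", "GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")

# Greedy '.+' first runs to the end of the string, then backtracks only until
# the lookahead succeeds, which happens at the LAST "letter/..." position.
greedy <- regmatches(x, regexpr('.+(?=[A-Z]/.+)', x, perl = TRUE))

# '[^/]+' cannot cross a slash, so it has to stop before the FIRST one.
no_slash <- regmatches(x, regexpr('[^/]+(?=[A-Z]/.+)', x, perl = TRUE))

print(greedy)
#[1] "IID:WE:G12D/" "GH:SQ:p.R172" "HH:WG:p.S122"
print(no_slash)
#[1] "IID:WE:G12"   "GH:SQ:p.R172" "HH:WG:p.S122"
```

Only the first element differs, because it is the only one with more than one slash.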

Upvotes: 3

Wiktor Stribiżew

Reputation: 626936

As a possible solution, you can use the following replacement:

gsub('.*?([^/](?:/[^/])+)$','\\1',x, perl=T)

Or (if there must be a letter):

gsub('.*?([A-Z](?:/[A-Z])+)$','\\1',x, perl=T)

See IDEONE demo

  • .*? - matches as few characters as possible (other than a newline), starting from the beginning of the string
  • ([^/](?:/[^/])+) - a capturing group matching:
    • [^/] - a character other than / (or, if you use [A-Z], an uppercase ASCII letter)
    • (?:/[^/])+ - 1 or more sequences of / followed by a character other than / (or, if you use [A-Z], an uppercase letter).
  • $ - end of string
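
The same idea can be extended to produce both vectors from one pattern, by also capturing the lazily matched prefix. This is a sketch rather than part of the answer above: the anchored two-group pattern and the names pattern, o1 and o2 are assumptions made for illustration, and it uses the [A-Z] variant.

```r
x <- c("IID:WE:G12D/V/A", "GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")

# Group 1: the lazy prefix; group 2: letter(/letter)+ anchored at the end.
pattern <- '^(.*?)([A-Z](?:/[A-Z])+)$'

o1 <- sub(pattern, '\\1', x, perl = TRUE)  # everything before the tail
o2 <- sub(pattern, '\\2', x, perl = TRUE)  # the letter/letter/... tail

print(o1)
#[1] "IID:WE:G12"   "GH:SQ:p.R172" "HH:WG:p.S122"
print(o2)
#[1] "D/V/A" "W/G"   "F/H"
```

Note that sub() returns a string unchanged when the pattern does not match, so inputs without a letter/letter tail would come back as-is in both o1 and o2.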

Upvotes: 3
