Reputation: 31171
I need a regex expert for this problem. It relates to an SO question I can no longer find, where the data are the following:
x = c("IID:WE:G12D/V/A", "GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
I need to split each element of x to isolate the trailing part, which consists of letter - slash - letter - ... - slash - letter.
What I want is to obtain these two vectors as output:
o1 = c("IID:WE:G12", "GH:SQ:p.R172", "HH:WG:p.S122")
o2 = c("D/V/A", "W/G", "F/H")
I have this solution for o1:
gsub('[A-Z]/.+','',x)
#[1] "IID:WE:G12" "GH:SQ:p.R172" "HH:WG:p.S122"
Good. For o2, I tried to use a look-ahead assertion:
gsub('.+(?=[A-Z]/.+)','',x, perl=T)
#[1] "V/A" "W/G" "F/H"
But this is not the wanted result!
Any idea what is going wrong with the second regex?
Upvotes: 2
Views: 101
Reputation: 4554
Try this for o1:
gsub('\\w\\/.*(\\/.*)?','',x)
With a regex look-ahead:
gsub('\\w(?=\\/).*','',x,perl=T) #For o1
gsub('.*\\d(?=\\w\\/)','',x, perl=T) #For o2
Upvotes: 1
Reputation: 24074
The following, very close to what you came up with, will work:
gsub('[^/]+(?=[A-Z]/.+)','',x, perl=T)
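(Your line didn't work because .+ asks for "any character", which includes "/", so the greedy match runs past the first slash and stops only before the last letter that is followed by a slash.)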
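To see the difference, it can help to look at what each pattern actually consumes before gsub deletes it. This is only a minimal sketch using the x from the question; the names greedy and bounded are illustrative, not part of the original code:

```r
x <- c("IID:WE:G12D/V/A", "GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")

# First match of each pattern, i.e. the text that gsub() would remove.
# '.+' may cross a slash; '[^/]+' has to stop before the first one.
greedy  <- regmatches(x, regexpr('.+(?=[A-Z]/.+)', x, perl = TRUE))
bounded <- regmatches(x, regexpr('[^/]+(?=[A-Z]/.+)', x, perl = TRUE))

greedy
# "IID:WE:G12D/" "GH:SQ:p.R172" "HH:WG:p.S122"
bounded
# "IID:WE:G12" "GH:SQ:p.R172" "HH:WG:p.S122"
```

Only the first element differs, because it is the only one with more than one slash in its tail.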
Upvotes: 3
Reputation: 626936
As a possible solution, you can use the following replacement:
gsub('.*?([^/](?:/[^/])+)$','\\1',x, perl=T)
Or (if there must be a letter):
gsub('.*?([A-Z](?:/[A-Z])+)$','\\1',x, perl=T)
See IDEONE demo
.*? - matches as few characters as possible (other than a newline) from the start
([^/](?:/[^/])+) - a capturing group matching:
    [^/] - a character other than / (or, if you use [A-Z], any English uppercase letter)
    (?:/[^/])+ - 1 or more sequences of / and a character other than / (or, if you use [A-Z], an uppercase letter)
$ - end of string
Upvotes: 3
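The same capture-group idea can, as a rough sketch, return both requested vectors from a single pattern. This assumes the trailing part is made of uppercase letters only, as in the sample data; the added first group and the name pattern are illustrative and not part of the answer above:

```r
x <- c("IID:WE:G12D/V/A", "GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")

# Group 1: lazy prefix. Group 2: letter(/letter)+ anchored at the end of the string.
pattern <- '^(.*?)([A-Z](?:/[A-Z])+)$'

o1 <- sub(pattern, '\\1', x, perl = TRUE)  # keep only the prefix
o2 <- sub(pattern, '\\2', x, perl = TRUE)  # keep only the letter/slash tail

o1
# "IID:WE:G12" "GH:SQ:p.R172" "HH:WG:p.S122"
o2
# "D/V/A" "W/G" "F/H"
```

Elements that do not end in such a tail would be returned unchanged by sub, so they may need a separate check.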