pdubois

Reputation: 7800

How to make an R regex capture special characters (e.g. dot (.) and underscore (_))?

I have three strings:

x <- "PB0038.1_Jundm2_1/Jaspar.instid_chr1:183286850-183287250.bin1"
y <- "Ddit3::Cebpa/MA0019.1/Jaspar.instid_chr1:183286845-183287245.bin22"
z <- "Arid3a/MA0151.1/Jaspar.instid_chr1:183286849-183287249.bin10"

The regex

^(.*?)\\/.*?\\/.*?\\.instid_(.*?)\\.bin(\\d+)

works fine for strings y and z, but not for x.

> stringr::str_match(y,"^(.*?)\\/.*?\\/.*?\\.instid_(.*?)\\.bin(\\d+)")[,c(2,3,4)]
[1] "Ddit3::Cebpa"             "chr1:183286845-183287245" "22"                      
> stringr::str_match(z,"^(.*?)\\/.*?\\/.*?\\.instid_(.*?)\\.bin(\\d+)")[,c(2,3,4)]
[1] "Arid3a"                   "chr1:183286849-183287249" "10"  
> stringr::str_match(x,"^(.*?)\\/.*?\\/.*?\\.instid_(.*?)\\.bin(\\d+)")[,c(2,3,4)]
[1] NA NA NA

How can I modify it?

The desired end result for x is

"PB0038.1_Jundm2_1" "chr1:183286850-183287250" "1"

Upvotes: 1

Views: 39

Answers (1)

Tim Biegeleisen

Reputation: 522161

Your x input does not and should not match, because it only has one forward slash but your pattern expects two. If you want to allow either one or two forward slashes then one possible modification to your pattern is the following:

stringr::str_match(x, "^(.*?)\\/.*?\\.instid_(.*?)\\.bin(\\d+)")[,c(2,3,4)]

You might find the above pattern acceptable because the first capture group takes only what comes before the first slash. The other two groups capture what follows the .instid_ token and the digits after .bin at the very end, so none of the three captures depends on the number of slashes in the path.
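
As a quick check, here is a minimal sketch that counts the slashes in each input and runs the single-slash pattern against all three strings. It uses base R (regexec/regmatches with perl = TRUE) rather than stringr so that it runs without extra packages; the helper name extract_fields is made up for illustration.

```r
# The three inputs from the question.
x <- "PB0038.1_Jundm2_1/Jaspar.instid_chr1:183286850-183287250.bin1"
y <- "Ddit3::Cebpa/MA0019.1/Jaspar.instid_chr1:183286845-183287245.bin22"
z <- "Arid3a/MA0151.1/Jaspar.instid_chr1:183286849-183287249.bin10"
inputs <- list(x = x, y = y, z = z)

# Count the literal "/" characters in each string: x has one, y and z have two.
slash_counts <- sapply(inputs, function(s) {
  lengths(regmatches(s, gregexpr("/", s, fixed = TRUE)))
})
print(slash_counts)

# Only one literal slash is required. The lazy .*? after it absorbs any
# further path segments (extra slashes included) up to ".instid_".
pattern <- "^(.*?)/.*?\\.instid_(.*?)\\.bin(\\d+)"

# extract_fields is a hypothetical helper, not part of any package.
extract_fields <- function(s) {
  m <- regmatches(s, regexec(pattern, s, perl = TRUE))[[1]]
  m[-1]  # drop the full match, keep the three capture groups
}

res <- lapply(inputs, extract_fields)
print(res)
```

The forward slash does not need escaping in an R regex, so the sketch writes it as a plain /; the escaped form \\/ used above should behave the same.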

Upvotes: 2
