stas g

Reputation: 1523

Extracting multiple overlapping substrings

I have strings of amino acids like this:

x <- "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"

and I would like to extract all non-overlapping substrings that start with M and end with *. For the example above, I would need:

#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"

as the output. Predictably, regexpr gives me the greedy solution:

  regmatches(x, regexpr("M.+\\*", x))
 #[1] "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"

I have also tried the suggestions here, since that question resembles my problem most closely (though not exactly), but to no avail.

Any help would be appreciated.

Upvotes: 0

Views: 111

Answers (3)

Pierre L

Reputation: 28461

I will add an option for capture of non-overlapping patterns as you requested. We have to check that another pattern hasn't begun within our match:

regmatches(x, gregexpr("M[^M]+?\\*", x))[[1]]
#[1] "MEALYRAQVLVDLT*"               
#[2] "MQLPSSFAALAAQFDQL*"            
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
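
To illustrate how this differs from a plain lazy match, here is a small sketch on a made-up string (y is hypothetical, not from the question) in which a second M appears before the first *. The lazy dot keeps the inner M, while excluding M from the character class makes the match restart at the last M before the *:

```r
# Hypothetical input: a second M occurs before the first "*".
y <- "MAAMBB*CC"

# Lazy dot: the match starts at the first M and runs through the inner M.
lazy <- regmatches(y, gregexpr("M.+?\\*", y))[[1]]

# Excluding M: the match can only start at the last M before the "*".
no_inner_m <- regmatches(y, gregexpr("M[^M]+?\\*", y))[[1]]

print(lazy)        # "MAAMBB*"
print(no_inner_m)  # "MBB*"
```

On the string from the question both patterns give the same three matches, because none of them contains an inner M.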

Upvotes: 3

nrussell

Reputation: 18612

Use a non-greedy .+? instead of .+, and switch to gregexpr for multiple matches:

R> regmatches(x, gregexpr("M.+?\\*", x))[[1]]
#"MEALYRAQVLVDLT*"                
#"MQLPSSFAALAAQFDQL*"             
#"MFSLLVASVFTPCSALPFWSIKFTLFILS*"
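
Since the question mentions several strings, here is a minimal sketch of the same call on a character vector. The vector xs is invented for illustration; the point is that gregexpr with regmatches returns a list with one element per input string, and a string without a match yields character(0):

```r
# Hypothetical inputs: one match, two matches, and no match.
xs <- c("MAB*CD", "QQMX*MY*", "NOSTART*")

# One list element per input string.
res <- regmatches(xs, gregexpr("M.+?\\*", xs))

print(res[[1]])  # "MAB*"
print(res[[2]])  # "MX*" "MY*"
print(res[[3]])  # character(0)
```

Calling lengths(res) gives the number of matches per string, if that is needed.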

Upvotes: 3

vks

Reputation: 67988

M[^*]+\\*

Use a negated character class. See the demo. Also use the perl = TRUE option.

https://regex101.com/r/tD0dU9/6
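
As a sketch of how this pattern could be plugged into the call from the question (assuming gregexpr and regmatches, as in the other answers), the negated class [^*] stops each match at the first *, so no lazy quantifier is needed:

```r
# The string from the question.
x <- "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"

# M, then one or more characters that are not "*", then a literal "*".
hits <- regmatches(x, gregexpr("M[^*]+\\*", x, perl = TRUE))[[1]]

print(hits)
# "MEALYRAQVLVDLT*" "MQLPSSFAALAAQFDQL*" "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
```

Like the lazy .+? version, this pattern allows an M inside a match; it only guarantees that a match contains no * before its end.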

Upvotes: 1
