UOVONUOVO

Reputation: 33

Find nested substrings that match a pattern in a string

I need to find all the substrings of the string 'DGHDAGRTDRPDRMGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*AVR*GQRRDVTTEFIHLLRCLDLSSFACMCAPARH*SRSLLIYSPKRLRNIASHRSYGIVCTSG*CTWINV*QIS*FATH*SKCIAPNLSHADKPRSLVLTPTTLRFSKPAYRRPLIREAMDLWIRASICWGMGLLN*KDWP*ESGYAYYVCELESGLRLMNPDARGFSRV*HVCSSA*LTWPSPFPEQAFLLRFTEPRHKLLYV*D*VNACLVRSSASASIM' that start with the character M and end with the character *.

I tried to use str_extract_all() and stri_extract_all() but I can't get the result I want:

aa <- 'DGHDAGRTDRPDRMGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*AVR*GQRRDVTTEFIHLLRCLDLSSFACMCAPARH*SRSLLIYSPKRLRNIASHRSYGIVCTSG*CTWINV*QIS*FATH*SKCIAPNLSHADKPRSLVLTPTTLRFSKPAYRRPLIREAMDLWIRASICWGMGLLN*KDWP*ESGYAYYVCELESGLRLMNPDARGFSRV*HVCSSA*LTWPSPFPEQAFLLRFTEPRHKLLYV*D*VNACLVRSSASASIM'

str_extract_all(aa, 'M.*\\*')[[1]]
[1] "MGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*AVR*GQRRDVTTEFIHLLRCLDLSSFACMCAPARH*SRSLLIYSPKRLRNIASHRSYGIVCTSG*CTWINV*QIS*FATH*SKCIAPNLSHADKPRSLVLTPTTLRFSKPAYRRPLIREAMDLWIRASICWGMGLLN*KDWP*ESGYAYYVCELESGLRLMNPDARGFSRV*HVCSSA*LTWPSPFPEQAFLLRFTEPRHKLLYV*D*"

stri_extract_all(aa, regex = ('M.*/*'))[[1]]
[1] "MGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*AVR*GQRRDVTTEFIHLLRCLDLSSFACMCAPARH*SRSLLIYSPKRLRNIASHRSYGIVCTSG*CTWINV*QIS*FATH*SKCIAPNLSHADKPRSLVLTPTTLRFSKPAYRRPLIREAMDLWIRASICWGMGLLN*KDWP*ESGYAYYVCELESGLRLMNPDARGFSRV*HVCSSA*LTWPSPFPEQAFLLRFTEPRHKLLYV*D*VNACLVRSSASASIM"

But I get a single substring that starts with the first M and ends with either the last * or the last character of aa. What I would like instead is every such substring, even if one is nested within another:

MDDDVLPLISLFWTFGRGDVPRRY*
MCAPARH*
MDLWIRASICWGMGLLN*
MGLLN*
MNPDARGFSRV*

Here is some info on my software versions:

I'm sorry if I used the wrong lingo, I'm still new to programming.

Thank you for all your help!

Upvotes: 3

Views: 176

Answers (3)

Chris

Reputation: 3986

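You can use [^\\*]* to match anything except the asterisk. Since you want all matches, including overlapping ones, we can wrap the pattern in a lookahead: a lookahead consumes no characters, so the search resumes at the very next character and can find matches nested inside an earlier one. This doesn't seem to be supported with stringr but works with stringi::stri_match_all_regex():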

library(stringi)

stri_match_all_regex(aa, '(?=(M[^\\*]*\\*))')[[1]][,2]

# [1] "MGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*"
# [2] "MDDDVLPLISLFWTFGRGDVPRRY*"                                    
# [3] "MCAPARH*"                                                     
# [4] "MDLWIRASICWGMGLLN*"                                           
# [5] "MGLLN*"                                                       
# [6] "MNPDARGFSRV*"
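For readers without stringi, the same lookahead idea can be sketched in base R. This is only an illustration under stated assumptions: the toy string x is made up (it is not the sequence from the question), and the sketch assumes at least one match exists. It relies on gregexpr() with perl = TRUE, which stores the position and length of each captured group in the "capture.start" and "capture.length" attributes.

```r
# Toy input (hypothetical, not the sequence from the question).
x <- "AMBMC*DMX*EM"

# Zero-width lookahead: the match itself is empty, so the search can resume
# at the next character; the capture group holds the actual substring.
m <- gregexpr("(?=(M[^*]*\\*))", x, perl = TRUE)[[1]]

# With perl = TRUE, gregexpr() records where each captured group starts
# and how long it is (one column per group; there is only one group here).
starts <- attr(m, "capture.start")[, 1]
lens   <- attr(m, "capture.length")[, 1]

# Cut the captured substrings out of the original string.
overlapping <- substring(x, starts, starts + lens - 1)
print(overlapping)
```

The final M in the toy string has no * after it, so it should produce no match, which mirrors the trailing M in aa.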

Upvotes: 1

dww

Reputation: 31452

The need to find all nested substrings suggests that recursion may be the simplest way:

First remove everything after the final * (since the strings we search must be delimited by a final * according to the question).

x = sub("[^*]*$", "", aa)  # drop the trailing run of non-* characters

Now let's split this at every *

y = unlist(strsplit(x, '*', fixed = TRUE))

and keep only the strings that contain at least one M

y = grep('M', y, value = TRUE)

Now we use a recursive function to get all the substrings

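find.M = function(z){
  z = sub('^[^M]*', '', z)          # trim each piece so it starts at its first M
  zz = grep('.M', z, value = TRUE)  # pieces that still contain a later M
  if (length(zz)) {
    # drop the leading M and recurse to pick up the nested substrings
    c(z, find.M(substring(zz, 2)))
  }
  else z
}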

find.M(y)
# [1] "MGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY"
# [2] "MCAPARH"                                                     
# [3] "MDLWIRASICWGMGLLN"                                           
# [4] "MNPDARGFSRV"                                                 
# [5] "MDDDVLPLISLFWTFGRGDVPRRY"                                    
# [6] "MGLLN" 
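Note that the results above no longer carry the trailing *; something like paste0(find.M(y), '*') would put it back. As a cross-check, here is a non-recursive sketch that works from character positions instead. It is a different technique from the recursion above, not a rewrite of it. The toy string x is hypothetical, and the sketch assumes the input contains at least one M.

```r
# Toy input (hypothetical, not the sequence from the question).
x <- "AMBMC*DMX*EM"

# Positions of every M and every * in the string.
m_pos    <- gregexpr("M", x, fixed = TRUE)[[1]]
star_pos <- gregexpr("*", x, fixed = TRUE)[[1]]

# For each M, find the first * that comes after it (NA when there is none).
end_pos <- vapply(m_pos, function(p) star_pos[star_pos > p][1], integer(1))

# Keep only the Ms that are followed by a *, then cut out the substrings.
keep <- !is.na(end_pos)
by_position <- substring(x, m_pos[keep], end_pos[keep])
print(by_position)
```

Because each M is paired with the nearest following *, nested substrings fall out naturally, and they come back in order of their starting position rather than grouped by nesting depth.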

Upvotes: 3

NelsonGon

Reputation: 13309

EDIT: This does not produce exactly the desired output, but I thought I would share it (since I also spent some time on it):

library(stringi)
result<-unlist(strsplit(aa,".(?=M.*)",perl = TRUE))
res<-unlist(stri_split(unlist(result),regex="[A-Z](?<=\\*[A-Z]|(?<=\\M[A-Z]))"))
res1<-res[grep("^M",unlist(res))]
res1[stri_endswith(res1,charclass = "[*|W]")]
#[1] "MDDDVLPLISLFWTFGRGDVPRRY*" "MCAPARH*"                  "MDLWIRASICW"              
#[4] "MGLLN*"                    "MNPDARGFSRV*"

ORIGINAL:

We can use the following (note that it drops the * at the end of each match):

aa<-'DGHDAGRTDRPDRMGIEGTRNELPVAYHYNRTLSSNAEPLVESYLTHVLMDDDVLPLISLFWTFGRGDVPRRY*AVR*GQRRDVTTEFIHLLRCLDLSSFACMCAPARH*SRSLLIYSPKRLRNIASHRSYGIVCTSG*CTWINV*QIS*FATH*SKCIAPNLSHADKPRSLVLTPTTLRFSKPAYRRPLIREAMDLWIRASICWGMGLLN*KDWP*ESGYAYYVCELESGLRLMNPDARGFSRV*HVCSSA*LTWPSPFPEQAFLLRFTEPRHKLLYV*D*VNACLVRSSASASIM'
aa
res1<-unlist(strsplit(aa,".(?=M)",perl = TRUE))
res2<-unlist(strsplit(res1[grep("\\*{1,}",res1)],"\\*"))
res2[grep("^M",res2)]

Result:

# [1] "MDDDVLPLISLFWTFGRGDVPRRY" "MCAPARH"                  "MGLLN"
# [4] "MNPDARGFSRV"

Upvotes: 1
