Jalan

Reputation: 87

grep variable pattern and output match and sequence position

Given the following string,

>Q07092
MWVSWAPGLWLLGLWATFGHGANTGAQCPPSQQEGLKLEHSSSLPANVTGFNLIHRLSLMKTSAIKKIRNPKGPLILRLGAAPVTQPTRRVFPRGLPEEFALVLTLLLKKHTHQKTWYLFQVTDANGYPQISLEVNSQERSLELRAQGQDGDFVSCIFPVPQLFDLRWHKLMLSVAGRVASVHVDCSSASSQPLGPRRPMRPVGHVFLGLDAEQGKPVSFDLQQVHIYCDPELVLEEGCCEILPAGCPPETSKARRDTQSNELIEINPQSEGKVYTRCFCLEEPQNSEVDAQLTGRISQKAERGAKVHQETAADECPPCVHGARDSNVTLAPSGPKGGKGERGLPGPPGSKGEKGARGNDCVRISPDAPLQCAEGPKGEKGESGALGPSGLPGSTGEKGQKGEKGDGGIKGVPGKPGRDGRPGEICVIGPKGQKGDPGFVGPEGLAGEPGPPGLPGPPGIGLPGTPGDPGGPPGPKGDKGSSGIPGKEGPGGKPGKPGVKGEKGDPCEVCPTLPEGFQNFVGLPGKPGPKGEPGDPVPARGDPGIQGIKGEKGEPCLSCSSVVGAQHLVSSTGASGDVGSPGFGLPGLPGRAGVPGLKGEKGNFGEAGPAGSPGPPGPVGPAGIKGAKGEPCEPCPALSNLQDGDVRVVALPGPSGEKGEPGPPGFGLPGKQGKAGERGLKGQKGDAGNPGDPGTPGTTGRPGLSGEPGVQGPAGPKGEKGDGCTACPSLQGTVTDMAGRPGQPGPKGEQGPEGVGRPGKPGQPGLPGVQGPPGLKGVQGEPGPPGRGVQGPQGEPGAPGLPGIQGLPGPRGPPGPTGEKGAQGSPGVKGATGPVGPPGASVSGPPGRDGQQGQTGLRGTPGEKGPRGEKGEPGECSCPSQGDLIFSGMPGAPGLWMGSSWQPGPQGPPGIPGPPGPPGVPGLQGVPGNNGLPGQPGLTAELGSLPIEQHLLKSICGDCVQGQRAHPGYLVEKGEKGDQGIPGVPGLDNCAQCFLSLERPRAEEARGDNSEGDPGCVGSPGLPGPPGLPGQRGEEGPPGMRGSPGPPGPIGPPGFPGAVGSPGLPGLQGERGLTGLTGDKGEPGPPGQPGYPGATGPPGLPGIKGERGYTGSAGEKGEPGPPGSEGLPGPPGPAGPRGERGPQGNSGEKGDQGFQGQPGFPGPPGPPGFPGKVGSPGPPGPQAEKGSEGIRGPSGLPGSPGPPGPPGIQGPAGLDGLDGKDGKPGLRGDPGPAGPPGLMGPPGFKGKTGHPGLPGPKGDCGKPGPPGSTGRPGAEGEPGAMGPQGRPGPPGHVGPPGPPGQPGPAGISAVGLKGDRGATGERGLAGLPGQPGPPGHPGPPGEPGTDGAAGKEGPPGKQGFYGPPGPKGDPGAAGQKGQAGEKGRAGMPGGPGKSGSMGPVGPPGPAGERGHPGAPGPSGSPGLPGVPGSMGDMVNYDEIKRFIRQEIIKMFDERMAYYTSRMQFPMEMAAAPGRPGPPGKDGAPGRPGAPGSPGLPGQIGREGRQGLPGVRGLPGTKGEKGDIGIGIAGENGLPGPPGPQGPPGYGKMGATGPMGQQGIPGIPGPPGPMGQPGKAGHCNPSDCFGAMPMEQQYPPMKTMKGPFG

First, I want to grep for a pattern matching six or more xGx repeats, where x is any character (any character other than G in the regex I use). This I can do easily:

grep -EIho -B1 '([^G]G[^G]){6,}' file

which outputs

>Q07092
KGERGLPGPPGSKGEKGARGN
EGPKGEKGESGALGPSGLPGSTGEKGQKGEKGD
IGPKGQKGDPGFVGPEGLAGEPGPPGLPGPPGI
PGPKGDKGSSGIPGKEGP
FGLPGLPGRAGVPGLKGEKGNFGEAGPAGSPGPPGPVGPAGIKGAKGE
FGLPGKQGKAGERGLKGQKGDAGNPGDPGTPGTTGRPGLSGEPGVQGPAGPKGEKGD
AGRPGQPGPKGEQGPEGV
PGKPGQPGLPGVQGPPGLKGVQGEPGPPGR
QGPQGEPGAPGLPGIQGLPGPRGPPGPTGEKGAQGSPGVKGATGPVGPPGA
SGPPGRDGQQGQTGLRGTPGEKGPRGEKGEPGE
PGPQGPPGIPGPPGPPGVPGLQGVPGNNGLPGQPGL
EGDPGCVGSPGLPGPPGLPGQRGEEGPPGMRGSPGPPGPIGPPGFPGAVGSPGLPGLQGERGLTGLTGDKGEPGPPGQPGYPGATGPPGLPGIKGERGYTGSAGEKGEPGPPGSEGLPGPPGPAGPRGERGPQGNSGEKGDQGFQGQPGFPGPPGPPGFPGKVGSPGPPGP
KGSEGIRGPSGLPGSPGPPGPPGIQGPAGLDGLDGKDGKPGLRGDPGPAGPPGLMGPPGFKGKTGHPGLPGPKGDCGKPGPPGSTGRPGAEGEPGAMGPQGRPGPPGHVGPPGPPGQPGPAGI
VGLKGDRGATGERGLAGLPGQPGPPGHPGPPGEPGTDGAAGKEGPPGKQGFYGPPGPKGDPGAAGQKGQAGEKGRAGM
PGKSGSMGPVGPPGPAGERGHPGAPGPSGSPGLPGVPGSMGD
PGRPGPPGKDGAPGRPGAPGSPGLPGQIGREGRQGLPGVRGLPGTKGEKGDIGI
AGENGLPGPPGPQGPPGY
MGATGPMGQQGIPGIPGPPGPMGQPGKAGH

Now, I want to find the character position of all G's when they occur in 'TGA' or 'SGA'. The character positions should be based on the input and NOT the output.

Expected output,

$ some-grep-awk-code
>Q07092
TGA: 573
SGA: 384

The awk solution,

awk -v str='TGA' '{ off=0; while (pos=index(substr($0,off+1),str)) { printf("%d: %d\n", NR, pos+off); off+=length(str)+pos } }' file

outputs TGA both at character position 25 and 573. However, I want to only identify the character position of G in SGA/TGA when they occur in the midst of six or more xGx repeats.

Really appreciate any help!

Upvotes: 1

Views: 240

Answers (3)

Fravadona

Reputation: 16950

Here's a basic awk solution:

  • Each sequence must span a single line
  • The resulting positions are relative to the start of the line and point at the first character of each SGA/TGA match (add 1 for the position of the G)

The algorithm first searches for the parts of the line that match ([^G]G[^G]){6,}, then searches for the occurrences of SGA and TGA in those parts. The implementation is a little tedious, as there's no offset option for the match() and index() functions of awk.

awk '
    BEGIN {
        regexp = "([^G]G[^G]){6,}"
        search["SGA"]
        search["TGA"]
    }
    /^>/ {
        print
        next
    }
    {
        i0 = 1
        s0 = $0
        while ( match( s0, regexp ) ) {
            head = substr(s0,RSTART,RLENGTH)
            tail = substr(s0,RSTART+RLENGTH)
            i0 += RSTART - 1
            for (s in search) {
                s1 = head
                i1 = i0
                while ( i = index(s1, s) ) {
                    s1 = substr(s1, i+1)
                    i1 += i
                    search[s] =  search[s] " " i1-1
                }
            }
            s0 = tail
            i0 += RLENGTH
        }
        for (s in search) {
            print s ":" search[s]
            search[s] = ""
        }
    }
'

Example with simplified sequences

Input:
>TEST1
SGA.G..G.TGATGA.G..G..G.SGA.....TGA.....SGA.....G..G.SGA.G..G..G.
>TEST2
.G..G.TGA.G..G.G.....G..G..G..G.SGA.G.

Output:
>TEST1
SGA: 1 25 54
TGA: 10 13
>TEST2
SGA: 33
TGA:

TODO
  • Parameterize the regex and the search strings: it's not difficult per se but the current code will run into an infinite loop when a search string is empty or when the regex allows 0-length matches; you'll need to prevent that from happening.
  • Allow multi-line sequences
  • Allow overlapping matches for the regex. Basically, it means looking for the next match at RSTART+1 of the previous iteration; that will generate a lot of duplicate results that you need to discard one way or another.
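
For the multi-line item, one possible approach (a minimal sketch, not part of the script above) is to join the wrapped sequence lines of each record into a single line first, then pipe the result into the script unchanged. The function name linearize_fasta is hypothetical, and the sketch assumes that every header line starts with > and that every other line is sequence data.

```shell
# Hypothetical helper: rewrite a FASTA stream so that each record is
# one header line followed by exactly one sequence line.
linearize_fasta() {
    awk '
        /^>/ {                          # a header starts a new record
            if (seq != "") print seq    # flush the previous sequence
            print
            seq = ""
            next
        }
        { seq = seq $0 }                # append a wrapped sequence line
        END { if (seq != "") print seq }
    '
}
```

It reads standard input, so it can be used as linearize_fasta < file | awk '...'. Positions reported afterwards are relative to the start of the joined sequence rather than to any of the original wrapped lines.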

Upvotes: 4

RavinderSingh13

Reputation: 133458

With your shown samples, please try the following awk code. It was written and tested in GNU awk and should work in any POSIX awk. You can pass any number of search strings to the code: put all the values to be searched for, separated by commas, into the awk variable named keyWords, and the code will look for each of them in the lines and print the index where it is found.

awk -v keyWords="SGA,TGA" '
BEGIN{
  num=split(keyWords,arr1,",")
  for(i=1;i<=num;i++){
     checkValues[arr1[i]]
  }
}
!/>/{
  start=diff=prev=""
  while(match($0,/(.G.){6,}/)){
     lineMatch=substr($0,RSTART,RLENGTH)
     start+=(RSTART>1?RSTART-1:RSTART)
     diff=(start-prev)
     for(key in checkValues){
       if(ind=index(lineMatch,key)){
          print substr(lineMatch,ind,length(key)),(RSTART?RSTART-1:1)+ind+start+diff
       }
       prev=start
     }
     $0=substr($0,RSTART+RLENGTH)
   }
}
'  Input_file

Output with shown samples will be as follows:

>Q07092
SGA: 384

Upvotes: 3

rowboat

Reputation: 428

You could match all occurrences of the regular expression [ST]GA and look at the wider substring surrounding each match to compare that window to (.G.){6}. Here is some code to do that:

$ awk '
/^>/ { label = $0 ORS; next }
{
    while (match(substr($0, pos + 1), /[ST]GA/)) {
        pos += RSTART
        if (len = RLENGTH) {
            wbeg = pos - 18 + len   # 18 is the length of .G..G..G..G..G..G.
            wlen = 2 * 18 - len + (wbeg < 1 ? wbeg - 1 : 0)
            wbeg = (wbeg < 1 ? 1 : wbeg)    # substr must start from at least 1
            window = substr($0, wbeg, wlen)
            if (window ~ /.G..G..G..G..G..G./) {
                str = substr($0, pos, len)
                print label str ":", pos + int(len / 2)
                label = ""
            }
            pos += len - 1
        }
        if (pos >= length($0)) {
            break
        }
    }
    pos = 0
}
' file
>Q07092
SGA: 384

The output only shows SGA: 384 because that is the only portion of the example input that meets the requirement:

I want to only identify the character position of G in SGA/TGA when they occur in the midst of six or more xGx repeats.

Upvotes: 4
