Reputation: 87
Given the following string,
>Q07092
MWVSWAPGLWLLGLWATFGHGANTGAQCPPSQQEGLKLEHSSSLPANVTGFNLIHRLSLMKTSAIKKIRNPKGPLILRLGAAPVTQPTRRVFPRGLPEEFALVLTLLLKKHTHQKTWYLFQVTDANGYPQISLEVNSQERSLELRAQGQDGDFVSCIFPVPQLFDLRWHKLMLSVAGRVASVHVDCSSASSQPLGPRRPMRPVGHVFLGLDAEQGKPVSFDLQQVHIYCDPELVLEEGCCEILPAGCPPETSKARRDTQSNELIEINPQSEGKVYTRCFCLEEPQNSEVDAQLTGRISQKAERGAKVHQETAADECPPCVHGARDSNVTLAPSGPKGGKGERGLPGPPGSKGEKGARGNDCVRISPDAPLQCAEGPKGEKGESGALGPSGLPGSTGEKGQKGEKGDGGIKGVPGKPGRDGRPGEICVIGPKGQKGDPGFVGPEGLAGEPGPPGLPGPPGIGLPGTPGDPGGPPGPKGDKGSSGIPGKEGPGGKPGKPGVKGEKGDPCEVCPTLPEGFQNFVGLPGKPGPKGEPGDPVPARGDPGIQGIKGEKGEPCLSCSSVVGAQHLVSSTGASGDVGSPGFGLPGLPGRAGVPGLKGEKGNFGEAGPAGSPGPPGPVGPAGIKGAKGEPCEPCPALSNLQDGDVRVVALPGPSGEKGEPGPPGFGLPGKQGKAGERGLKGQKGDAGNPGDPGTPGTTGRPGLSGEPGVQGPAGPKGEKGDGCTACPSLQGTVTDMAGRPGQPGPKGEQGPEGVGRPGKPGQPGLPGVQGPPGLKGVQGEPGPPGRGVQGPQGEPGAPGLPGIQGLPGPRGPPGPTGEKGAQGSPGVKGATGPVGPPGASVSGPPGRDGQQGQTGLRGTPGEKGPRGEKGEPGECSCPSQGDLIFSGMPGAPGLWMGSSWQPGPQGPPGIPGPPGPPGVPGLQGVPGNNGLPGQPGLTAELGSLPIEQHLLKSICGDCVQGQRAHPGYLVEKGEKGDQGIPGVPGLDNCAQCFLSLERPRAEEARGDNSEGDPGCVGSPGLPGPPGLPGQRGEEGPPGMRGSPGPPGPIGPPGFPGAVGSPGLPGLQGERGLTGLTGDKGEPGPPGQPGYPGATGPPGLPGIKGERGYTGSAGEKGEPGPPGSEGLPGPPGPAGPRGERGPQGNSGEKGDQGFQGQPGFPGPPGPPGFPGKVGSPGPPGPQAEKGSEGIRGPSGLPGSPGPPGPPGIQGPAGLDGLDGKDGKPGLRGDPGPAGPPGLMGPPGFKGKTGHPGLPGPKGDCGKPGPPGSTGRPGAEGEPGAMGPQGRPGPPGHVGPPGPPGQPGPAGISAVGLKGDRGATGERGLAGLPGQPGPPGHPGPPGEPGTDGAAGKEGPPGKQGFYGPPGPKGDPGAAGQKGQAGEKGRAGMPGGPGKSGSMGPVGPPGPAGERGHPGAPGPSGSPGLPGVPGSMGDMVNYDEIKRFIRQEIIKMFDERMAYYTSRMQFPMEMAAAPGRPGPPGKDGAPGRPGAPGSPGLPGQIGREGRQGLPGVRGLPGTKGEKGDIGIGIAGENGLPGPPGPQGPPGYGKMGATGPMGQQGIPGIPGPPGPMGQPGKAGHCNPSDCFGAMPMEQQYPPMKTMKGPFG
I first want to grep for runs of six or more consecutive xGx repeats, where x is any character. This I can easily do:
grep -EIho -B1 '([^G]G[^G]){6,}' file
which outputs
>Q07092
KGERGLPGPPGSKGEKGARGN
EGPKGEKGESGALGPSGLPGSTGEKGQKGEKGD
IGPKGQKGDPGFVGPEGLAGEPGPPGLPGPPGI
PGPKGDKGSSGIPGKEGP
FGLPGLPGRAGVPGLKGEKGNFGEAGPAGSPGPPGPVGPAGIKGAKGE
FGLPGKQGKAGERGLKGQKGDAGNPGDPGTPGTTGRPGLSGEPGVQGPAGPKGEKGD
AGRPGQPGPKGEQGPEGV
PGKPGQPGLPGVQGPPGLKGVQGEPGPPGR
QGPQGEPGAPGLPGIQGLPGPRGPPGPTGEKGAQGSPGVKGATGPVGPPGA
SGPPGRDGQQGQTGLRGTPGEKGPRGEKGEPGE
PGPQGPPGIPGPPGPPGVPGLQGVPGNNGLPGQPGL
EGDPGCVGSPGLPGPPGLPGQRGEEGPPGMRGSPGPPGPIGPPGFPGAVGSPGLPGLQGERGLTGLTGDKGEPGPPGQPGYPGATGPPGLPGIKGERGYTGSAGEKGEPGPPGSEGLPGPPGPAGPRGERGPQGNSGEKGDQGFQGQPGFPGPPGPPGFPGKVGSPGPPGP
KGSEGIRGPSGLPGSPGPPGPPGIQGPAGLDGLDGKDGKPGLRGDPGPAGPPGLMGPPGFKGKTGHPGLPGPKGDCGKPGPPGSTGRPGAEGEPGAMGPQGRPGPPGHVGPPGPPGQPGPAGI
VGLKGDRGATGERGLAGLPGQPGPPGHPGPPGEPGTDGAAGKEGPPGKQGFYGPPGPKGDPGAAGQKGQAGEKGRAGM
PGKSGSMGPVGPPGPAGERGHPGAPGPSGSPGLPGVPGSMGD
PGRPGPPGKDGAPGRPGAPGSPGLPGQIGREGRQGLPGVRGLPGTKGEKGDIGI
AGENGLPGPPGPQGPPGY
MGATGPMGQQGIPGIPGPPGPMGQPGKAGH
Now, I want to find the character position of all G's when they occur in 'TGA' or 'SGA'. The character positions should be based on the input and NOT the output.
Expected output,
$ some-grep-awk-code
>Q07092
TGA: 573
SGA: 384
The awk solution,
awk -v str='TGA' '{ off=0; while (pos=index(substr($0,off+1),str)) { printf("%d: %d\n", NR, pos+off); off+=length(str)+pos } }' file
outputs TGA both at character position 25 and 573. However, I want to only identify the character position of G in SGA/TGA when they occur in the midst of six or more xGx repeats.
Really appreciate any help!
Upvotes: 1
Views: 240
Reputation: 16950
Here's a basic awk solution:
The algorithm first finds the parts of the line that match ([^G]G[^G]){6,}, then searches those parts for occurrences of SGA and TGA. The implementation is a little tedious, as there is no offset option for the match() and index() functions of awk.
awk '
    BEGIN {
        regexp = "([^G]G[^G]){6,}"
        search["SGA"]
        search["TGA"]
    }
    /^>/ {
        print
        next
    }
    {
        i0 = 1      # absolute position of the first character of s0
        s0 = $0     # part of the line that is still to be searched
        while ( match( s0, regexp ) ) {
            head = substr(s0,RSTART,RLENGTH)
            tail = substr(s0,RSTART+RLENGTH)
            i0 += RSTART - 1    # absolute position where head starts
            for (s in search) {
                s1 = head
                i1 = i0
                while ( i = index(s1, s) ) {
                    s1 = substr(s1, i+1)
                    i1 += i
                    # i1-1 is the absolute position of the first character of s
                    search[s] = search[s] " " i1-1
                }
            }
            s0 = tail
            i0 += RLENGTH
        }
        for (s in search) {
            print s ":" search[s]
            search[s] = ""
        }
    }
'
Test input:
>TEST1
SGA.G..G.TGATGA.G..G..G.SGA.....TGA.....SGA.....G..G.SGA.G..G..G.
>TEST2
.G..G.TGA.G..G.G.....G..G..G..G.SGA.G.
Output:
>TEST1
SGA: 1 25 54
TGA: 10 13
>TEST2
SGA: 33
TGA:
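The loop above resumes after the end of each match, so a run of repeats that begins inside the previous match is not seen. Below is a minimal sketch of a variant that restarts one character after the start of each match and uses an array to drop the duplicate hits this produces. The function name find_overlapping and the seen array are illustrative, not part of the answer's code. Unlike the code above, it prints the position of the G rather than of the first letter, and it builds the pattern by concatenation on the assumption that the awk at hand may lack interval expressions such as {6,}.

```shell
# Hypothetical helper: position of the G in SGA/TGA inside runs of six or
# more xGx repeats, also considering runs that overlap an earlier match.
find_overlapping() {
  awk '
  BEGIN {
    # Six mandatory repeats followed by any number of further repeats.
    unit = "[^G]G[^G]"
    for (i = 1; i <= 6; i++) regexp = regexp unit
    regexp = regexp "(" unit ")*"
    nkeys = split("SGA TGA", keys, " ")
  }
  /^>/ { print; next }
  {
    split("", seen)          # forget hits from the previous line
    rest = $0
    base = 0                 # number of characters already cut off the front
    while (match(rest, regexp)) {
      run = substr(rest, RSTART, RLENGTH)
      start = base + RSTART  # absolute position of the first character of run
      for (k = 1; k <= nkeys; k++) {
        off = 0
        while (i = index(substr(run, off + 1), keys[k])) {
          off += i           # offset of the key inside run
          gpos = start + off # key starts at start+off-1, its G is one further
          if (!((keys[k], gpos) in seen)) {
            seen[keys[k], gpos]
            print keys[k] ": " gpos
          }
        }
      }
      base += RSTART         # restart one character after this run began
      rest = substr(rest, RSTART + 1)
    }
  }' "$@"
}

# Example: two runs sharing the character at position 18.
printf '>T1\naGaaGaaGaaGaaGaaGaGaTGAaGaaGaaGaaGa\n' | find_overlapping
```

In the example line the first run covers positions 1 to 18 and a second run of exactly six repeats starts at position 18, so a search that resumes at position 19 finds only five repeats there and misses the TGA. The restart costs extra work on long lines, since a long run is matched again from every third position.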
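Note: this only looks at non-overlapping matches of the regex. To handle overlapping matches you would have to restart the search at RSTART+1 of the previous iteration; that will generate a lot of duplicate results that you need to discard one way or another.
Upvotes: 4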
Reputation: 133458
With your shown samples, please try the following awk code. It was written and tested in GNU awk and should work in any POSIX awk. You can pass any number of search strings, and the code reports every position at which each of them occurs inside a matching part of the line. Pass the strings to search for as a comma-separated list in the awk variable named keyWords.
awk -v keyWords="SGA,TGA" '
BEGIN{
  num=split(keyWords,arr1,",")
  for(i=1;i<=num;i++){
    checkValues[arr1[i]]
  }
}
/^>/{
  print
  next
}
{
  line=$0
  offset=0                            # characters already removed from the front of line
  while(match(line,/(.G.){6,}/)){
    lineMatch=substr(line,RSTART,RLENGTH)
    start=offset+RSTART               # absolute position where lineMatch begins
    for(key in checkValues){
      from=0
      while(ind=index(substr(lineMatch,from+1),key)){
        from+=ind                     # where key begins inside lineMatch
        print key ": " (start+from)   # key begins at start+from-1; its second character (the G) is one further
      }
    }
    offset+=RSTART+RLENGTH-1
    line=substr(line,RSTART+RLENGTH)
  }
}
' Input_file
Output with shown samples will be as follows:
>Q07092
SGA: 384
Upvotes: 3
Reputation: 428
You could match all occurrences of the regular expression [ST]GA and look at the wider substring surrounding each match to compare that window to (.G.){6}. Here is some code to do that:
$ awk '
    /^>/ { label = $0 ORS; next }
    {
        while (match(substr($0, pos + 1), /[ST]GA/)) {
            pos += RSTART
            if (len = RLENGTH) {
                wbeg = pos - 18 + len    # 18 is the length of .G..G..G..G..G..G.
                wlen = 2 * 18 - len + (wbeg < 1 ? wbeg - 1 : 0)
                wbeg = (wbeg < 1 ? 1 : wbeg)    # substr must start from at least 1
                window = substr($0, wbeg, wlen)
                if (window ~ /.G..G..G..G..G..G./) {
                    str = substr($0, pos, len)
                    print label str ":", pos + int(len / 2)
                    label = ""
                }
                pos += len - 1
            }
            if (pos >= length($0)) {
                break
            }
        }
        pos = 0
    }
' file
>Q07092
SGA: 384
The output only shows SGA: 384 because that is the only portion of the example input that meets the requirement:
"I want to only identify the character position of G in SGA/TGA when they occur in the midst of six or more xGx repeats."
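The window arithmetic in the code above can be checked in isolation. This is a small sketch under stated assumptions: the helper name window_for is hypothetical, and the match length is fixed at 3 as for SGA and TGA. It prints the window start, the window length and the window itself for a match that begins at a given position, which shows how the window is shortened when the match sits close to the start of the line.

```shell
# Hypothetical helper: show the window examined around a 3-character match
# that starts at position POS of the line read from standard input.
window_for() {
  awk -v pos="$1" '{
    len = 3                  # length of SGA or TGA
    span = 18                # length of .G..G..G..G..G..G.
    wbeg = pos - span + len  # 15 characters before the match
    # Shorten the window by however far wbeg falls before position 1.
    wlen = 2 * span - len + (wbeg < 1 ? wbeg - 1 : 0)
    wbeg = (wbeg < 1 ? 1 : wbeg)
    print wbeg, wlen, substr($0, wbeg, wlen)
  }'
}

# A match at position 4 leaves only 3 characters in front of it.
printf 'abcSGAdefghijklmnopqrstuvwxyz\n' | window_for 4
```

For a match at position 4 the window starts at 1 and is 21 characters long: the 3 characters before the match, the match itself, and the 15 characters after it. Far enough from the line start the window is always 33 characters, so every 18-character stretch inside it contains the whole match.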
Upvotes: 4