Reputation: 35

Interval expressions in gawk to awk

I hope this is an easy fix.

I originally wrote a short script that uses gawk, mainly because that is what I found when I was first solving the problem. I now need to adapt it to use only awk.

sample file.fasta:

>gene1

>gene235
ATGCTTAGATTTACAATTCAGAAATTCCTGGTCTATTAACCCTCCTTCACTTTTCACTTTTCCCTAACCCTTCAAAATTTTATATCCAATCTTCTCACCCTCTACAATAATACATTTATTATCCTCTTACTTCAAAATTTTT

>gene335
ATGCTCCTTCTTAATCTAAACCTTCAAAATTTTCCCCCTCACATTTATCCATTATCACCTTCATTTCGGAATCCTTAACTAAATACAATCATCAACCATCTTTTAACATAACTTCTTCAAAATTTTACCAACTTACTATTGCTTCAAAATTTTTCAT

>gene406
ATGTACCACACACCCCCATCTTCCATTTTCCCTTTATTCTCCTCACCTCTACAATCCCCTTAATTCCTCTTCAAAATTTTTGGAGCCCTTAACTTTCAATAACTTCAAAATTTTTCACCATACCAATAATATCCCTCTTCAAAATTTTCCACACTCACCAAC


gawk '/[ACTG]{21,}GG/{print a; print}{a=$0}' file.fasta >"species_precrispr".fasta

What I know works in awk is the following:

awk '/[ACTG]GG/{print a; print}{a=$0}' file.fasta >"species_precrispr".fasta

The culprit, therefore, is the interval expression {21,}.

What I want is to match each line that contains at least 21 nucleotides to the left of my "GG" match.

Can anyone help?

Edit:

Thanks for all the help. Several of the solutions worked. In reply to some of the comments, here is a more basic example of the input and the desired output.

Prior to awk command: cat file1.fasta

>gene1
ATGCCTTAACTTTCAATAACTGG 
>gene2
ATGGGTGCCTTAACTTTCAATAACTG
>gene3
ATGTCAAAATTTTTCATTTCAAT
>gene4
ATCCTTTTTTTTGGGTCAAAATTAAA
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG

The following commands all produced the same desired output:

Original code:

gawk '/[ACTG]{21,}GG/{print a; print}{a=$0}' file1.fasta

Slight modification of the original that enables interval expressions with --re-interval (needed for GNU awk 3.x.x):

awk --re-interval '/[ACTG]{21,}GG/{print a; print}{a=$0}' file1.fasta

Lets the threshold be changed through usr_count and gives the correct output; untested, but should work with older versions of awk:

awk -v usr_count="21" '/gene/{id=$0;next} match($0,/.*GG/){val=substr($0,RSTART,RLENGTH-2);if(gsub(/[ACTG]/,"&",val)>= usr_count){print id ORS $0};id=""}' file1.fasta

awk --re-interval '/^>/ && seq { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq ORS} /^>/{name=$0; seq=""; next} {seq = seq $0 } END { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq ORS }' file1.fasta

Desired output: only the gene names and sequences that have at least 21 nucleotides before a matching GG:

>gene1
ATGCCTTAACTTTCAATAACTGG 
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG

Lastly, the discarded lines, for reference:

>gene2
ATG-GG-TGCCTTAACTTTCAATAACTG # only 3 nt prior to any GG combo 
>gene3
ATGTCAAAATTTTTCATTTCAAT # No GG match found 
>gene4
ATCCTTTTTTTTGGGTCAAAATTAAA # only 14 nt prior to any GG combo

Hope this helps others!

Upvotes: 2

Answers (3)

Ed Morton

Reputation: 203635

Sounds like what you want is:

awk 'match($0,/[ACTG]+GG/) && RLENGTH>22{print a; print} {a=$0}' file

but this is probably all you need given the sample input you provided:

awk 'match($0,/.*GG/) && RLENGTH>22{print a; print} {a=$0}' file

They'll both work in any awk.

Using your updated sample input:

$ awk 'match($0,/.*GG/) && RLENGTH>22{print a; print} {a=$0}' file
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG
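
The threshold of 22 follows from the match length: 21 nucleotides plus the two characters of GG make 23, so a qualifying match has RLENGTH > 22. A minimal sketch of that boundary, assuming lines that contain only A, C, T and G; the test strings are generated inside awk purely for illustration:

```shell
# Build one string with 20 and one with 21 nucleotides before "GG",
# then apply the same match()/RLENGTH test to each.
out=$(awk 'BEGIN {
  for (n = 20; n <= 21; n++) {
    s = ""
    for (i = 0; i < n; i++) s = s "A"   # n nucleotides
    s = s "GG"                          # followed by GG
    hit = (match(s, /[ACTG]+GG/) && RLENGTH > 22) ? "match" : "no match"
    print n " nt before GG: " hit
  }
}')
printf '%s\n' "$out"
```

Only the 21-nucleotide string passes, which is the same cut-off as {21,} in the gawk version.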

Upvotes: 0

RavinderSingh13

Reputation: 133528

EDIT: Per the OP's comment that the gene IDs need to be printed too, try the following.

awk '
/gene/{
  id=$0
  next
}
match($0,/.*GG/){
  val=substr($0,RSTART,RLENGTH-2)
  if(gsub(/[ACTG]/,"&",val)>=21){
    print id ORS $0
  }
  id=""
}
' Input_file

OR a one-liner form of the above solution, as per the OP's request:

awk '/gene/{id=$0;next} match($0,/.*GG/){val=substr($0,RSTART,RLENGTH-2);if(gsub(/[ACTG]/,"&",val)>=21){print id ORS $0};id=""}' Input_file

Could you please try the following, written and tested with the shown samples only.

awk '
match($0,/.*GG/){
  val=substr($0,RSTART,RLENGTH-2)
  if(gsub(/[ACTG]/,"&",val)>=21){
    print
  }
}
' Input_file

OR a more generic approach, where the variable usr_count holds the minimum number of nucleotides that must be present before GG.

awk -v usr_count="21" '
match($0,/.*GG/){
  val=substr($0,RSTART,RLENGTH-2)
  if(gsub(/[ACTG]/,"&",val)>=usr_count){
    print
  }
}
'  Input_file

Explanation: a detailed explanation of the above.

awk '                                ##Start of the awk program.
match($0,/.*GG/){                    ##Use match() to match everything up to and including the last GG on the current line.
  val=substr($0,RSTART,RLENGTH-2)    ##Store the matched text without its trailing GG (RLENGTH-2 characters from RSTART) in val.
  if(gsub(/[ACTG]/,"&",val)>=21){    ##gsub() replaces each A, C, T or G with itself and returns the number of replacements; check that this count is at least 21.
    print                            ##Print the current line.
  }
}
' Input_file                         ##Name of the input file.
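
To see what the gsub() call counts, here is a small sketch on a made-up line (ATGNNCGG is a hypothetical input, not from the question). Replacing each character with "&" leaves val unchanged, so the return value is just the number of A/C/T/G characters in it:

```shell
# The line contains two N characters, so the count of A/C/T/G before the
# last GG is smaller than the length of that prefix.
out=$(printf 'ATGNNCGG\n' | awk 'match($0, /.*GG/) {
  val = substr($0, RSTART, RLENGTH - 2)   # text before the last GG: "ATGNNC"
  n = gsub(/[ACTG]/, "&", val)            # number of A/C/T/G characters; val is unchanged
  print val, length(val), n
}')
printf '%s\n' "$out"   # ATGNNC 6 4
```

Note that this counts every A, C, T or G before the last GG, whether or not they are contiguous. On lines made up only of those four letters, as in the shown samples, that equals the length of the prefix.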

Upvotes: 2

kvantour

Reputation: 26481

GNU awk has accepted interval expressions in regular expressions since version 3.0. However, they only became enabled by default in version 4.0. If you have GNU awk 3.x.x, you have to pass the flag --re-interval to enable them.

awk --re-interval '/a{3,6}/{print}' file
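
If you are not sure what the awk on a given machine does, a quick probe can tell you. This is a sketch, assuming awk is on the PATH; the variable name result is only for illustration:

```shell
# An awk without interval support treats the braces in a{3} literally,
# so the input line "aaa" only matches when {3} is read as a repeat count.
if printf 'aaa\n' | awk '/^a{3}$/ { found = 1 } END { exit !found }'; then
  result="intervals supported"
else
  result="intervals NOT supported"
fi
printf '%s\n' "$result"
```

If this reports no support, either try the --re-interval flag (GNU awk 3.x.x) or use a solution that avoids interval expressions altogether.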

There is an issue that people often overlook when processing FASTA files with awk. When you have multi-line sequences, a match may span multiple lines. To handle this, you need to combine the lines of each sequence first.

The easiest way to process FASTA files with awk is to build up a variable called name and a variable called seq. Every time you have read a full sequence, you can process it. Note that, for this to work well, the sequence should be stored as one continuous string that contains no newlines or other whitespace. A generic awk program for processing FASTA looks like this:

awk '/^>/ && seq { **process_sequence_here** }
     /^>/{name=$0; seq=""; next}
     {seq = seq $0 }
     END { **process_sequence_here** }' file.fasta

In the presented case, your sequence processing looks like:

awk '/^>/ && seq { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq ORS}
     /^>/{name=$0; seq=""; next}
     {seq = seq $0 }
     END { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq ORS }' file.fasta
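
For an awk without interval expressions, the same record-joining pattern can be paired with a length check instead. The following is only a sketch under stated assumptions: sequences contain nothing but A, C, T and G, the two-record input is made up for illustration (gene1 is the question's gene1 wrapped over two lines), and report() is a hypothetical helper name:

```shell
# Neither line of gene1 has 21 nucleotides before a GG on its own;
# the joined sequence does.
out=$(printf '%s\n' '>gene1' 'ATGCCTTAACTT' 'TCAATAACTGG' \
                    '>gene2' 'ATGGGTGCC' 'TTAACTG' |
awk '
  function report() {   # hypothetical helper: print the record if it qualifies
    # 21 nucleotides + "GG" = 23 characters, hence RLENGTH > 22
    if (match(seq, /[ACTG]+GG/) && RLENGTH > 22) print name ORS seq
  }
  /^>/ { if (seq != "") report(); name = $0; seq = ""; next }
  { seq = seq $0 }      # join wrapped sequence lines into one string
  END { if (seq != "") report() }
')
printf '%s\n' "$out"
```

This prints the gene1 header followed by the joined sequence ATGCCTTAACTTTCAATAACTGG. A line-by-line test would have missed it, because the GG sits on the second line with only 9 nucleotides before it.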

Upvotes: 2
