Reputation: 115
Does anyone have any insight into this error? Can it be fixed and, if so, how best?
awk: line 1: regular expression /splice_acc ... exceeds implementation size limit
The expression used in my bash script was...
grep -v '^##' $IN | awk 'BEGIN{FS=" "; OFS=" "} $1~/#CHROM/ || $10~/^1\/1/ && ($11~/^1\/0/ || $11~/^0\/0/ || $11~/^0\/1/) && $1~/^[0-9X]*$/ && /splice_acceptor_variant|splice_donor_variant|splice_region_variant|stop_lost|start_lost|stop_gained|missense_variant|coding_sequence_variant|inframe_insertion|disruptive_inframe_insertion|inframe_deletion|disruptive_inframe_deletion|exon_variant|exon_loss_variant|exon_loss_variant|duplication|inversion|frameshift_variant|feature_ablation|duplication|gene_fusion|bidirectional_gene_fusion|rearranged_at_DNA_level|miRNA|initiator_codon_variant|start_retained/ {$3=$7=""; print $0}' | sed 's/ */ /g' | awk '{split($9,a,":"); split(a[2],b,","); if (b[1]>b[2] || $1~/#CHROM/) print $0}' > $OUT
Thanks for any help given, very much appreciated.
Thank you for your suggestions!
A sample of the input is:
Chr1 926694 . C T 2510.49 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=82;CIGAR=1X;DP=85;DPB=85;DPRA=0;EPP=6.82362;EPPR=9.52472;GTI=0;LEN=1;MEANALT=1;MQM=57.0854;MQMR=60;NS=1;NUMALT=1;ODDS=108.152;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=2916;QR=42;RO=3;RPL=46;RPP=5.65844;RPPR=9.52472;RPR=36;RUN=1;SAF=45;SAP=4.70511;SAR=37;SRF=0;SRP=9.52472;SRR=3;TYPE=snp;ANN=T|upstream_gene_variant|MODIFIER|AT1G03720|AT1G03720|transcript|AT1G03720.1|protein_coding||c.-321G>A|||||321|,T|downstream_gene_variant|MODIFIER|AT1G03700|AT1G03700|transcript|AT1G03700.1|protein_coding||c.*4850C>T|||||4793|,T|downstream_gene_variant|MODIFIER|AT1G03710|AT1G03710|transcript|AT1G03710.1|protein_coding||c.*2407C>T|||||1968|,T|downstream_gene_variant|MODIFIER|AT1G03730|AT1G03730|transcript|AT1G03730.1|protein_coding||c.*4323G>A|||||4134|,T|downstream_gene_variant|MODIFIER|AT1G03710|AT1G03710|transcript|AT1G03710.2|protein_coding||c.*2407C>T|||||2339|,T|intergenic_region|MODIFIER|AT1G03720-AT1G03730|AT1G03720-AT1G03730|intergenic_region|AT1G03720-AT1G03730|||n.926694C>T|||||| GT:DP:AD:RO:QR:AO:QA:GL 1/1:85:3,82:3:42:82:2916:-252.316,-21.6676,0
Upvotes: 1
Views: 512
Reputation: 647
Rather than trying to fit everything into one expression, I've separated it out into smaller pieces, as I was struggling to wrap my head round the whole thing.
BEGIN {
    FS = " ";
    OFS = " ";
    # this is your big list of words that was making awk choke.
    # this list is available to the function test_words.
    split("splice_acceptor_variant splice_donor_variant splice_region_variant"\
          " stop_lost start_lost stop_gained missense_variant coding_sequence_variant"\
          " inframe_insertion disruptive_inframe_insertion inframe_deletion"\
          " disruptive_inframe_deletion exon_variant exon_loss_variant exon_loss_variant"\
          " duplication inversion frameshift_variant feature_ablation duplication"\
          " gene_fusion bidirectional_gene_fusion rearranged_at_DNA_level"\
          " miRNA initiator_codon_variant start_retained", test_word_arr)
}
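function test_words(hs,    i) {
    # if any words from test_word_arr are in the string passed
    # to this function, return true
    # (i is declared as an extra parameter to keep it local)
    for (i in test_word_arr) {
        # the words are literal text, so a plain substring search
        # with index() is enough and no regex has to be built
        if (index(hs, test_word_arr[i])) return 1;
    }
    return 0;
}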
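# this replaces the initial `grep -v '^##'`
/^##/ { next }
# it appears to me that any string that starts '#CHROM' should
# be printed with minimal editing - it has automatically passed
# the test for the second `awk` script
$1 ~ /#CHROM/ {
    $3 = "";
    $7 = "";
    # collapse the runs of spaces left behind by the emptied fields
    # (/ */ would also match the empty string between every character)
    gsub(/ +/, " ");
    print $0;
    # the header line is finished with, so skip the remaining rules
    next
}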
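# these were all the conditions that were expected to be true to
# perform the final processing. So they can be checked off one
# by one, and if any are *not* true, the line can be skipped.
$10 !~ /^1\/1/ { next }
# the alternatives go inside one regex: a bare /re/ is shorthand for
# $0 ~ /re/, so (/a/ || /b/) would compare $11 against 0 or 1
$11 !~ /^(1\/0|0\/[01])/ { next }
# [0-9X] as in the original command; older mawk versions may not
# understand bracket classes such as [:digit:]
$1 !~ /^[0-9X]*$/ { next }
# this is performing the test that couldn't be done previously
test_words($0) == 0 { next }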
{
    # finally, any line still being assessed has 'passed' so
    # perform the processing from your first awk script.
    $3 = "";
    $7 = "";
    # this is basically the following `sed` script. Changing $0
    # re-splits the fields, so with 11 input columns $9 below is
    # what used to be $11, just as in the original pipeline
    gsub(/ +/, " ");
    # and this is the final awk script
    split($9, a, ":"); split(a[2], b, ",");
    if (b[1] > b[2])
        print $0
}
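A side note on testing one field against several patterns: in awk a bare /re/ on its own is shorthand for $0 ~ /re/, so an expression such as (/a/ || /b/) evaluates to 0 or 1 against the whole line before any field comparison happens. Putting the alternatives inside a single grouped regex avoids that. Below is a minimal sketch of the idea; the function name match_genotype and the two-column input are made up for illustration and are not part of the script above.

```shell
# Hypothetical demo: print the first column of every line whose second
# column starts with 1/0, 0/0 or 0/1.
match_genotype() {
    # one grouped regex holds all the alternatives for the field
    awk '$2 ~ /^(1\/0|0\/[01])/ { print $1 }'
}

# s1 and s2 match; s3 (1/1) does not
printf '%s\n' 's1 1/0:12' 's2 0/1:30' 's3 1/1:85' | match_genotype
```

The same grouping works with !~ when the aim is to skip lines that do not match.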
As there is no example input/output, this is untested, so it may need checking and editing.
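If it helps to check the word-list idea on its own before running the full script, here is a minimal sketch. It assumes a shortened word list and made-up input lines, and the function name filter_by_words is hypothetical. It relies on index(), which does a plain substring search, so no large regex has to be compiled.

```shell
# Hypothetical helper: print only the lines that contain one of the
# listed words. The word list is shortened here for illustration.
filter_by_words() {
    awk '
    BEGIN {
        # split() returns the number of words placed into the array
        n = split("stop_lost start_lost stop_gained missense_variant", words, " ")
    }
    {
        for (k = 1; k <= n; k++) {
            # index() returns 0 when the word is absent from the line
            if (index($0, words[k])) { print; next }
        }
    }'
}

# only the first line contains a word from the list
printf '%s\n' \
    'Chr1 100 ANN=T|missense_variant|MODERATE' \
    'Chr1 200 ANN=T|upstream_gene_variant|MODIFIER' |
    filter_by_words
```

Because each word is checked separately, the list can grow without hitting the regex size limit that the single long alternation ran into.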
Upvotes: 1