jnorth

Reputation: 115

awk ...regular expression ..exceeds implementation size limit

Does anyone have any insight into this error? Can it be fixed and, if so, how best?

awk: line 1: regular expression /splice_acc ... exceeds implementation size limit

The command used in my bash script was:

grep -v '^##' $IN | awk 'BEGIN{FS=" "; OFS=" "} $1~/#CHROM/ || $10~/^1\/1/ && ($11~/^1\/0/ || $11~/^0\/0/ || $11~/^0\/1/) && $1~/^[0-9X]*$/ && /splice_acceptor_variant|splice_donor_variant|splice_region_variant|stop_lost|start_lost|stop_gained|missense_variant|coding_sequence_variant|inframe_insertion|disruptive_inframe_insertion|inframe_deletion|disruptive_inframe_deletion|exon_variant|exon_loss_variant|exon_loss_variant|duplication|inversion|frameshift_variant|feature_ablation|duplication|gene_fusion|bidirectional_gene_fusion|rearranged_at_DNA_level|miRNA|initiator_codon_variant|start_retained/ {$3=$7=""; print $0}' | sed 's/ */ /g' | awk '{split($9,a,":"); split(a[2],b,","); if (b[1]>b[2] || $1~/#CHROM/) print $0}' > $OUT

Thanks for any help given, very much appreciated.

Thank you for your suggestions!

A sample of the input is:

Chr1 926694 . C T 2510.49 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=82;CIGAR=1X;DP=85;DPB=85;DPRA=0;EPP=6.82362;EPPR=9.52472;GTI=0;LEN=1;MEANALT=1;MQM=57.0854;MQMR=60;NS=1;NUMALT=1;ODDS=108.152;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=2916;QR=42;RO=3;RPL=46;RPP=5.65844;RPPR=9.52472;RPR=36;RUN=1;SAF=45;SAP=4.70511;SAR=37;SRF=0;SRP=9.52472;SRR=3;TYPE=snp;ANN=T|upstream_gene_variant|MODIFIER|AT1G03720|AT1G03720|transcript|AT1G03720.1|protein_coding||c.-321G>A|||||321|,T|downstream_gene_variant|MODIFIER|AT1G03700|AT1G03700|transcript|AT1G03700.1|protein_coding||c.*4850C>T|||||4793|,T|downstream_gene_variant|MODIFIER|AT1G03710|AT1G03710|transcript|AT1G03710.1|protein_coding||c.*2407C>T|||||1968|,T|downstream_gene_variant|MODIFIER|AT1G03730|AT1G03730|transcript|AT1G03730.1|protein_coding||c.*4323G>A|||||4134|,T|downstream_gene_variant|MODIFIER|AT1G03710|AT1G03710|transcript|AT1G03710.2|protein_coding||c.*2407C>T|||||2339|,T|intergenic_region|MODIFIER|AT1G03720-AT1G03730|AT1G03720-AT1G03730|intergenic_region|AT1G03720-AT1G03730|||n.926694C>T|||||| GT:DP:AD:RO:QR:AO:QA:GL 1/1:85:3,82:3:42:82:2916:-252.316,-21.6676,0

Upvotes: 1

Views: 512

Answers (1)

Guy

Reputation: 647

Rather than trying to fit everything into one expression, I've split it into smaller pieces, as I was struggling to get my head round the whole thing.

BEGIN {
    FS=" "; 
    OFS=" "
    # this is your big list of words that was making awk choke.
    # this list is available to the function test_words.
    split("splice_acceptor_variant splice_donor_variant splice_region_variant"\
          " stop_lost start_lost stop_gained missense_variant coding_sequence_variant"\
          " inframe_insertion disruptive_inframe_insertion inframe_deletion"\
          " disruptive_inframe_deletion exon_variant exon_loss_variant exon_loss_variant"\
          " duplication inversion frameshift_variant feature_ablation duplication"\
          " gene_fusion bidirectional_gene_fusion rearranged_at_DNA_level"\
          " miRNA initiator_codon_variant start_retained", test_word_arr)
} 

function test_words(hs,    i) {
    # if any words from test_word_arr are in the string passed 
    # to this function, return true (i is declared as a local variable)
    for (i in test_word_arr) {
        if (match(hs, test_word_arr[i])) return 1;
    }
    return 0;
}

# apply the initial grep command (grep -v '^##')
/^##/ { next }

# it appears to me that any line whose first field matches '#CHROM'
# should be printed with minimal editing - it has automatically passed
# the test for the second `awk` script 
$1 ~ /#CHROM/ {
    $3 = "";
    $7 = "";
    gsub(/  */, " ")
    print $0
    # the header line is done, so skip the remaining rules
    next
}

# these were all the conditions that were expected to be true to
# perform the final processing. So they can be checked off one 
# by one, and if any are *not* true, the line can be skipped.
$10 !~ /^1\/1/ { next }
$11 !~ /^(1\/0|0\/[01])/ { next }
$1  !~ /^[X[:digit:]]*$/ { next }
# this is performing the test that couldn't be done previously
test_words($0) == 0 { next }

{
    # finally, any line still being assessed has 'passed' so 
    # perform the processing from your first awk script.
    $3 = "";
    $7="";

    # this is basically the following `sed` script
    gsub(/  */, " ")

    # and this is the final awk script
    split($9, a, ":"); split(a[2], b, ",");
    if (b[1] > b[2])
        print $0
}

As there is no example input/output, this is untested, so it may need to be checked and edited.
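To show the word-list idea on its own, here is a minimal sketch that can be run without any VCF data. It is not the script above: the function names `filter_words` and `has_word`, the shortened word list and the three input lines are all made up for illustration. It also swaps `match()` for `index()`, which looks for a literal substring, so no regular expression is compiled for the words at all. That should sidestep the size limit, assuming none of the words is meant to be a pattern.

```shell
#!/bin/sh
# Sketch only: filter_words and the shortened word list are illustrative.
filter_words() {
    awk '
    BEGIN {
        # Shortened list; the full list from the question would go here.
        n = split("missense_variant stop_gained frameshift_variant", words, " ")
    }
    # Return 1 if any listed word occurs in s as a literal substring.
    function has_word(s,    i) {
        for (i = 1; i <= n; i++)
            if (index(s, words[i])) return 1
        return 0
    }
    has_word($0)    # a bare pattern prints the matching line
    '
}

# Made-up input lines, not real VCF records.
printf '%s\n' \
    'a missense_variant here' \
    'b upstream_gene_variant here' \
    'c stop_gained here' | filter_words
```

This should print the first and third lines only. To run the longer script above, one option is to save it to a file (the name `filter.awk` is just an example) and call `awk -f filter.awk "$IN" > "$OUT"`.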

Upvotes: 1
