biocode

Reputation: 109

Bash script error in Ubuntu: awk: line 1: regular expression exceeds implementation size limit

I am trying to run this command on an annotated file generated by snpEff (my OS is Ubuntu):

grep -v '^##' /home/zee/fdr_vs_wt.snp.annotated.vcf | awk 'BEGIN{FS=" "; OFS=" "} $1~/SL2.50chch/ || $10~/^1\/1/ && ($11~/^1\/0/ || $11~/^0\/0/ || $11~/^0\/1/) && $1~/^[0-9X]*$/ && /splice_acceptor_variant|splice_donor_variant|splice_region_variant|stop_lost|start_lost|stop_gained|missense_variant|coding_sequence_variant|inframe_insertion|disruptive_inframe_insertion|inframe_deletion|disruptive_inframe_deletion|exon_variant|exon_loss_variant|exon_loss_variant|duplication|inversion|frameshift_variant|feature_ablation|duplication|gene_fusion|bidirectional_gene_fusion|rearranged_at_DNA_level|miRNA|initiator_codon_variant|start_retained/ {$3=$7=""; print $0}' | sed 's/  */ /g' | awk '{split($9,a,":"); split(a[2],b,","); if (b[1]>b[2] || $1~/SL2.50ch/) print $0}' > /home/zee/fdr_vs_wt.raw.vcfmutantbulk.cands2.txt

I get the following error:

awk: line 1: regular expression /splice_acc ... exceeds implementation size limit

Can anyone please help? I know a similar question was asked by someone else a while ago, but I am not technically strong and did not understand the solutions given. Thanks in advance.

I also intend to use this command in my Java GUI later. I will run it through ProcessBuilder with the following code:

    speciesFastaVersionCH = "SL2.50";

    String longInputcmd4b = "ch/ || $10~/^1\\/1/ && ($11~/^1\\/0/ || $11~/^0\\/0/ || $11~/^0\\/1/) && $1~/^[0-9X]*$/ && /splice_acceptor_variant|splice_donor_variant|splice_region_variant|stop_lost|start_lost|stop_gained|missense_variant|coding_sequence_variant|inframe_insertion|disruptive_inframe_insertion|inframe_deletion|disruptive_inframe_deletion|exon_variant|exon_loss_variant|exon_loss_variant|duplication|inversion|frameshift_variant|feature_ablation|duplication|gene_fusion|bidirectional_gene_fusion|rearranged_at_DNA_level|miRNA|initiator_codon_variant|start_retained/ {$3=$7=\"\"; print $0}' | sed 's/  */ /g' | awk '{split($9,a,\":\"); split(a[2],b,\",\"); if (b[1]>b[2] || $1~/";
    StringBuilder cmd4 = new StringBuilder().append("\"").append("grep -v '^##' ").append(outputFilecmd3).append(" | awk 'BEGIN{FS=\" \"; OFS=\" \"} $1~/").append(speciesFastaVersionCH).append(longInputcmd4b).append(speciesFastaVersionCH).append("ch/) print $0}' > ").append(outputFilecmd5).append("\"");



    System.out.println("Here is cmd4:" + cmd4.toString());
    String [] gatkArray1 = cmd1.split(" ");
    String [] gatkArray2 = cmd2.split(" ");
    String [] gatkArray3 = {"bash", "-c", cmd3};


    String [][] gatkArrays = {gatkArray1, gatkArray2, gatkArray3};


    ProcessBuilder pb = new ProcessBuilder(gatkArray3);
    pb.redirectOutput(ProcessBuilder.Redirect.INHERIT);
    pb.redirectError(ProcessBuilder.Redirect.INHERIT);
    Process p = pb.start();

Upvotes: 1

Views: 260

Answers (1)

that other guy

Reputation: 123550

Your implementation of awk doesn't support regular expressions of that length.

Specifically, you are using mawk, where a regex literal is limited to about 400 characters including the enclosing //:

$ true | mawk "/$(printf '%397s')/"
(no output)

$ true | mawk "/$(printf '%398s')/"
mawk: line 1: regular expression /           ... exceeds implementation size limit

You can either rewrite your awk script to use shorter regex literals (POSIX only guarantees support for regular expressions up to 256 bytes), or switch to an implementation like gawk, where the only limit is Linux's maximum size for a single argument, 128 KiB:

$ true | gawk "/$(printf '%131069s')/"
(no output)

$ true | gawk "/$(printf '%131070s')/"
bash: /usr/bin/gawk: Argument list too long
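
If switching to gawk is not an option, here is a minimal sketch of the other route: splitting one long alternation into several shorter regex literals joined with ||. The sample records and the function names sample and filter_effects are made up for illustration, and the effect list is abbreviated, so treat this as a starting point rather than a drop-in replacement:

```shell
#!/usr/bin/env bash

# Hypothetical sample lines standing in for annotated VCF records.
sample() {
  printf '%s\n' \
    '1 100 . A T 50 PASS ANN=T|missense_variant|MODERATE' \
    '1 200 . G C 50 PASS ANN=C|synonymous_variant|LOW' \
    '2 300 . C G 50 PASS ANN=G|frameshift_variant|HIGH'
}

# The long alternation is split into two shorter literals joined by ||.
# Each literal stays well below both mawk's limit and the 256 bytes
# that POSIX guarantees, so this should work with any awk.
filter_effects() {
  awk '
    /splice_acceptor_variant|splice_donor_variant|splice_region_variant|stop_lost|start_lost|stop_gained|missense_variant/ ||
    /inframe_insertion|inframe_deletion|exon_loss_variant|frameshift_variant|gene_fusion|initiator_codon_variant/ { print }
  '
}

# Keep only the records whose annotation matches either literal.
result=$(sample | filter_effects)
printf '%s\n' "$result"
```

When the split test is combined with other conditions using &&, as in the command from the question, it needs parentheses around it, ( /.../ || /.../ ), because && binds more tightly than || in awk.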

Upvotes: 1
