Madza Farias-Virgens

Reputation: 1071

Delete records that match multiple pattern conditions across columns with awk

I have a file that looks like

NC_042565.1     RefSeq  region  1       114882317       .       +       .       ID=NC_042565.1:1..114882317;Dbxref=taxon:299123;Name=1;chromosome=1;dev-stage=adult;gbkey=Src;genome=chromosome;isolate=Mets1;mol_type=genomic DNA;sex=male;sub-species=domestica;tissue-type=blood
NC_042565.1     Gnomon  gene    21625   41521   .       -       .       ID=gene-LCMT2;Dbxref=GeneID:110474964;Name=LCMT2;gbkey=Gene;gene=LCMT2;gene_biotype=protein_coding
NC_042565.1     Gnomon  mRNA    21625   41521   .       -       .       ID=rna-XM_021538777.2;Parent=gene-LCMT2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;Name=XM_021538777.2;gbkey=mRNA;gene=LCMT2;model_evidence=Supporting evidence includes similarity to: 2 ESTs%2C 9 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1     Gnomon  exon    41062   41521   .       -       .       ID=exon-XM_021538777.2-1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1     Gnomon  exon    39337   39418   .       -       .       ID=exon-XM_021538777.2-2;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1     Gnomon  exon    38834   39014   .       -       .       ID=exon-XM_021538777.2-3;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1     Gnomon  exon    36546   36702   .       -       .       ID=exon-XM_021538777.2-4;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1     Gnomon  exon    35950   36139   .       -       .       ID=exon-XM_021538777.2-5;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1     Gnomon  exon    35437   35544   .       -       .       ID=exon-XM_021538777.2-6;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1     Gnomon  exon    33345   33435   .       -       .       ID=exon-XM_021538777.2-7;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1     Gnomon  exon    30949   31197   .       -       .       ID=exon-XM_021538777.2-8;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1     Gnomon  exon    28678   28908   .       -       .       ID=exon-XM_021538777.2-9;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1     Gnomon  exon    27570   27667   .       -       .       ID=exon-XM_021538777.2-10;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1     Gnomon  exon    25692   25879   .       -       .       ID=exon-XM_021538777.2-11;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1     Gnomon  exon    25355   25490   .       -       .       ID=exon-XM_021538777.2-12;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1     Gnomon  exon    21625   23392   .       -       .       ID=exon-XM_021538777.2-13;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1     Gnomon  exon    11328398        11328458        .       +       .       ID=id-LOC110483275;Parent=gene-LOC110483275;Dbxref=GeneID:110483275;gbkey=exon;gene=LOC110483275;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 11%25 coverage of the annotated genomic feature by RNAseq alignments
NC_042565.1     Gnomon  exon    11331449        11332392        .       +       .       ID=id-LOC110483275-2;Parent=gene-LOC110483275;Dbxref=GeneID:110483275;gbkey=exon;gene=LOC110483275;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 11%25 coverage of the annotated genomic feature by RNAseq alignments
NC_042565.1     tRNAscan-SE     exon    16005736        16005808        .       +       .       ID=exon-TRNAV-UAC-1;Parent=rna-TRNAV-UAC;Dbxref=GeneID:110483291;Note=transfer RNA valine (anticodon UAC);anticodon=(pos:16005769..16005771);gbkey=tRNA;gene=TRNAV-UAC;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Val
NC_042565.1     Gnomon  exon    40513973        40514551        .       +       .       ID=id-LOC110470572;Parent=gene-LOC110470572;Dbxref=GeneID:110470572;gbkey=exon;gene=LOC110470572;model_evidence=Supporting evidence includes similarity to: 1 Protein
NC_042565.1     Gnomon  exon    40514711        40514960        .       +       .       ID=id-LOC110470572-2;Parent=gene-LOC110470572;Dbxref=GeneID:110470572;gbkey=exon;gene=LOC110470572;model_evidence=Supporting evidence includes similarity to: 1 Protein
NC_042565.1     tRNAscan-SE     exon    41451994        41452066        .       +       .       ID=exon-TRNAF-GAA-1;Parent=rna-TRNAF-GAA;Dbxref=GeneID:110470583;Note=transfer RNA phenylalanine (anticodon GAA);anticodon=(pos:41452027..41452029);gbkey=tRNA;gene=TRNAF-GAA;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Phe
NC_042565.1     tRNAscan-SE     exon    45245322        45245390        .       +       .       ID=exon-TRNAK-CUU-1;Parent=rna-TRNAK-CUU;Dbxref=GeneID:110468118;Note=transfer RNA lysine (anticodon CUU);anticodon=(pos:45245351..45245353);gbkey=tRNA;gene=TRNAK-CUU;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Lys
NC_042565.1     tRNAscan-SE     exon    49805074        49805146        .       -       .       ID=exon-TRNAV-AAC-1;Parent=rna-TRNAV-AAC;Dbxref=GeneID:110476772;Note=transfer RNA valine (anticodon AAC);anticodon=(pos:complement(49805111..49805113));gbkey=tRNA;gene=TRNAV-AAC;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Val
NC_042565.1     tRNAscan-SE     exon    49805393        49805466        .       -       .       ID=exon-TRNAN-GUU-1;Parent=rna-TRNAN-GUU;Dbxref=GeneID:110476771;Note=transfer RNA asparagine (anticodon GUU);anticodon=(pos:complement(49805430..49805432));gbkey=tRNA;gene=TRNAN-GUU;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Asn
NC_042565.1     Gnomon  exon    87281852        87281945        .       +       .       ID=exon-id-LOC110480752-1;Parent=id-LOC110480752;Dbxref=GeneID:110480752;gbkey=V_segment;gene=LOC110480752;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns;standard_name=T cell receptor beta variable 14-like

I need to delete lines for which

$3 ~ /exon|guide_RNA|lnc_RNA\t|mRNA|snoRNA|snRNA\t|transcript/

AND

the last column contains the string gene= but not the string transcript_id=

I tried splitting the fields on ; with the command below, just to see whether I could at least capture the correct lines before working out how to delete them, but I keep getting the whole file back as output:

 awk 'BEGIN { FS = ";" } NR==1 {for(i=1;i<=NF;i++) if ($1 ~ /exon\t|guide_RNA\t|lnc_RNA\t|mRNA\t|snoRNA\t|snRNA\t|transcript\t/ && $i ~ /gene=/ && $i !~ /transcript_id=/) f=i;next} {print $f}' BFgenomic.gff 

Lines I want to delete, as found by this pipeline:

awk '$3 ~ /exon|guide_RNA|lnc_RNA|mRNA|snoRNA|snRNA|transcript/' BFgenomic.gff | grep -v transcript_id= | grep gene=
NC_042565.1     Gnomon  exon    11328398        11328458        .       +       .       ID=id-LOC110483275;Parent=gene-LOC110483275;Dbxref=GeneID:110483275;gbkey=exon;gene=LOC110483275;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 11%25 coverage of the annotated genomic feature by RNAseq alignments
NC_042565.1     Gnomon  exon    11331449        11332392        .       +       .       ID=id-LOC110483275-2;Parent=gene-LOC110483275;Dbxref=GeneID:110483275;gbkey=exon;gene=LOC110483275;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 11%25 coverage of the annotated genomic feature by RNAseq alignments
NC_042565.1     tRNAscan-SE     exon    16005736        16005808        .       +       .       ID=exon-TRNAV-UAC-1;Parent=rna-TRNAV-UAC;Dbxref=GeneID:110483291;Note=transfer RNA valine (anticodon UAC);anticodon=(pos:16005769..16005771);gbkey=tRNA;gene=TRNAV-UAC;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Val
NC_042565.1     Gnomon  exon    40513973        40514551        .       +       .       ID=id-LOC110470572;Parent=gene-LOC110470572;Dbxref=GeneID:110470572;gbkey=exon;gene=LOC110470572;model_evidence=Supporting evidence includes similarity to: 1 Protein
NC_042565.1     Gnomon  exon    40514711        40514960        .       +       .       ID=id-LOC110470572-2;Parent=gene-LOC110470572;Dbxref=GeneID:110470572;gbkey=exon;gene=LOC110470572;model_evidence=Supporting evidence includes similarity to: 1 Protein
NC_042565.1     tRNAscan-SE     exon    41451994        41452066        .       +       .       ID=exon-TRNAF-GAA-1;Parent=rna-TRNAF-GAA;Dbxref=GeneID:110470583;Note=transfer RNA phenylalanine (anticodon GAA);anticodon=(pos:41452027..41452029);gbkey=tRNA;gene=TRNAF-GAA;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Phe
NC_042565.1     tRNAscan-SE     exon    45245322        45245390        .       +       .       ID=exon-TRNAK-CUU-1;Parent=rna-TRNAK-CUU;Dbxref=GeneID:110468118;Note=transfer RNA lysine (anticodon CUU);anticodon=(pos:45245351..45245353);gbkey=tRNA;gene=TRNAK-CUU;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Lys
NC_042565.1     tRNAscan-SE     exon    49805074        49805146        .       -       .       ID=exon-TRNAV-AAC-1;Parent=rna-TRNAV-AAC;Dbxref=GeneID:110476772;Note=transfer RNA valine (anticodon AAC);anticodon=(pos:complement(49805111..49805113));gbkey=tRNA;gene=TRNAV-AAC;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Val
NC_042565.1     tRNAscan-SE     exon    49805393        49805466        .       -       .       ID=exon-TRNAN-GUU-1;Parent=rna-TRNAN-GUU;Dbxref=GeneID:110476771;Note=transfer RNA asparagine (anticodon GUU);anticodon=(pos:complement(49805430..49805432));gbkey=tRNA;gene=TRNAN-GUU;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Asn
NC_042565.1     Gnomon  exon    87281852        87281945        .       +       .       ID=exon-id-LOC110480752-1;Parent=id-LOC110480752;Dbxref=GeneID:110480752;gbkey=V_segment;gene=LOC110480752;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns;standard_name=T cell receptor beta variable 14-like

Upvotes: 1

Views: 60

Answers (2)

anubhava

Reputation: 785611

You may consider:

awk -F '\t' '
   !($3 ~ /^(exon|(guide_|lnc_|m|sno?)RNA|transcript)$/ &&
     $NF ~ /(^|;)gene=/ &&
     $NF !~ /(^|;)transcript_id=/)
' file
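  • With -F '\t', $3 holds only the feature type and can never contain a tab, so any alternative ending in \t cannot match. Anchor the pattern with ^ and $ instead, as shown here.
  • For the last field, use (^|;) so that gene= and transcript_id= match only whole attribute keys and not the tail of a longer key.
  • Note that the alternation for $3 is refactored so that the RNA types share a single RNA suffix.
  • Note that the whole condition is wrapped in a negation, !(...), so only lines that do not satisfy it are printed.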
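To see what the anchors change, here is a minimal sketch that runs the same three conditions over a few made-up records. The names filter_gff and make_sample are hypothetical helpers, and the sample records are invented for illustration rather than taken from BFgenomic.gff.

```shell
# filter_gff is a hypothetical wrapper name; the three awk conditions are
# the ones from the answer above. With no arguments it reads stdin.
filter_gff() {
  awk -F '\t' '
    !($3 ~ /^(exon|(guide_|lnc_|m|sno?)RNA|transcript)$/ &&
      $NF ~ /(^|;)gene=/ &&
      $NF !~ /(^|;)transcript_id=/)
  ' "$@"
}

# Made-up 9-column records (assumption: not real data). printf reuses the
# format string for every group of nine arguments, one output line per group.
make_sample() {
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr1 src exon 1 10 . + . 'ID=e1;gene=A' \
    chr1 src exon 1 10 . + . 'ID=e2;gene=A;transcript_id=T1' \
    chr1 src gene 1 10 . + . 'ID=g1;gene=A' \
    chr1 src primary_transcript 1 10 . + . 'ID=p1;gene=A' \
    chr1 src exon 1 10 . + . 'ID=e3;pseudogene=B'
}

# Show the feature type and attributes of the records that survive.
make_sample | filter_gff | cut -f3,9
```

Only the e1 record should be dropped. The primary_transcript and pseudogene=B records survive because of ^...$ and (^|;); with unanchored patterns they would be deleted as well, which may or may not be what you want.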

Upvotes: 2

glenn jackman

Reputation: 247012

Do you want:

awk -F'\t' '
    $3 ~ /exon|guide_RNA|lnc_RNA|mRNA|snoRNA|snRNA|transcript/ && \
      $NF ~/gene=/ && \
      $NF !~ /transcript_id=/ {next}
    {print}
' ~/tmp/file
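Note that awk only prints the filtered lines and leaves the input file untouched. If the goal is to delete the records from the file itself, one possible sketch is to write to a temporary file and copy the result back. The function name delete_records is hypothetical, and the sketch assumes mktemp is available on your system.

```shell
# delete_records is a hypothetical helper name; it rewrites the file given as $1.
# Assumption: mktemp exists (it does on common Linux and macOS systems).
delete_records() {
  file=$1
  tmp=$(mktemp) || return 1
  if awk -F '\t' '
    $3 ~ /exon|guide_RNA|lnc_RNA|mRNA|snoRNA|snRNA|transcript/ &&
      $NF ~ /gene=/ &&
      $NF !~ /transcript_id=/ { next }   # skip the records to delete
    { print }                             # keep everything else
  ' "$file" > "$tmp"; then
    # Copy back rather than mv, so the original file keeps its permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

It would be called as delete_records BFgenomic.gff. Running the plain awk command first and inspecting its output is a sensible check before overwriting anything.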

Upvotes: 1
