Reputation: 1071
I have a file that looks like this:
NC_042565.1 RefSeq region 1 114882317 . + . ID=NC_042565.1:1..114882317;Dbxref=taxon:299123;Name=1;chromosome=1;dev-stage=adult;gbkey=Src;genome=chromosome;isolate=Mets1;mol_type=genomic DNA;sex=male;sub-species=domestica;tissue-type=blood
NC_042565.1 Gnomon gene 21625 41521 . - . ID=gene-LCMT2;Dbxref=GeneID:110474964;Name=LCMT2;gbkey=Gene;gene=LCMT2;gene_biotype=protein_coding
NC_042565.1 Gnomon mRNA 21625 41521 . - . ID=rna-XM_021538777.2;Parent=gene-LCMT2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;Name=XM_021538777.2;gbkey=mRNA;gene=LCMT2;model_evidence=Supporting evidence includes similarity to: 2 ESTs%2C 9 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1 Gnomon exon 41062 41521 . - . ID=exon-XM_021538777.2-1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1 Gnomon exon 39337 39418 . - . ID=exon-XM_021538777.2-2;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1 Gnomon exon 38834 39014 . - . ID=exon-XM_021538777.2-3;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1 Gnomon exon 36546 36702 . - . ID=exon-XM_021538777.2-4;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1 Gnomon exon 35950 36139 . - . ID=exon-XM_021538777.2-5;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1 Gnomon exon 35437 35544 . - . ID=exon-XM_021538777.2-6;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1 Gnomon exon 33345 33435 . - . ID=exon-XM_021538777.2-7;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1 Gnomon exon 30949 31197 . - . ID=exon-XM_021538777.2-8;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1 Gnomon exon 28678 28908 . - . ID=exon-XM_021538777.2-9;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1 Gnomon exon 27570 27667 . - . ID=exon-XM_021538777.2-10;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1 Gnomon exon 25692 25879 . - . ID=exon-XM_021538777.2-11;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1 Gnomon exon 25355 25490 . - . ID=exon-XM_021538777.2-12;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1 Gnomon exon 21625 23392 . - . ID=exon-XM_021538777.2-13;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XM_021538777.2;gbkey=mRNA;gene=LCMT2;product=leucine carboxyl methyltransferase 2;transcript_id=XM_021538777.2
NC_042565.1 Gnomon exon 11328398 11328458 . + . ID=id-LOC110483275;Parent=gene-LOC110483275;Dbxref=GeneID:110483275;gbkey=exon;gene=LOC110483275;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 11%25 coverage of the annotated genomic feature by RNAseq alignments
NC_042565.1 Gnomon exon 11331449 11332392 . + . ID=id-LOC110483275-2;Parent=gene-LOC110483275;Dbxref=GeneID:110483275;gbkey=exon;gene=LOC110483275;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 11%25 coverage of the annotated genomic feature by RNAseq alignments
NC_042565.1 tRNAscan-SE exon 16005736 16005808 . + . ID=exon-TRNAV-UAC-1;Parent=rna-TRNAV-UAC;Dbxref=GeneID:110483291;Note=transfer RNA valine (anticodon UAC);anticodon=(pos:16005769..16005771);gbkey=tRNA;gene=TRNAV-UAC;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Val
NC_042565.1 Gnomon exon 40513973 40514551 . + . ID=id-LOC110470572;Parent=gene-LOC110470572;Dbxref=GeneID:110470572;gbkey=exon;gene=LOC110470572;model_evidence=Supporting evidence includes similarity to: 1 Protein
NC_042565.1 Gnomon exon 40514711 40514960 . + . ID=id-LOC110470572-2;Parent=gene-LOC110470572;Dbxref=GeneID:110470572;gbkey=exon;gene=LOC110470572;model_evidence=Supporting evidence includes similarity to: 1 Protein
NC_042565.1 tRNAscan-SE exon 41451994 41452066 . + . ID=exon-TRNAF-GAA-1;Parent=rna-TRNAF-GAA;Dbxref=GeneID:110470583;Note=transfer RNA phenylalanine (anticodon GAA);anticodon=(pos:41452027..41452029);gbkey=tRNA;gene=TRNAF-GAA;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Phe
NC_042565.1 tRNAscan-SE exon 45245322 45245390 . + . ID=exon-TRNAK-CUU-1;Parent=rna-TRNAK-CUU;Dbxref=GeneID:110468118;Note=transfer RNA lysine (anticodon CUU);anticodon=(pos:45245351..45245353);gbkey=tRNA;gene=TRNAK-CUU;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Lys
NC_042565.1 tRNAscan-SE exon 49805074 49805146 . - . ID=exon-TRNAV-AAC-1;Parent=rna-TRNAV-AAC;Dbxref=GeneID:110476772;Note=transfer RNA valine (anticodon AAC);anticodon=(pos:complement(49805111..49805113));gbkey=tRNA;gene=TRNAV-AAC;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Val
NC_042565.1 tRNAscan-SE exon 49805393 49805466 . - . ID=exon-TRNAN-GUU-1;Parent=rna-TRNAN-GUU;Dbxref=GeneID:110476771;Note=transfer RNA asparagine (anticodon GUU);anticodon=(pos:complement(49805430..49805432));gbkey=tRNA;gene=TRNAN-GUU;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Asn
NC_042565.1 Gnomon exon 87281852 87281945 . + . ID=exon-id-LOC110480752-1;Parent=id-LOC110480752;Dbxref=GeneID:110480752;gbkey=V_segment;gene=LOC110480752;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns;standard_name=T cell receptor beta variable 14-like
I need to delete lines for which
$3 ~ /exon|guide_RNA|lnc_RNA\t|mRNA|snoRNA|snRNA\t|transcript/
AND the last column contains the string gene= but not transcript_id=.
I tried splitting the columns on ; with the command below, just to see whether I could at least capture the correct lines before working out how to delete them, but I keep getting the whole file as output:
awk 'BEGIN { FS = ";" } NR==1 {for(i=1;i<=NF;i++) if ($1 ~ /exon\t|guide_RNA\t|lnc_RNA\t|mRNA\t|snoRNA\t|snRNA\t|transcript\t/ && $i ~ /gene=/ && $i !~ /transcript_id=/) f=i;next} {print $f}' BFgenomic.gff
These are the lines I want to delete:
awk '$3 ~ /exon|guide_RNA|lnc_RNA|mRNA|snoRNA|snRNA|transcript/' BFgenomic.gff | grep -v transcript_id= | grep gene=
NC_042565.1 Gnomon exon 11328398 11328458 . + . ID=id-LOC110483275;Parent=gene-LOC110483275;Dbxref=GeneID:110483275;gbkey=exon;gene=LOC110483275;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 11%25 coverage of the annotated genomic feature by RNAseq alignments
NC_042565.1 Gnomon exon 11331449 11332392 . + . ID=id-LOC110483275-2;Parent=gene-LOC110483275;Dbxref=GeneID:110483275;gbkey=exon;gene=LOC110483275;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 11%25 coverage of the annotated genomic feature by RNAseq alignments
NC_042565.1 tRNAscan-SE exon 16005736 16005808 . + . ID=exon-TRNAV-UAC-1;Parent=rna-TRNAV-UAC;Dbxref=GeneID:110483291;Note=transfer RNA valine (anticodon UAC);anticodon=(pos:16005769..16005771);gbkey=tRNA;gene=TRNAV-UAC;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Val
NC_042565.1 Gnomon exon 40513973 40514551 . + . ID=id-LOC110470572;Parent=gene-LOC110470572;Dbxref=GeneID:110470572;gbkey=exon;gene=LOC110470572;model_evidence=Supporting evidence includes similarity to: 1 Protein
NC_042565.1 Gnomon exon 40514711 40514960 . + . ID=id-LOC110470572-2;Parent=gene-LOC110470572;Dbxref=GeneID:110470572;gbkey=exon;gene=LOC110470572;model_evidence=Supporting evidence includes similarity to: 1 Protein
NC_042565.1 tRNAscan-SE exon 41451994 41452066 . + . ID=exon-TRNAF-GAA-1;Parent=rna-TRNAF-GAA;Dbxref=GeneID:110470583;Note=transfer RNA phenylalanine (anticodon GAA);anticodon=(pos:41452027..41452029);gbkey=tRNA;gene=TRNAF-GAA;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Phe
NC_042565.1 tRNAscan-SE exon 45245322 45245390 . + . ID=exon-TRNAK-CUU-1;Parent=rna-TRNAK-CUU;Dbxref=GeneID:110468118;Note=transfer RNA lysine (anticodon CUU);anticodon=(pos:45245351..45245353);gbkey=tRNA;gene=TRNAK-CUU;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Lys
NC_042565.1 tRNAscan-SE exon 49805074 49805146 . - . ID=exon-TRNAV-AAC-1;Parent=rna-TRNAV-AAC;Dbxref=GeneID:110476772;Note=transfer RNA valine (anticodon AAC);anticodon=(pos:complement(49805111..49805113));gbkey=tRNA;gene=TRNAV-AAC;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Val
NC_042565.1 tRNAscan-SE exon 49805393 49805466 . - . ID=exon-TRNAN-GUU-1;Parent=rna-TRNAN-GUU;Dbxref=GeneID:110476771;Note=transfer RNA asparagine (anticodon GUU);anticodon=(pos:complement(49805430..49805432));gbkey=tRNA;gene=TRNAN-GUU;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Asn
NC_042565.1 Gnomon exon 87281852 87281945 . + . ID=exon-id-LOC110480752-1;Parent=id-LOC110480752;Dbxref=GeneID:110480752;gbkey=V_segment;gene=LOC110480752;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns;standard_name=T cell receptor beta variable 14-like
Upvotes: 1
Views: 60
Reputation: 785611
You may consider:
awk -F '\t' '!(
$3 ~ /^(exon|(guide_|lnc_|m|sno?)RNA|transcript)$/ &&
$NF ~ /(^|;)gene=/ &&
$NF !~ /(^|;)transcript_id=/
)' file
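In $3 there will be no \t, because the file is tab-delimited and the tabs are consumed as field separators. It is better to use the anchors ^ and $, as shown here.
(^|;) makes sure there are no partial matches on attribute names in the last field.
!(...) negates the whole condition, so only lines that do not meet all three tests are printed.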
Upvotes: 2
Reputation: 247012
Do you want:
awk -F'\t' '
$3 ~ /exon|guide_RNA|lnc_RNA|mRNA|snoRNA|snRNA|transcript/ && \
$NF ~/gene=/ && \
$NF !~ /transcript_id=/ {next}
{print}
' ~/tmp/file
Upvotes: 1