Reputation: 303
My input file looks like this:
SL3.0ch00 maker_ITAG exon 16480 16794 . + . transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_name "Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00 maker_ITAG exon 16879 17940 . + . transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_name "Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00 maker_ITAG CDS 16480 16794 . + 0 transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_name "Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00 maker_ITAG CDS 16879 17940 . + 0 transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_name "Solyc00g005000.3"; gene_biotype "protein_coding";
Desired output:
SL3.0ch00 maker_ITAG exon 16480 16794 . + . transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00 maker_ITAG exon 16879 17940 . + . transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00 maker_ITAG CDS 16480 16794 . + 0 transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00 maker_ITAG CDS 16879 17940 . + 0 transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
I want to remove the "gene_name "Some name";" field from all the rows. I used the following:
awk '{prinf$13=$14=""; print $0}' input_file
But the formatting of the first few columns changes (spaces appear instead of tabs). Kindly help. Any other command or approach is also fine.
Upvotes: 1
Views: 73
Reputation: 203522
You have some fields separated by tabs and others separated by semicolons followed by an optional blank. You can tell awk to split on both using FS="\t|; ?", which will correctly identify your fields, but the specific separators around each field won't be preserved, and you'll need them later to put the record back together. That's why GNU awk's split() function was given a 4th arg: so it can save both the fields and the separators. In your case you'd use it as:
nf = split($0,flds,/\t|; ?/,seps)
Look at what that does for the first record in your input:
$ cat tst.awk
{
nf = split($0,flds,/\t|; ?/,seps)
}
NR == 1 {
printf "$0=<%s>\n", $0
for (i=1; i<=nf; i++) {
printf " flds[%d] = <%s>\n", i, flds[i]
printf " seps[%d] = <%s>\n", i, seps[i]
}
}
$ awk -f tst.awk file
$0=<SL3.0ch00 maker_ITAG exon 16480 16794 . + . transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_name "Solyc00g005000.3"; gene_biotype "protein_coding";>
flds[1] = <SL3.0ch00>
seps[1] = < >
flds[2] = <maker_ITAG>
seps[2] = < >
flds[3] = <exon>
seps[3] = < >
flds[4] = <16480>
seps[4] = < >
flds[5] = <16794>
seps[5] = < >
flds[6] = <.>
seps[6] = < >
flds[7] = <+>
seps[7] = < >
flds[8] = <.>
seps[8] = < >
flds[9] = <transcript_id "mRNA:Solyc00g005000.3.1">
seps[9] = <; >
flds[10] = <gene_id "gene:Solyc00g005000.3">
seps[10] = <; >
flds[11] = <gene_name "Solyc00g005000.3">
seps[11] = <; >
flds[12] = <gene_biotype "protein_coding">
seps[12] = <;>
flds[13] = <>
seps[13] = <>
See how you have access not only to each field in the flds[] array but also to the separators around each field in the seps[] array? So to delete a field, all you have to do is set the appropriate elements in both arrays to null and recombine the record:
$ cat tst.awk
{
nf = split($0,flds,/\t|; ?/,seps)
flds[11] = seps[11] = ""
$0 = join(nf,flds,seps)
print
}
function join(n,f,s, i,o) {for (i=1;i<=n;i++) o=o f[i] s[i]; return o}
$ awk -f tst.awk file
SL3.0ch00 maker_ITAG exon 16480 16794 . + . transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00 maker_ITAG exon 16879 17940 . + . transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00 maker_ITAG CDS 16480 16794 . + 0 transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00 maker_ITAG CDS 16879 17940 . + 0 transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
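The 4-argument split() above is specific to GNU awk, and the index 11 assumes gene_name is always the 11th field. As a rough sketch only (not part of the answer above), the same idea can be approximated in POSIX awk by peeling off one field and its separator at a time with match(), and dropping the field by name instead of by position. The helper name strip_gene_name is made up for illustration, and the sample record is rebuilt with printf on the assumption that the first eight columns are tab-separated.

```shell
#!/bin/sh
# strip_gene_name: hypothetical helper name, reads records on stdin.
strip_gene_name() {
    awk '
    {
        rest = $0; out = ""
        # Peel off one field plus its trailing separator (tab, or ";" and an optional blank).
        while (match(rest, /\t|; ?/)) {
            fld  = substr(rest, 1, RSTART - 1)
            sep  = substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
            # Keep every field and separator verbatim, except gene_name and the separator after it.
            if (fld !~ /^gene_name /) out = out fld sep
        }
        print out rest
    }'
}

# First record from the question, assuming tabs between the first eight columns.
printf 'SL3.0ch00\tmaker_ITAG\texon\t16480\t16794\t.\t+\t.\ttranscript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_name "Solyc00g005000.3"; gene_biotype "protein_coding";\n' | strip_gene_name
```

Like the split() version, this treats every semicolon as a separator, so it would misbehave on attribute values that themselves contain a semicolon.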
Upvotes: 1
Reputation: 133518
Could you please try the following? (Use -F"\t" if your Input_file is tab-delimited.)
awk 'match($0,/ gene_name[^;]*/){print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH+1);next} 1' Input_file
Here is the same solution in multi-line form, with explanations:
awk '
match($0,/ gene_name[^;]*/){ ##match() looks for the regex: a space, the keyword gene_name, and everything up to (but not including) the next semicolon.
print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH+1) ##Print the text before the match (characters 1 to RSTART-1) followed by the text from RSTART+RLENGTH+1 onward, which also skips the semicolon. RSTART and RLENGTH are built-in variables that match() sets when the regex matches.
next ##next is a built-in statement that skips all further statements for the current line.
}
1 ##1 is an always-true condition, so lines with no gene_name match are printed unchanged.
' Input_file ##The input file name goes here.
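To make the RSTART/RLENGTH arithmetic concrete, here is a small sketch on a made-up toy record (not from the question). The function name show_match is hypothetical. It prints the two variables and then the same substr() concatenation used above.

```shell
#!/bin/sh
# show_match: hypothetical helper that reports where the regex matched.
show_match() {
    awk 'match($0, / gene_name[^;]*/) {
        printf "RSTART=%d RLENGTH=%d\n", RSTART, RLENGTH
        # Text before the match, then text after the match and its semicolon.
        print substr($0, 1, RSTART - 1) substr($0, RSTART + RLENGTH + 1)
    }'
}

echo 'a; gene_name "x"; b;' | show_match
# RSTART=3 RLENGTH=14
# a; b;
```

The match starts at the blank before gene_name (character 3) and covers 14 characters, so RSTART+RLENGTH points at the semicolon and the extra +1 steps past it.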
Upvotes: 1
Reputation: 88636
With awk:
awk 'BEGIN{FS=OFS=";"} {print $1,$2,$4,$5}' file
With sed:
sed 's/gene_name "[^"]*"; //' file
Output:
SL3.0ch00 maker_ITAG exon 16480 16794 . + . transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00 maker_ITAG exon 16879 17940 . + . transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00 maker_ITAG CDS 16480 16794 . + 0 transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00 maker_ITAG CDS 16879 17940 . + 0 transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
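For comparison with the attempt in the question: assigning to a field (such as $13) makes awk rebuild the record by joining the fields with OFS, a single space by default, which is why the tabs turned into spaces. A sketch of an awk counterpart to the sed command, which edits $0 as a whole with sub() and therefore leaves the tabs alone, might look like this. The function name drop_gene_name is made up, and the sample record assumes tab-separated leading columns.

```shell
#!/bin/sh
# drop_gene_name: hypothetical helper, reads records on stdin.
drop_gene_name() {
    # sub() with no target edits the whole record in place; no field is assigned,
    # so awk never rebuilds the record with OFS and the original tabs survive.
    awk '{ sub(/gene_name "[^"]*"; /, "") } 1'
}

printf 'SL3.0ch00\tmaker_ITAG\tCDS\t16480\t16794\t.\t+\t0\ttranscript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_name "Solyc00g005000.3"; gene_biotype "protein_coding";\n' | drop_gene_name
```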
See: The Stack Overflow Regular Expressions FAQ
Upvotes: 1