Reputation: 303

How to delete specific columns in a file without any formatting change

My input file looks like this:

SL3.0ch00   maker_ITAG  exon    16480   16794   .   +   .   transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_name "Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00   maker_ITAG  exon    16879   17940   .   +   .   transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_name "Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00   maker_ITAG  CDS 16480   16794   .   +   0   transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_name "Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00   maker_ITAG  CDS 16879   17940   .   +   0   transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_name "Solyc00g005000.3"; gene_biotype "protein_coding";

Desired output:

SL3.0ch00   maker_ITAG  exon    16480   16794   .   +   .   transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00   maker_ITAG  exon    16879   17940   .   +   .   transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00   maker_ITAG  CDS 16480   16794   .   +   0   transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00   maker_ITAG  CDS 16879   17940   .   +   0   transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";

I want to remove the gene_name "Some name"; field from all the rows. I used the following:

awk '{$13=$14=""; print $0}' input_file

But this changes the formatting of the first few columns (the tabs are replaced by spaces). Kindly help. Any other command or way to do this is also fine.

Upvotes: 1

Answers (3)

Ed Morton

Reputation: 203522

You have some fields separated by tabs and others separated by semi-colons followed by an optional blank. You can tell awk to split on both using FS="\t|; ?", which will correctly identify your fields, but the specific separators around each field won't be preserved and you'll need them later to put the record back together. That's why GNU awk's split() function was given a 4th arg, so it can save both the fields and the separators. In your case you'd use it as:

nf = split($0,flds,/\t|; ?/,seps)

Look at what that does for the first record in your input:

$ cat tst.awk
{
    nf = split($0,flds,/\t|; ?/,seps)
}
NR == 1 {
    printf "$0=<%s>\n", $0
    for (i=1; i<=nf; i++) {
        printf "  flds[%d] = <%s>\n", i, flds[i]
        printf "  seps[%d] = <%s>\n", i, seps[i]
    }
}

$ awk -f tst.awk file
$0=<SL3.0ch00   maker_ITAG      exon    16480   16794   .       +       .       transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_name "Solyc00g005000.3"; gene_biotype "protein_coding";>
  flds[1] = <SL3.0ch00>
  seps[1] = <   >
  flds[2] = <maker_ITAG>
  seps[2] = <   >
  flds[3] = <exon>
  seps[3] = <   >
  flds[4] = <16480>
  seps[4] = <   >
  flds[5] = <16794>
  seps[5] = <   >
  flds[6] = <.>
  seps[6] = <   >
  flds[7] = <+>
  seps[7] = <   >
  flds[8] = <.>
  seps[8] = <   >
  flds[9] = <transcript_id "mRNA:Solyc00g005000.3.1">
  seps[9] = <; >
  flds[10] = <gene_id "gene:Solyc00g005000.3">
  seps[10] = <; >
  flds[11] = <gene_name "Solyc00g005000.3">
  seps[11] = <; >
  flds[12] = <gene_biotype "protein_coding">
  seps[12] = <;>
  flds[13] = <>
  seps[13] = <>

See how not only do you have access to each field in the flds[] array but also the separators around each field in the seps[] array? So to delete a field all you have to do is set the appropriate element in the arrays to null and recombine the record:

$ cat tst.awk
{
    nf = split($0,flds,/\t|; ?/,seps)

    flds[11] = seps[11] = ""

    $0 = join(nf,flds,seps)

    print
}
function join(n,f,s,   i,o) {for (i=1;i<=n;i++) o=o f[i] s[i]; return o}

$ awk -f tst.awk file
SL3.0ch00       maker_ITAG      exon    16480   16794   .       +       .       transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00       maker_ITAG      exon    16879   17940   .       +       .       transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00       maker_ITAG      CDS     16480   16794   .       +       0       transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00       maker_ITAG      CDS     16879   17940   .       +       0       transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
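
For context on why the command in the question changed the tabs: whenever a field is assigned to, awk rebuilds $0 by joining the fields with OFS, which defaults to a single space. A minimal sketch of that effect, using a made-up line rather than the real input (the variable names line and rebuilt are illustrative only):

```shell
# A tab-separated sample line (illustrative data, not from the input file).
line=$(printf 'a\tb\tc d')

# Default whitespace splitting gives 4 fields; assigning to $4 makes awk
# rebuild $0 with OFS (a single space), so the tabs are lost.
rebuilt=$(printf '%s\n' "$line" | awk '{ $4 = ""; print }')

# od -c shows tabs as \t, so the difference between the two lines is visible.
printf '%s\n' "$line"    | od -c | head -n 1
printf '%s\n' "$rebuilt" | od -c | head -n 1
```

The split()/join approach above avoids this because it never assigns to a field; it only replaces $0 as a whole with a string that already contains the original separators.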

Upvotes: 1

RavinderSingh13

Reputation: 133518

Could you please try the following. (Use -F"\t" if your Input_file is TAB delimited.)

awk 'match($0,/ gene_name[^;]*/){print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH+1);next} 1' Input_file

Adding a non-one-liner form of the solution, with explanation.

awk '
match($0,/ gene_name[^;]*/){                               ##match() looks for " gene_name" followed by everything up to (not including) the next semicolon.
  print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH+1)  ##Print the text before the match, then the text starting after the semicolon that follows it. RSTART and RLENGTH are built-in variables that match() sets when the regex matches.
  next                                                     ##next is a built-in statement that skips the remaining rules for this line.
}
1                                                          ##1 prints the lines that did not match the gene_name regex.
' Input_file                                               ##Input_file name goes here.
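
A related sketch, assuming every gene_name value is double-quoted and followed by a semicolon as in the sample: the same deletion can be written with sub(), which edits $0 in place and never assigns to a field, so the tabs are left alone. The file names and the shortened two-line data below are made up for illustration.

```shell
# Work in a scratch directory with a small tab-separated sample
# (shortened, illustrative data; the second line has no gene_name).
tmp=$(mktemp -d)
printf 'chr0\tsrc\texon\t1\t9\t.\t+\t.\ttranscript_id "t1"; gene_id "g1"; gene_name "n1"; gene_biotype "pc";\n' > "$tmp/sample.gtf"
printf 'chr0\tsrc\tCDS\t1\t9\t.\t+\t0\ttranscript_id "t1"; gene_id "g1"; gene_biotype "pc";\n' >> "$tmp/sample.gtf"

# sub() removes the first ' gene_name "...";' from $0; the trailing 1 prints
# every line, whether or not a substitution happened.
awk '{ sub(/ gene_name "[^"]*";/, "") } 1' "$tmp/sample.gtf" > "$tmp/sample.out"
cat "$tmp/sample.out"
```

Lines without a gene_name attribute pass through unchanged, since sub() simply finds nothing to replace.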

Upvotes: 1

Cyrus

Reputation: 88636

With awk:

awk 'BEGIN{FS=OFS=";"} {print $1,$2,$4,$5}' file

With sed:

sed 's/gene_name "[^"]*"; //' file

Output:

SL3.0ch00   maker_ITAG  exon    16480   16794   .   +   .   transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00   maker_ITAG  exon    16879   17940   .   +   .   transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00   maker_ITAG  CDS 16480   16794   .   +   0   transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";
SL3.0ch00   maker_ITAG  CDS 16879   17940   .   +   0   transcript_id "mRNA:Solyc00g005000.3.1"; gene_id "gene:Solyc00g005000.3"; gene_biotype "protein_coding";

See: The Stack Overflow Regular Expressions FAQ
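
One way to confirm that only the gene_name attribute was removed and the tabs survived is to compare the first eight tab-separated columns of the input and the output. A rough sketch, assuming the sed command above and a made-up one-line sample (in.gtf and out.gtf are placeholder names):

```shell
# Scratch directory with a one-line, tab-separated sample (illustrative data).
tmp=$(mktemp -d)
printf 'chr0\tsrc\texon\t1\t9\t.\t+\t.\ttranscript_id "t1"; gene_id "g1"; gene_name "n1"; gene_biotype "pc";\n' > "$tmp/in.gtf"

sed 's/gene_name "[^"]*"; //' "$tmp/in.gtf" > "$tmp/out.gtf"

# cut splits on tabs by default; if any tab had been turned into a space,
# the column boundaries would shift and the comparison would fail.
if [ "$(cut -f1-8 "$tmp/in.gtf")" = "$(cut -f1-8 "$tmp/out.gtf")" ]; then
    status="columns 1-8 unchanged"
else
    status="columns 1-8 differ"
fi
echo "$status"
```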

Upvotes: 1
