How to split awk field correctly

Question

I have a file (test.bed) that looks like this (which might not be tab-separated):

chr1    10002   10116   id=1;frame=0;strand=+;  0   +
chr1    10116   10122   id=2;frame=0;strand=+;  0   +
chr1    10122   10128   id=3;frame=0;strand=+;  0   +
chr1    10128   10134   id=4;frame=0;strand=+;  0   +
chr1    10134   10140   id=5;frame=0;strand=+;  0   +
chr1    10140   10146   id=6;frame=0;strand=+;  0   +
chr1    10146   10182   id=7;frame=0;strand=+;  0   +
chr1    10182   10188   id=8;frame=0;strand=+;  0   +
chr1    10188   10194   id=9;frame=0;strand=+;  0   +
chr1    10194   10200   id=10;frame=0;strand=+; 0   +

I want to produce the following output (which should be tab-separated):

chr1    10002   10116   id=1    0   +
chr1    10116   10122   id=2    0   +
chr1    10122   10128   id=3    0   +
chr1    10128   10134   id=4    0   +
chr1    10134   10140   id=5    0   +
chr1    10140   10146   id=6    0   +
chr1    10146   10182   id=7    0   +
chr1    10182   10188   id=8    0   +
chr1    10188   10194   id=9    0   +
chr1    10194   10200   id=10   0   +

I have tried with the following code:

awk 'OFS="\t" split ($0, a, ";"){print a[1],$5,$6}' test.bed

But then I get:

chr1    10002   10116   id=1    40  4+
chr1    10116   10122   id=2    40  4+
chr1    10122   10128   id=3    40  4+
chr1    10128   10134   id=4    40  4+
chr1    10134   10140   id=5    40  4+
chr1    10140   10146   id=6    40  4+
chr1    10146   10182   id=7    40  4+
chr1    10182   10188   id=8    40  4+
chr1    10188   10194   id=9    40  4+
chr1    10194   10200   id=10   40  4+

What am I doing wrong? Somehow the number '4' is added to the last two fields. I thought the number '4' somehow might have something to do with splitting in the 4th field, however, I tried producing a similar file where it was the 3rd field that was split, and still got the number '4' added to the last two fields. I am rather new to 'awk' so I guess it is an error in the syntax. Any help would be appreciated.

Chris Seymour · Accepted Answer

If you set the field separator to whitespace or semicolons, you won't have to handle the splitting yourself:

$ awk '{print $1,$2,$3,$4,$8,$9}' FS='[[:space:]]+|;' OFS='\t' file
chr1    10002   10116   id=1    0   +
chr1    10116   10122   id=2    0   +
chr1    10122   10128   id=3    0   +
chr1    10128   10134   id=4    0   +
chr1    10134   10140   id=5    0   +
chr1    10140   10146   id=6    0   +
chr1    10146   10182   id=7    0   +
chr1    10182   10188   id=8    0   +
chr1    10188   10194   id=9    0   +
chr1    10194   10200   id=10   0   +
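It may not be obvious why the fields printed are $8 and $9 rather than $7 and $8. A small sketch, assuming one sample line from the question held in a shell variable (the names `line` and `fields` are just for illustration), lists every field with its index under the combined separator:

```shell
# One sample line from test.bed (spaces between columns, as in the question).
line='chr1    10002   10116   id=1;frame=0;strand=+;  0   +'

# Print each field with its index, using the same separator as above.
fields=$(printf '%s\n' "$line" |
  awk -F'[[:space:]]+|;' '{for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i}')

printf '%s\n' "$fields"
```

The last semicolon and the whitespace that follows it count as two separate separators, so field 7 comes out empty and the final two columns land in $8 and $9.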

As for what you are doing wrong in:

awk 'OFS="\t" split ($0, a, ";"){print a[1],$5,$6}'

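The syntax of awk is condition{block}, and setting the value of OFS and calling split() are not conditions. They are statements that belong inside the block. Because they sit in the condition position here, awk evaluates them as one expression: the tab is concatenated with the return value of split() (the number of pieces, which is 4 for these lines) and the result is assigned to OFS. That is where the extra '4' in your output comes from.
However, you don't need to set the value of OFS on every line, so it should be initialized only once. You can do this with the -v option, in a BEGIN block, or as an assignment after the script.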
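To check where the stray '4' comes from, here is a minimal sketch, again assuming a single sample line in a shell variable (`line`, `n` and `ofs` are illustrative names). It prints what split() returns and what OFS holds when the original condition is evaluated:

```shell
line='chr1    10002   10116   id=1;frame=0;strand=+;  0   +'

# split() returns the number of pieces; three semicolons give four pieces.
n=$(printf '%s\n' "$line" | awk '{print split($0, a, ";")}')

# In the condition position, OFS="\t" split(...) is parsed as
# OFS = ("\t" split(...)), so OFS becomes a tab followed by that count.
ofs=$(printf '%s\n' "$line" | awk 'OFS="\t" split($0, a, ";") {print OFS}')

printf 'split returned: %s\n' "$n"
printf 'OFS without its leading tab: %s\n' "${ofs#?}"
```

Both lines should report 4, which matches the digit that appears in front of the last two columns of the question's output.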

Valid alternatives:

$ awk -v OFS='\t' '{split($0,a,";");print a[1],$5,$6}' file

$ awk 'BEGIN{OFS="\t"}{split($0,a,";");print a[1],$5,$6}' file

$ awk '{split ($0,a,";");print a[1],$5,$6}' OFS='\t' file
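In the commands above, a[1] holds everything up to the first semicolon, so the first four columns keep whatever whitespace the input had between them. If every column should be joined by a tab, one possible variant (a sketch, not part of the original answer) is to split only the 4th field and assign the first piece back to it. Assigning to a field makes awk rebuild the record using OFS:

```shell
line='chr1    10002   10116   id=1;frame=0;strand=+;  0   +'

# Split only $4 on ";" and keep its first piece; assigning to $4 makes awk
# rebuild $0 with OFS (a tab) between all fields.
out=$(printf '%s\n' "$line" |
  awk -v OFS='\t' '{split($4, a, ";"); $4 = a[1]; print}')

printf '%s\n' "$out"
```

To run it on the real data, the awk command can read test.bed directly instead of the sample line.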
