user53416

Reputation: 13

How to split awk field correctly

I have a file (test.bed) that looks like this (it might not be tab-separated):

chr1    10002   10116   id=1;frame=0;strand=+;  0   +
chr1    10116   10122   id=2;frame=0;strand=+;  0   +
chr1    10122   10128   id=3;frame=0;strand=+;  0   +
chr1    10128   10134   id=4;frame=0;strand=+;  0   +
chr1    10134   10140   id=5;frame=0;strand=+;  0   +
chr1    10140   10146   id=6;frame=0;strand=+;  0   +
chr1    10146   10182   id=7;frame=0;strand=+;  0   +
chr1    10182   10188   id=8;frame=0;strand=+;  0   +
chr1    10188   10194   id=9;frame=0;strand=+;  0   +
chr1    10194   10200   id=10;frame=0;strand=+; 0   +

I want to produce the following output (which should be tab-separated):

chr1    10002   10116   id=1    0   +
chr1    10116   10122   id=2    0   +
chr1    10122   10128   id=3    0   +
chr1    10128   10134   id=4    0   +
chr1    10134   10140   id=5    0   +
chr1    10140   10146   id=6    0   +
chr1    10146   10182   id=7    0   +
chr1    10182   10188   id=8    0   +
chr1    10188   10194   id=9    0   +
chr1    10194   10200   id=10   0   +

I have tried with the following code:

awk 'OFS="\t" split ($0, a, ";"){print a[1],$5,$6}' test.bed 

But then I get:

chr1    10002   10116   id=1    40  4+
chr1    10116   10122   id=2    40  4+
chr1    10122   10128   id=3    40  4+
chr1    10128   10134   id=4    40  4+
chr1    10134   10140   id=5    40  4+
chr1    10140   10146   id=6    40  4+
chr1    10146   10182   id=7    40  4+
chr1    10182   10188   id=8    40  4+
chr1    10188   10194   id=9    40  4+
chr1    10194   10200   id=10   40  4+

What am I doing wrong? Somehow the number '4' is added to the last two fields. I thought the '4' might have something to do with splitting the 4th field. However, I tried a similar file where the 3rd field was the one being split, and the '4' was still added to the last two fields. I am rather new to awk, so I guess it is a syntax error. Any help would be appreciated.

Upvotes: 1

Views: 279

Answers (2)

Chris Seymour

Reputation: 85765

If you set the field separator to whitespace or a semicolon, you won't have to handle the splitting yourself:

$ awk '{print $1,$2,$3,$4,$8,$9}' FS='[[:space:]]+|;' OFS='\t' file
chr1    10002   10116   id=1    0   +
chr1    10116   10122   id=2    0   +
chr1    10122   10128   id=3    0   +
chr1    10128   10134   id=4    0   +
chr1    10134   10140   id=5    0   +
chr1    10140   10146   id=6    0   +
chr1    10146   10182   id=7    0   +
chr1    10182   10188   id=8    0   +
chr1    10188   10194   id=9    0   +
chr1    10194   10200   id=10   0   +
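
The jump from $4 to $8 follows from how that separator splits the record. A minimal sketch, assuming a space-separated sample line modelled on the first row of the question, that prints every field with its number:

```shell
# Hypothetical sample line, modelled on the first row of test.bed
line='chr1  10002  10116  id=1;frame=0;strand=+;  0  +'

# Print each field with its index so the numbering is visible
fields=$(printf '%s\n' "$line" |
  awk -F'[[:space:]]+|;' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
```

The attributes land in $4 to $6, the empty string between the trailing semicolon and the whitespace after it becomes $7, and the last two columns end up in $8 and $9.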

As for what you are doing wrong in:

awk 'OFS="\t" split ($0, a, ";"){print a[1],$5,$6}'
  • An awk program is a series of condition{block} rules. Assigning OFS and calling split() are not conditions. They are statements that belong inside the block.
  • You also don't need to set OFS on every line, so it should be initialized only once. You can do this with the -v option, in a BEGIN block, or as an assignment after the script.
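
As for where the 4 comes from: in the condition position, awk reads OFS="\t" split(...) as one assignment whose right-hand side is the concatenation of "\t" and the return value of split(), which is the number of pieces (4 for these records). So OFS ends up as a tab followed by "4". A small sketch to make that visible, assuming a sample line modelled on the question's data:

```shell
# Hypothetical sample line, modelled on the first row of test.bed
line='chr1  10002  10116  id=1;frame=0;strand=+;  0  +'

# split() returns the number of pieces it produced
count=$(printf '%s\n' "$line" | awk '{ print split($0, a, ";") }')

# Same condition as in the question; the tab in OFS is shown as <TAB>
ofs=$(printf '%s\n' "$line" |
  awk 'OFS = "\t" split($0, a, ";") { gsub(/\t/, "<TAB>", OFS); print OFS }')

printf 'split returned %s, OFS became %s\n' "$count" "$ofs"
```

Since print puts OFS between its arguments, every comma in print a[1],$5,$6 emits a tab plus "4", which matches the output in the question.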

Valid alternatives:

$ awk -v OFS='\t' '{split($0,a,";");print a[1],$5,$6}' file

$ awk 'BEGIN{OFS="\t"}{split($0,a,";");print a[1],$5,$6}' file

$ awk '{split ($0,a,";");print a[1],$5,$6}' OFS='\t' file
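
Note that a[1] keeps whatever whitespace the input had between the first four columns, so if the input is not tab-separated, neither is that part of the output. One possible variation, sketched under the assumption that the fourth column never contains whitespace, trims the field in place and lets awk rebuild the record with OFS:

```shell
# Hypothetical sample line, separated by runs of spaces rather than tabs
line='chr1  10002  10116  id=1;frame=0;strand=+;  0  +'

# Drop everything from the first ";" in field 4. The $1 = $1 assignment
# forces awk to rebuild the record with OFS even if sub() changed nothing.
result=$(printf '%s\n' "$line" |
  awk -v OFS='\t' '{ sub(/;.*/, "", $4); $1 = $1; print }')

printf '%s\n' "$result"
```

This relies on the default field splitting (runs of blanks and tabs), so it treats space-separated and tab-separated input the same way.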

Upvotes: 1

Sidharth C. Nadhan

Reputation: 2243

Try this:

awk -F\; '{print $1,$4}' test.bed

Upvotes: 1
