Buthetleon

Reputation: 1305

Using a multi-character field separator in AWK

I'm having problems with AWK's field delimiter. The input file looks like this:

1 | all | | synonym |
1 | root | | scientific name |
2 | Bacteria | Bacteria | scientific name |
2 | Monera | Monera | in-part |
2 | Procaryotae | Procaryotae | in-part |
2 | Prokaryota | Prokaryota | in-part |
2 | Prokaryotae | Prokaryotae | in-part |
2 | bacteria | bacteria | blast name |

The field delimiter here is tab, pipe, tab (\t|\t), so in my attempt to print just the 1st and 2nd columns I ran:

awk -F'\t|\t' '{print $1 "\t" $2}' nodes.dmp | less

Instead of the desired output, I get the 1st column followed by the pipe character. I tried escaping the pipe (\t\|\t), but the output remains the same.

1 |
1 |
2 |
2 |
2 |
2 |

Printing the 1st and 3rd columns gave me the output I originally wanted.

awk -F'\t|\t' '{print $1 "\t" $3}' nodes.dmp | less

But I'm puzzled as to why this is not working as intended.

I understand that the Perl one-liner below will work, but what I really want is to use awk.

perl -aln -F"\t\|\t" -e 'print $F[0],"\t",$F[1]' nodes.dmp | less

Upvotes: 4

Views: 6604

Answers (3)

rook

Reputation: 6240

Using the cut command:

 cut -f1,2 -d'|' file.txt 

Without the pipe in the output:

 cut -f1,2 -d'|' file.txt | tr -d '|'
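One thing to watch with this approach: cutting on the pipe leaves the tabs on either side of it in the output. A minimal sketch on made-up rows (the sample data and the variable names are assumptions for illustration, not taken from nodes.dmp), which also shows that cutting on tab, cut's default delimiter, gives the two columns directly as tab-fields 1 and 3:

```shell
# Hypothetical rows in the tab-pipe-tab layout from the question.
sample=$(printf '1\t|\tall\t|\t\t|\tsynonym\t|\n2\t|\tBacteria\t|\tBacteria\t|\tscientific name\t|\n')

# Cutting on the pipe and deleting it keeps the tabs that surrounded it,
# so each line comes out as: 1<TAB><TAB>all<TAB>
with_tabs=$(printf '%s\n' "$sample" | cut -f1,2 -d'|' | tr -d '|')

# Cutting on tab makes every pipe a field of its own, so the first two
# data columns are tab-fields 1 and 3.
on_tabs=$(printf '%s\n' "$sample" | cut -f1,3)

printf '%s\n' "$with_tabs"
printf '%s\n' "$on_tabs"
```

This matches what the question observed with awk: once the line is split on single tabs, the second data column is field 3.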

Upvotes: 0

Ed Morton

Reputation: 203493

From your posted input:

  1. your lines can end in |, not |\t, and
  2. you have cases (the first 2 lines) where the input contains |\t|, and
  3. your lines start with a tab

So an FS of tab-pipe-tab is wrong, since it won't handle any of the above cases. In the first, the line ends in just tab-pipe. In the second, the tab in the middle is consumed by the tab-pipe-tab that ends the preceding field, which leaves only pipe-tab in front of the following field. The third leaves you with an undesirable leading tab.

What you actually need is to set the FS to just tab-pipe and then strip off the leading tab from each field:

awk -F'\t[|]' -v OFS='\t' '{sub(/^\t/,""); gsub(/[|]\t/,"|"); print $1, $2}' file

That way you can handle all fields from 1 to NF-1 exactly the same as each other.
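To illustrate the NF-1 point, here is a minimal sketch on made-up rows, written with a bracketed pipe ([|]) so that the separator is matched literally. The sample data and the variable names are assumptions for illustration only. It strips a leading tab and the tab after each pipe, splits on tab-pipe, and prints the count of real fields followed by the first, second and last real field:

```shell
# Hypothetical rows: tab-pipe-tab between fields, tab-pipe at the end of
# each line, and an empty third field in the first row.
sample=$(printf '1\t|\tall\t|\t\t|\tsynonym\t|\n2\t|\tMonera\t|\tMonera\t|\tin-part\t|\n')

# Modifying $0 makes awk re-split the record with FS, so the fields seen by
# print are the cleaned ones. The trailing tab-pipe produces one empty
# field at the end, which is why the real fields are 1 to NF-1.
result=$(printf '%s\n' "$sample" |
  awk -F'\t[|]' -v OFS='\t' '{
    sub(/^\t/, "")          # drop a leading tab, if any
    gsub(/[|]\t/, "|")      # drop the tab that follows each pipe
    print NF - 1, $1, $2, $(NF - 1)
  }')

printf '%s\n' "$result"
```

Both rows report 4 real fields, and the empty third field of the first row is kept as an empty field rather than shifting the later columns.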

Upvotes: 1

devnull

Reputation: 123478

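awk treats a multi-character field separator as a regular expression, and in a regular expression the pipe | means alternation. So \t|\t is read as "\t or \t", i.e. a single tab, which is why the pipe shows up as a field of its own. Put the pipe in a bracket expression to tell awk to interpret it literally.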

$ awk -F'\t[|]\t' '{print $1 "\t" $2}' nodes.dmp
1   all
1   root
2   Bacteria
2   Monera
2   Procaryotae
2   Prokaryota
2   Prokaryotae
2   bacteria
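For a side-by-side comparison, here is a minimal sketch that runs both separators over the same made-up rows (the sample data and the variable names are assumptions for illustration, not the real nodes.dmp):

```shell
# Hypothetical rows in the tab-pipe-tab layout from the question.
sample=$(printf '1\t|\tall\t|\t\t|\tsynonym\t|\n2\t|\tBacteria\t|\tBacteria\t|\tscientific name\t|\n')

# Unbracketed: the FS is the regex "tab or tab", so the record is split on
# every single tab and $2 is the bare pipe.
broken=$(printf '%s\n' "$sample" | awk -F'\t|\t' '{print $1 "\t" $2}')

# Bracketed: [|] is a literal pipe, so the FS is tab-pipe-tab and $2 is
# the name column.
fixed=$(printf '%s\n' "$sample" | awk -F'\t[|]\t' '{print $1 "\t" $2}')

printf '%s\n' "$broken"
printf '%s\n' "$fixed"
```

The first command reproduces the "1 |" output from the question, and the second prints the id and name columns.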

Upvotes: 6
