incorrect count of unique text in awk

Question

I am getting the wrong count from the awk command below. It is supposed to count the unique text in $5 before the -.

input

chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    1   15
chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    2   16
chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    3   16
chr1    1267394 1268196 chr1:1267394-1268196    TAS1R3-46|gc=68.2   553 567
chr1    1267394 1268196 chr1:1267394-1268196    TAS1R3-46|gc=68.2   554 569
chr1    9781175 9781316 chr1:9781175-9781316    PIK3CD-276|gc=63.1  46  203
chr1    9781175 9781316 chr1:9781175-9781316    PIK3CD-276|gc=63.1  47  206
chr1    9781175 9781316 chr1:9781175-9781316    PIK3CD-276|gc=63.1  48  206
chr1    9781175 9781316 chr1:9781175-9781316    PIK3CD-276|gc=63.1  49  207

current output

desired output (AGRN, TAS1R3, and PIK3CD are unique, so the count is 3)

awk

awk -F '[- ]' '!seen[$6]++ {n++} END {print n}' file

mklement0 · Accepted Answer

Try

awk -F '-| +' '!seen[$6]++ {n++} END {print n}' file

Your problem is that when ' ' (a space) is included as part of a regex to form FS (via -F), it loses its special default behavior and matches each space individually as a separator. That is, the default behavior of treating a run of whitespace (any mix of spaces and tabs) as a single separator no longer applies.

Thus, [- ] won't do as the field separator, because it recognizes the empty strings between adjacent spaces as empty fields.
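To make the empty fields visible, here is a minimal sketch on a made-up line (`a  b-c`, with two spaces after the `a`). The helper name `show_fields` and the sample line are assumptions for illustration only; they are not part of the question's data.

```shell
# Hypothetical helper: print every field of one input line wrapped in [ ],
# so that empty fields show up as [].
#   $1 = field separator passed to awk -F
#   $2 = the input line
show_fields() {
    printf '%s\n' "$2" |
        awk -F "$1" '{ for (i = 1; i <= NF; i++) printf "[%s]", $i; print "" }'
}

# Each space is its own separator, so an empty field appears between the two spaces.
show_fields '[- ]' 'a  b-c'   # prints [a][][b][c]

# " +" consumes the whole run of spaces, so no empty field is produced.
show_fields '-| +' 'a  b-c'   # prints [a][b][c]
```

The same effect, repeated at every multi-space gap in the real input, is what inflates the field count and shifts the gene name away from $6.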

You can verify this by printing the field count; based on your intended parsing, you're expecting 9 fields:

$ awk -F '[- ]' '{ print NF }' file
17  # !! 8 extra fields - empty fields

$ awk -F '-| +' '{ print NF }' file
9   # OK, thanks to modified regex

You need the alternation -| + so that a run of spaces is treated as a single separator; if tabs should also be matched, use '-|[[:blank:]]+'.
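As a self-contained check of the tab-tolerant variant, the sketch below runs it against a few rows copied from the question. The variable `sample_rows` and the wrapper `count_unique` are hypothetical names introduced here, and printing `n + 0` (so that empty input yields 0 rather than a blank line) is a small addition that is not in the original command.

```shell
# A few rows copied from the question's input (AGRN appears twice on purpose,
# so that deduplication is actually exercised).
sample_rows='chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    1   15
chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    2   16
chr1    1267394 1268196 chr1:1267394-1268196    TAS1R3-46|gc=68.2   553 567
chr1    9781175 9781316 chr1:9781175-9781316    PIK3CD-276|gc=63.1  46  203'

# Hypothetical wrapper: split on "-" or on runs of spaces/tabs,
# then count the distinct values seen in $6.
# Reads stdin, or any files passed as arguments.
count_unique() {
    awk -F '-|[[:blank:]]+' '!seen[$6]++ { n++ } END { print n + 0 }' "$@"
}

printf '%s\n' "$sample_rows" | count_unique   # prints 3
```

Note that the $6 index relies on column 4 containing exactly one -, as it does in the sample rows; a different number of dashes there would shift the gene name to another field.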
