justaguy

Reputation: 3022

Incorrect count of unique text in awk

I am getting the wrong count from the awk command below. It is supposed to count the unique strings in $5 that precede the -.

input

chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    1   15
chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    2   16
chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    3   16
chr1    1267394 1268196 chr1:1267394-1268196    TAS1R3-46|gc=68.2   553 567
chr1    1267394 1268196 chr1:1267394-1268196    TAS1R3-46|gc=68.2   554 569
chr1    9781175 9781316 chr1:9781175-9781316    PIK3CD-276|gc=63.1  46  203
chr1    9781175 9781316 chr1:9781175-9781316    PIK3CD-276|gc=63.1  47  206
chr1    9781175 9781316 chr1:9781175-9781316    PIK3CD-276|gc=63.1  48  206
chr1    9781175 9781316 chr1:9781175-9781316    PIK3CD-276|gc=63.1  49  207

current output

1

desired output (AGRN, TAS1R3, and PIK3CD are the unique names, so the count is 3)

3

awk

awk -F '[- ]' '!seen[$6]++ {n++} END {print n}' file

Upvotes: 3

Views: 85

Answers (2)

peak

Reputation: 116880

Including "-" in FS might be fine in some cases, but in general, if the actual field separator is something else (e.g. whitespace, as seems to be the case here, or perhaps a tab), it is far better to set FS according to the specification of the file format. In any case, it's easy to extract the subfield of interest. In the following, I'll assume FS is whitespace.

awk '{split($5, a, "-"); if (!(count[a[1]]++)) n++ }
     END {print n}' file

If you want the details:

awk '{split($5, a, "-"); count[a[1]]++}
     END { for(i in count) {print i, count[i]}}' file

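Output of the second command (awk does not specify the order of a for-in loop over an array, so the lines may appear in a different order):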

AGRN 3
PIK3CD 4
TAS1R3 2
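If the file turns out to be tab-delimited rather than space-padded, the same split-based approach works with an explicit FS. The following is only a sketch under that assumption: the sample rows are a tab-separated copy of a few lines from the question, and the names sample, seen, and count are made up for the illustration.

```shell
# Hypothetical tab-delimited sample in the same layout as the question's input.
sample=$(mktemp)
printf 'chr1\t955543\t955763\tchr1:955543-955763\tAGRN-6|gc=75\t1\t15\n'  > "$sample"
printf 'chr1\t955543\t955763\tchr1:955543-955763\tAGRN-6|gc=75\t2\t16\n' >> "$sample"
printf 'chr1\t1267394\t1268196\tchr1:1267394-1268196\tTAS1R3-46|gc=68.2\t553\t567\n' >> "$sample"
printf 'chr1\t9781175\t9781316\tchr1:9781175-9781316\tPIK3CD-276|gc=63.1\t46\t203\n' >> "$sample"

# FS is a single tab, so $5 is the whole "NAME-N|gc=..." column.
# split() puts the text before the first "-" into a[1].
count=$(awk -F '\t' '
    { split($5, a, "-") }
    !seen[a[1]]++ { n++ }      # count each name the first time it appears
    END { print n + 0 }        # n + 0 prints 0 rather than an empty line on empty input
' "$sample")

echo "$count"   # prints 3
rm -f "$sample"
```

Because FS never contains "-", the hyphen inside $4 (chr1:955543-955763) cannot shift the field numbering.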

Upvotes: 2

mklement0

Reputation: 439417

Try:

awk -F '-| +' '!seen[$6]++ {n++} END {print n}' file

Your problem is that when ' ' (a space) is included as part of a regex to form FS (via -F), it loses its special default-value behavior and matches only individual spaces as separators. That is, the default behavior of treating runs of whitespace (any mix of spaces and tabs) as a single separator no longer applies.

Thus, [- ] won't do as the field separator, because it recognizes the empty strings between adjacent spaces as empty fields.

You can verify this by printing the field count - based on your intended parsing, you're expecting 9 fields:

$ awk -F '[- ]' '{ print NF }' file
17  # !! 8 extra fields - empty fields

$ awk -F '-| +' '{ print NF }' file
9   # OK, thanks to modified regex

You need the alternation -| + to ensure that runs of spaces are treated as a single separator; if tabs should also be matched, use '-|[[:blank:]]+'.
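A minimal, self-contained way to see the difference, assuming a made-up one-line sample with two spaces between the columns (the names sample, nf_class, nf_alt, and nf_tab exist only for this illustration). The last command uses [ \t]+, which should behave like [[:blank:]]+ for spaces and tabs and also works in awk versions without POSIX character classes.

```shell
# Hypothetical sample line: two spaces between the columns, one hyphen.
sample=$(mktemp)
printf 'chr1  AGRN-6\n' > "$sample"

# Bracket expression: every single space or hyphen is a separator,
# so the gap between the two spaces becomes an empty field.
nf_class=$(awk -F '[- ]' '{ print NF }' "$sample")

# Alternation: a hyphen, or a whole run of spaces, is one separator.
nf_alt=$(awk -F '-| +' '{ print NF }' "$sample")

# Same idea, but a run may mix spaces and tabs.
nf_tab=$(printf 'chr1 \tAGRN-6\n' | awk -F '-|[ \t]+' '{ print NF }')

echo "$nf_class $nf_alt $nf_tab"   # prints: 4 3 3
rm -f "$sample"
```

Checking NF like this before relying on a field number such as $6 is a cheap sanity test for any non-default FS.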

Upvotes: 7
