Reputation: 3022
I am getting the wrong counts using the awk below. The unique text in $5 before the - is supposed to be counted.
input
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 553 567
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 554 569
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 46 203
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 47 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 48 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 49 207
current output
1
desired output (AGRN, TAS1R3, and PIK3CD are unique and counted)
3
awk
awk -F '[- ]' '!seen[$6]++ {n++} END {print n}' file
Upvotes: 3
Views: 85
Reputation: 116880
Including "-" in FS might be fine in some cases, but in general if the actual field separator is something else (e.g. whitespace, as seems to be the case here, or perhaps a tab), it would be far better to set FS according to the specification of the file format. In any case, it's easy to extract the subfield of interest. In the following, I'll assume the FS is whitespace.
awk '{split($5, a, "-"); if (!(count[a[1]]++)) n++ }
     END {print n}' file
If you want the details:
awk '{split($5, a, "-"); count[a[1]]++}
     END { for(i in count) {print i, count[i]}}' file
Output of the second incantation:
AGRN 3
PIK3CD 4
TAS1R3 2
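Note that the order in which `for (i in count)` visits the keys is unspecified in awk, so the lines may come out in a different order on another implementation. A minimal sketch that pipes the result through `sort` for stable output, with the question's sample rows held in a shell variable (the names `input` and `details` are only for illustration):

```shell
# Sample rows copied from the question; "input" and "details" are illustrative names.
input='chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 553 567
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 554 569
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 46 203
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 47 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 48 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 49 207'

# Split $5 on "-", tally the part before it, then sort the per-name counts.
details=$(printf '%s\n' "$input" |
  awk '{split($5, a, "-"); count[a[1]]++}
       END {for (i in count) print i, count[i]}' |
  LC_ALL=C sort)

printf '%s\n' "$details"
```

Because the default whitespace FS is left alone here, the result does not depend on how many spaces or tabs separate the columns.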
Upvotes: 2
Reputation: 439417
Try
awk -F '-| +' '!seen[$6]++ {n++} END {print n}' file
Your problem is that when ' ' (a space) is included as part of a regex to form FS (via -F) it loses its special default-value behavior, and only matches spaces individually as separators.
That is, the default behavior of recognizing runs of whitespace (any mix of spaces and tabs) as a single separator no longer applies.
Thus, [- ] won't do as the field separator, because it recognizes the empty strings between adjacent spaces as empty fields.
You can verify this by printing the field count - based on your intended parsing, you're expecting 9 fields:
$ awk -F '[- ]' '{ print NF }' file
17 # !! 8 extra fields - empty fields
$ awk -F '-| +' '{ print NF }' file
9 # OK, thanks to modified regex
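The same effect can be reproduced without the data file. The following is a small sketch on a made-up line, `a  b`, with two spaces between the letters (the line and the variable names are assumptions for illustration only):

```shell
# Hypothetical test line: "a" and "b" separated by TWO spaces.
line='a  b'

# Default FS: a run of whitespace counts as one separator.
nf_default=$(printf '%s\n' "$line" | awk '{ print NF }')

# Space inside a bracket expression: each space separates on its own,
# so the empty string between the two spaces becomes a field.
nf_class=$(printf '%s\n' "$line" | awk -F '[- ]' '{ print NF }')

# Alternation with " +": a run of spaces is again a single separator.
nf_alt=$(printf '%s\n' "$line" | awk -F '-| +' '{ print NF }')

echo "$nf_default $nf_class $nf_alt"   # prints: 2 3 2
```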
You need alternation -| + to ensure that runs of spaces are treated as a single separator; if tabs should also be matched, use '-|[[:blank:]]+'.
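As a rough check of the tab-tolerant variant, here is a sketch on three tab-separated rows. The coordinates `1` and `2` are placeholders rather than real data, and `rows` and `count` are illustrative names. It also assumes an awk that supports POSIX character classes such as `[[:blank:]]`; very old mawk builds may not.

```shell
# Three hypothetical tab-separated rows; only the gene-name column varies.
# printf reuses the format string once per argument.
rows=$(printf 'chr1\t1\t2\tchr1:1-2\t%s\t1\t15\n' \
  'AGRN-6|gc=75' 'AGRN-6|gc=75' 'TAS1R3-46|gc=68.2')

# Split on "-" or on any run of spaces/tabs; $6 is then the gene name.
count=$(printf '%s\n' "$rows" |
  awk -F '-|[[:blank:]]+' '!seen[$6]++ {n++} END {print n}')

echo "$count"   # prints: 2 (AGRN and TAS1R3)
```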
Upvotes: 7