Reputation: 750
I am trying to count the number of distinct values in field 12 of a file using gawk 4.1.4, and also count the number of times each of those values occurs. I have two short programs which are giving me different answers for the first question, and I am at a loss to explain why.
{if(a[$12]++==1){count++}} END {print count}
...gives a result of 435,176, whereas
{a[$12]++} END {for (i in a){count++};print count}
...gives a result of 599,845.
Can you explain this behaviour, and tell me which value is correct? I am running under Windows (ezwinport) and the field separator is tab.
Upvotes: 2
Views: 71
Reputation: 37464
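The first one is wrong (logically, not syntactically; thank you for emphasizing that, @GeorgeVasiliou), because the ++ has to come before the ==, as in ++a[$1]==1. The post-increment form a[$12]++==1 compares the old value of the element, so the test is true only when a value is seen for the second time. With the pre-increment instead: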
$ awk '{if(++a[$1]==1){count++}} END {print count}' foo
3
Oh yeah, my test file foo:
$ cat foo
1
1
1
2
2
3
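To see the two tests side by side, here is a small sketch that pipes the same six sample lines through both forms (it assumes a POSIX shell for the quoting; the variable names post and pre are only illustrative). The post-increment test fires on the second occurrence of a value, so it counts values that appear at least twice. The pre-increment test fires on the first occurrence, so it counts distinct values.

```shell
# Sample data: 1 occurs three times, 2 occurs twice, 3 occurs once.
data='1 1 1 2 2 3'

# Post-increment: compares the old value, so it is true on the 2nd occurrence only.
post=$(printf '%s\n' $data | awk '{ if (a[$1]++ == 1) count++ } END { print count + 0 }')

# Pre-increment: compares the new value, so it is true on the 1st occurrence only.
pre=$(printf '%s\n' $data | awk '{ if (++a[$1] == 1) count++ } END { print count + 0 }')

echo "post-increment: $post"
echo "pre-increment:  $pre"
```

On this sample the post-increment form reports 2 (the values 1 and 2 repeat) and the pre-increment form reports 3 (three distinct values).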
Upvotes: 2
Reputation: 85895
The second one is right. The per-value counts are already stored in the array, though, so you don't need a separate variable for them.
In both programs, count is a single variable shared by all values; it is not tracked per unique value. For per-value counts, use the values in the array itself.
The logic for deriving count in
{if(a[$12]++==1){count++}} END {print count}
is wrong: because of the post-increment operator, count is incremented only when a value of $12 occurs for the second time. That is why you are seeing the smaller number in your output.
On the other hand,
{a[$12]++} END {for (i in a){count++};print count}
is almost right for counting distinct values, but you don't need count for the per-value totals: they are already stored as the values of the array a, indexed by each unique value of $12. You can print them directly with
{a[$12]++; next} END {for (i in a) print a[i]}
A small example to demonstrate this:
cat file
1 2 3
1 2 3
1 2 1
1 1 1
2 3 1
3 4 1
Suppose I am interested in the unique values in $2 and how often each one occurs. Running your first example:
awk '{if(a[$2]++==1){count++}}END {for (i in a) print i,a[i],count}' file
1 1 1
2 3 1
3 1 1
4 1 1
Note the value of count printed in the last column. It is the same on every line, because count is a single variable shared by all values rather than a per-value counter. It is 1 here only because 2 is the one value that occurs more than once.
The second approach looks good and prints count as 4, the number of distinct values, but it does not tell you how often each value occurs. To list each value with its number of occurrences, do:
awk '{a[$2]++; next}END {for (i in a) print i,a[i]}' file
1 1
2 3
3 1
4 1
Here, instead of count, a[i] holds the number of occurrences of each unique value in column 2.
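Putting both parts of the question together, the following is a minimal sketch that reports the number of distinct values in field 12 of a tab-separated file along with the count for each value. The sample rows, the make_sample helper, and the distinct variable are made up for illustration, and the single-quote quoting assumes a POSIX shell (it may need adapting for a Windows command prompt).

```shell
# Hypothetical 12-column, tab-separated sample; only column 12 varies.
make_sample() {
    printf 'x\tx\tx\tx\tx\tx\tx\tx\tx\tx\tx\t%s\n' apple pear apple apple fig
}

result=$(make_sample | awk -F'\t' '
    { if (++a[$12] == 1) distinct++ }   # first sighting of this value
    END {
        print "distinct", distinct + 0  # number of distinct values
        for (v in a) print v, a[v]      # each value and its occurrence count
    }' | sort)

echo "$result"
```

The order of a for (v in a) loop is unspecified in awk, which is why the output is piped through sort here. For the real data, replace make_sample with the file name as the last argument to awk.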
Upvotes: 2