Reputation: 750

gawk behaviour I don't understand

I am trying to count the number of distinct values in field 12 of a file using gawk 4.1.4, and also count the number of times each of those values occurs. I have two short programs which are giving me different answers for the first question, and I am at a loss to explain why.

{if(a[$12]++==1){count++}} END {print count}

...gives a result of 435,176, whereas

{a[$12]++} END {for (i in a){count++};print count}

...gives a result of 599,845.

Can you explain this behaviour, and tell me which value is correct? I am running under Windows (ezwinport) and the field separator is tab.

Upvotes: 2

Answers (2)

James Brown

Reputation: 37464

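The first one is wrong (logically, not syntactically; thank you for emphasizing that, @GeorgeVasiliou), because you need to increment before comparing, i.e. ++ before ==: ++a[$1]==1. With the post-increment, the comparison sees the old value, so it is true on the second occurrence of a value rather than the first: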

$ awk '{if(++a[$1]==1){count++}} END {print count}' foo
3

Oh yeah, my test foo:

$ cat foo
1
1
1
2
2
3
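To make the difference visible, here is a small sketch that runs both forms over the same six lines as foo (piped in with printf rather than read from a file, which is my own shortcut). The variable names post, pre and zero are only for illustration, and `count + 0` is there so that an empty count prints as 0.

```shell
# Same six values as the sample file foo above, generated inline.
sample() { printf '1\n1\n1\n2\n2\n3\n'; }

# Post-increment: the comparison sees the OLD value, so this is true
# only on the second occurrence of each value.
post=$(sample | awk '{ if (a[$1]++ == 1) count++ } END { print count + 0 }')

# Pre-increment: the comparison sees the NEW value, so this is true
# on the first occurrence of each value.
pre=$(sample | awk '{ if (++a[$1] == 1) count++ } END { print count + 0 }')

# Keeping the post-increment but comparing with 0 is equivalent.
zero=$(sample | awk '{ if (a[$1]++ == 0) count++ } END { print count + 0 }')

echo "post=$post pre=$pre zero=$zero"
```

Here the post-increment form reports 2, because only the values 1 and 2 occur more than once, while the other two report 3. Applied to the question, the gap between the two results would be the number of values that occur exactly once.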

Upvotes: 2

Inian

Reputation: 85895

The second one gives the right number of distinct values. For the number of occurrences of each value you don't need a separate variable, because the array already stores it.

In both programs count is a single running total; it is not tracked per unique value.

Use the value from the array itself.

The logic in deriving count

{if(a[$12]++==1){count++}} END {print count}

is wrong: with the post-increment operator the comparison sees the value before the increment, so count goes up only when a value of $12 occurs for the second time. Values that occur only once are never counted, hence the lower number you are seeing in your output.

On the other hand,

{a[$12]++} END {for (i in a){count++};print count}

is almost right, but for the per-value counts you don't need a count variable: each count is already stored as the value in the array a, indexed by the unique value of $12. To print those counts, use

{a[$12]++; next} END {for (i in a) print a[i]}

A small example to demonstrate, assuming I want the unique values in $2 and their occurrence counts. With your first approach:

awk '{if(a[$2]++==1){count++}}END {for (i in a) print i,a[i],count}' file
1 1 1
2 3 1
3 1 1
4 1 1

Note the value of count printed in the last column: it is the same on every line, because count is one variable shared by all values, not a count per value.

The second approach looks fine, but it prints a count of 4, which is the number of distinct values and says nothing about how often each one occurs. To get each value together with its count, the right way would be

awk '{a[$2]++; next}END {for (i in a) print i,a[i]}' file
1 1
2 3
3 1
4 1

Here, instead of count, a[i] holds the number of occurrences of each unique value in column 2.
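If you want both answers from a single pass, something like the following sketch should work. The tab-separated sample data and the names seen, distinct and result are made up for illustration; for the real file you would use $12 instead of $2. The output is piped through sort only because the order of `for (v in seen)` is unspecified in awk.

```shell
# Hypothetical tab-separated input: field 2 holds 1, 2, 2, 2, 3, 4.
sample() { printf 'a\t1\nb\t2\nc\t2\nd\t2\ne\t3\nf\t4\n'; }

result=$(sample | awk -F'\t' '
    # !seen[$2]++ is true only the first time a value appears,
    # so distinct ends up as the number of unique values.
    { if (!seen[$2]++) distinct++ }
    END {
        print "distinct", distinct + 0
        # seen[v] is the number of times value v occurred.
        for (v in seen) print v, seen[v]
    }' | LC_ALL=C sort)

printf '%s\n' "$result"
```

This prints one line per value with its count (for example `2 3`) plus a `distinct 4` line, so the number of unique values and the per-value occurrences come out of the same array.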

Upvotes: 2
