How do associative arrays work in awk?

Question

I wanted to remove duplicate lines from a file based on a column. A quick search led me to this page, which had the following solution:

awk '!x[$1]++' filename

It works, but I am not sure how. I know it uses associative arrays in awk, but I am not able to infer anything beyond that.

Update:

Thanks, everyone, for the explanation. With my new knowledge, I have written a blog post with further explanation of how it works.

Mark Wilkins · Accepted Answer

The awk script !x[$1]++ fills an array named x. Suppose the first word in a line of text is line1 ($1 refers to the first field of the current line). The script then effectively performs this operation on the array:

x["line1"]++

The "index" (the key) of the array is the text encountered in the file (line1 in this example), and the value associated with that key is an integer that is incremented by 1.
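To make the key/value relationship concrete, here is a minimal sketch that only fills the array and then prints it. The function name count_first_field and the sample input are made up for illustration; the array x and the key $1 are the ones from the one-liner.

```shell
# Count how often each first field occurs, then print key and count.
count_first_field() {
    awk '
        { x[$1]++ }                              # key: first field, value: running count
        END { for (key in x) print key, x[key] }
    ' | sort    # for-in order is unspecified in awk, so sort for stable output
}

# Hypothetical sample input: "a" appears twice, "b" once.
printf 'a 1\nb 2\na 3\n' | count_first_field
```

With this input it should print "a 2" and "b 1": each distinct first field is a key, and its value is the number of times it has been seen.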

When a first word is encountered for the first time, its value in the array is zero (awk treats an unset element as 0 in a numeric context). The post-increment x[$1]++ yields that old value, 0, and then stores 1. The not operator ! turns the 0 into 1 (true), and because a pattern with no action prints the line by default, the line is printed. The next time the same first word is encountered, the value in the array is non-zero, so the not operation results in zero (false) and the line is not printed.
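The evaluation can be traced line by line with a small sketch. The function name trace_dedup, the variables before and keep, and the sample input are hypothetical; only the expression !x[$1]++ comes from the question.

```shell
# Show, for every input line, the stored count and the result of !x[$1]++.
trace_dedup() {
    awk '{
        before = x[$1] + 0      # stored count before this line (0 for a new key)
        keep = !x[$1]++         # the same expression as the one-liner
        print $1, "before=" before, "keep=" keep
    }'
}

# Hypothetical sample input: "a" and "b" each appear twice, "c" once.
printf 'a 1\nb 2\na 3\nb 4\nc 5\n' | trace_dedup
```

For this input the trace should read:

a before=0 keep=1
b before=0 keep=1
a before=1 keep=0
b before=1 keep=0
c before=0 keep=1

Only the lines with keep=1 would be printed by the original one-liner.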

A less "clever" way of writing the same thing (but possibly more clear and less fun) would be this:

{
    if (x[$1] == 0)    # first time this first field has been seen
        print
    x[$1]++            # count this occurrence
}
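One way to check that the two forms behave the same is to run both on the same input. This is a sketch only: the function names dedup_short, dedup_long, and sample_input, and the sample data, are invented for the comparison.

```shell
# The one-liner from the question.
dedup_short() {
    awk '!x[$1]++'
}

# The longer form with an explicit if.
dedup_long() {
    awk '{
        if (x[$1] == 0)
            print
        x[$1]++
    }'
}

# Hypothetical sample input: "a" and "b" repeat in the first column.
sample_input() {
    printf 'a 1\nb 2\na 3\nc 4\nb 5\n'
}

sample_input | dedup_short
sample_input | dedup_long
```

Both pipelines should print the same three lines, "a 1", "b 2", and "c 4", which are the first occurrences of each first field in their original order.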
