Reputation: 20000
I wanted to remove duplicate lines from a file based on a column. A quick search led me to this page, which had the following solution:
awk '!x[$1]++' filename
It works, but I am not sure how it works. I know it uses associative arrays in awk, but I am not able to infer anything beyond that.
Update:
Thanks, everyone, for the explanations. With my new knowledge, I have written a blog post with a further explanation of how it works.
Upvotes: 3
Views: 586
Reputation: 41252
The awk script !x[$1]++ fills an array named x. Suppose the first word of a line is line1 ($1 refers to the first whitespace-separated field of the current line). The script then effectively performs this operation on the array:
x["line1"]++
The "index" (the key) of the array is the text encountered in the file (line1 in this example), and the value associated with that key is an integer that is incremented by 1 each time that key is seen.
When a first word is encountered for the first time, its entry in the array does not exist yet, and awk treats it as zero. The post-increment x[$1]++ yields that old value (zero) and then stores 1. The not operator ! turns zero into non-zero (true), and because an awk pattern with no action defaults to printing the line, the line is printed. The next time the same first word is encountered, the value in the array is already non-zero, so the not operation yields zero (false) and the line is not printed.
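To make the counter visible, here is a small sketch that prints, for each input line, the first field, the value stored in x before the increment, and the result of the not operation. The sample input and the function names sample and trace_counter are made up for illustration; they are not part of the original one-liner.

```shell
# Hypothetical sample input: the first field "a" appears twice.
sample() { printf 'a 1\nb 2\na 3\n'; }

# Print: first field, counter value before the increment, keep flag (1 = printed).
# The parentheses only make explicit how awk already parses !x[$1]++.
trace_counter() {
    awk '{ before = x[$1] + 0; keep = !(x[$1]++); print $1, before, keep }'
}

sample | trace_counter
# a 0 1
# b 0 1
# a 1 0
```

The third line shows the duplicate: the counter for "a" is already 1, so the keep flag is 0 and the original one-liner would skip that line.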
A less "clever" way of writing the same thing (possibly clearer, but less fun) would be this:
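{
    # Print the line only if this first field has not been seen before
    if (x[$1] == 0)
        print
    # Count this occurrence of the first field
    x[$1]++
}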
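As a quick sanity check, the short and long forms can be run against the same input and compared. This is only a sketch on made-up sample data (the variable names input, short, and long are illustrative), not a proof of equivalence.

```shell
# Hypothetical input with repeated first fields.
input=$(printf 'a 1\nb 2\na 3\nb 4\nc 5\n')

# Short form from the question.
short=$(printf '%s\n' "$input" | awk '!x[$1]++')

# Long form with an explicit if and a separate increment.
long=$(printf '%s\n' "$input" | awk '{ if (x[$1] == 0) print; x[$1]++ }')

[ "$short" = "$long" ] && echo "same output"
printf '%s\n' "$short"
# same output
# a 1
# b 2
# c 5
```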
Upvotes: 5