papafe

Reputation: 3070

Awk counting occurrences strange behaviour

I need to count the occurrences of the elements in the second column across a large number of files. The script I'm using is this:

{
    # count occurrences of each value in the second column
    el[$2]++
}
END {
    # append the final counts to rank.txt
    for (i in el) {
        print i, el[i] >> "rank.txt"
    }
}

To run it over a large number of files I'm using find | xargs like this:

find . -name "*.txt" | xargs awk -f script.awk

The problem is that if I count the number of lines of the output file rank.txt (with wc -l rank.txt), the number I get (for example 7600) is bigger than the number of unique elements in the second column (for example 7300), which I obtain with:

find . -name "*.txt" | xargs awk '{print $2}' | sort | uniq | wc -l

In fact, running:

awk '{print $1}' rank.txt | sort | uniq | wc -l

I obtain the right number of elements (following the example, I get 7300). So the elements of the first column of the output file are not unique. But this shouldn't happen!

Upvotes: 1

Views: 411

Answers (2)

Tomas

Reputation: 59545

This is probably a combination of two things: the input files (*.txt) contain non-unique elements, and the way xargs works. Remember that when there is a large number of files, xargs calls the command repeatedly, each time with a different set of arguments. This means that in your first example some of the files are not processed in the same awk run; since the script appends to rank.txt with >>, each run writes its own counts, so the same element can appear once per run, which results in a higher number of "unique" elements in the output.
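To see the batching in isolation, here is a minimal sketch (assuming a POSIX xargs; -n 2 forces tiny batches, and echo stands in for awk so each invocation becomes visible):

printf '%s\n' a.txt b.txt c.txt d.txt e.txt | xargs -n 2 echo awk -f script.awk

which prints:

awk -f script.awk a.txt b.txt
awk -f script.awk c.txt d.txt
awk -f script.awk e.txt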

You could try this:

find . -name "*.txt" | xargs cat | awk -f script.awk
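This works because cat simply concatenates its input, so even if xargs ends up running cat several times, awk still sees a single stream and its END block runs exactly once. An equivalent form that avoids the pipe to xargs altogether (a sketch, relying only on POSIX find's -exec ... + batching) would be:

find . -name "*.txt" -exec cat {} + | awk -f script.awk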

Upvotes: 5

shellter

Reputation: 37298

You can find out where the duplicates in $1 are by using:

find . -name "*.txt" | xargs awk '{print $2}' | sort | uniq -c | awk '$1 > 1 {print}'

I don't have a way to test this right now; the intent of the last awk is to filter the output of uniq -c so that it shows only the records with a count greater than one.
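As a variant of the same idea, you could also check rank.txt itself (a sketch, assuming the rank.txt produced by the question's script; uniq -d prints only duplicated lines):

awk '{print $1}' rank.txt | sort | uniq -d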

I hope this helps.

Upvotes: 0
