Reputation: 3070
I need to count the occurrences of each value in the second column across a large number of files. The script I'm using is this:
# count how many times each value in the second column appears
{
    el[$2]++
}

# after the last input line, append one "value count" line per key to rank.txt
END {
    for (i in el) {
        print i, el[i] >> "rank.txt"
    }
}
To run it over a large number of files I'm using find | xargs, like this:
find . -name "*.txt" | xargs awk -f script.awk
The problem is that if I count the number of lines of the output file rank.txt (with wc -l rank.txt), the number I get (for example 7600) is bigger than the number of unique elements in the second column (for example 7300), which I obtain with:
find . -name "*.txt" | xargs awk '{print $2}' | sort | uniq | wc -l
In fact, running:
awk '{print $1}' rank.txt | sort | uniq | wc -l
I obtain the right number of elements (following the example, I get 7300). So the elements in the first column of the output file are not unique, but this shouldn't happen!
Upvotes: 1
Views: 411
Reputation: 59545
This is probably a combination of the fact that the input files (*.txt) contain non-unique elements and the way xargs works.
Remember that when there is a large number of files, xargs calls the command (here awk) repeatedly, each time with a different subset of the arguments. This means that in the first example, if there are enough files, not all of them are processed in a single awk run. Each run keeps its own el array and appends its own counts to rank.txt, so the same key can show up once per batch, which results in a higher number of "unique" elements in the output.
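For example (a minimal sketch; the file names are made up), -n 2 forces xargs to pass at most two arguments per invocation, so the command runs three separate times instead of once:
printf '%s\n' a.txt b.txt c.txt d.txt e.txt | xargs -n 2 echo
# a.txt b.txt
# c.txt d.txt
# e.txt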
You could try this:
find . -name "*.txt" | xargs cat | awk -f script.awk
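If re-running everything is not an option, a small post-processing sketch (assuming rank.txt contains the partial counts appended by the separate awk runs; rank_merged.txt is just a hypothetical output name) is to sum the counts per key:
# merge duplicate keys by summing their partial counts
awk '{count[$1] += $2} END {for (k in count) print k, count[k]}' rank.txt > rank_merged.txt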
Upvotes: 5
Reputation: 37298
You can find out where the duplicates in $2 (the values that end up more than once in $1 of rank.txt) are by using
find . -name "*.txt" | xargs awk '{print $2}' | sort | uniq -c | awk '$1 > 1 {print}'
I don't have a way to test this right now; the intent of the last awk is to filter the output of uniq -c
to show only records that have a count greater than one.
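An equivalent check (again untested, just a sketch) is to let uniq -d print only the values that occur more than once and count them:
find . -name "*.txt" | xargs awk '{print $2}' | sort | uniq -d | wc -l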
I hope this helps.
Upvotes: 0