Reputation: 894
Reading the title, you'll think you've seen this question a hundred times, and you have, but I'm looking for something different:
The common answer is
sort <input> | uniq -c | sort -nr
But when the input is tens of millions of lines, sort becomes impractical. Sort is an O(n log n) algorithm. It can be parallelized, but it still requires O(n) memory.
I am looking for an algorithm that does this counting much more cheaply, under the following assumptions: the number of distinct log message types is much smaller than the number of lines in the log file (thousands), and I am only interested in the top 50 recurring messages.
Upvotes: 1
Views: 108
Reputation: 531460
You can use awk to count occurrences with an associative array (a hash-style bucket count rather than a full sort):
awk '{ a[$0]++ } END { for (line in a) print a[line], line }' | sort -k1,1nr | head -50
The awk command counts the occurrences of each unique line and, in the END block, prints each distinct line prefixed with its count, in O(n) time. Because only the distinct lines reach the next stage, the sort runs over thousands of entries rather than tens of millions; it orders them by count in reverse numerical order, and head prints the 50 most frequent.
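For illustration, here is the same pipeline run against a tiny, made-up log file (the file name and messages are hypothetical, just to show the shape of the output):
$ cat sample.log   # hypothetical sample data, invented for this example
connection timed out
disk full
connection timed out
connection timed out
disk full
user login failed
$ awk '{ a[$0]++ } END { for (line in a) print a[line], line }' sample.log | sort -k1,1nr | head -50
3 connection timed out
2 disk full
1 user login failed
Only three distinct lines ever reach sort, no matter how many lines the input file has, which is why the pipeline stays fast even on very large logs.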
Upvotes: 1