dbbd

Reputation: 894

Counting top repeated lines in log

Reading the title will lead you to think, I saw this question a hundred times, and you did, but I'm looking for something different:

The common answer is

sort <input> | uniq -c | sort -nr

But when the input is tens of millions of lines, sort becomes impractical. Sort is an O(n log n) algorithm. It can be parallelized, but it still requires O(n) memory.

I am looking for an algorithm that can do this counting much better, given the following assumption: the number of distinct log message types is much smaller than the number of lines in the log file (thousands of types). I am interested in the top 50 recurring messages.

Upvotes: 1

Views: 108

Answers (1)

chepner

Reputation: 531460

You can use awk to implement a simple type of bucket sort:

awk '{ a[$0]++ } END { for (line in a) print a[line], line }' | sort -k1,1nr | head -50

The awk command counts occurrences of each unique line and outputs each line with its count, in O(n) time. The sort then simply sorts the output by count in reverse numerical order, and head outputs the 50 largest.
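To see the pipeline in action, here is a minimal sketch using a small hypothetical log (the file name and messages are made up for illustration; the original post does not include sample data):

```shell
# Build a tiny sample log with known frequencies.
{
  for i in 1 2 3; do echo "error: disk full"; done
  for i in 1 2; do echo "warn: retrying"; done
  echo "info: started"
} > sample.log

# One pass to count each distinct line via an associative array,
# then sort only the (few) distinct lines by count, descending.
awk '{ a[$0]++ } END { for (line in a) print a[line], line }' sample.log \
  | sort -k1,1nr | head -50
```

This prints `3 error: disk full`, `2 warn: retrying`, `1 info: started`. Since the final sort operates only on the distinct lines (thousands), not the full input (tens of millions), the overall cost is dominated by the single O(n) counting pass.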

Upvotes: 1
