dbbd

Reputation: 894

Counting top repeated lines in log

Reading the title will lead you to think, I saw this question a hundred times, and you did, but I'm looking for something different:

The common answer is

sort <input> | uniq -c | sort -nr

But when the input is tens of millions of lines, sort becomes impractical. Sort is an O(n log n) algorithm. It can be parallelized, but it still requires O(n) memory.

I am looking for an algorithm that can do this counting much better, given the following assumption: the number of distinct log message types is much smaller than the number of lines in the log file (thousands of types). I am interested in the top 50 recurring messages.

Upvotes: 1

Views: 108

Answers (1)

chepner

Reputation: 531460

You can use awk to implement a simple type of bucket sort:

awk '{ a[$0]++ } END { for (line in a) print a[line], line }' | sort -k1,1nr | head -50

The awk command counts occurrences of each unique line and outputs each line with its count, in O(n) time. The sort then simply sorts the output by count in reverse numerical order, and head outputs the 50 largest.
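To see the pipeline in action, here is a minimal sketch using a small hypothetical log (the file name and messages are made up for illustration; the original post does not include sample data):

```shell
# Build a tiny sample log with known frequencies.
{
  for i in 1 2 3; do echo "error: disk full"; done
  for i in 1 2; do echo "warn: retrying"; done
  echo "info: started"
} > sample.log

# One pass to count each distinct line via an associative array,
# then sort only the (few) distinct lines by count, descending.
awk '{ a[$0]++ } END { for (line in a) print a[line], line }' sample.log \
  | sort -k1,1nr | head -50
```

This prints `3 error: disk full`, `2 warn: retrying`, `1 info: started`. Since the final sort operates only on the distinct lines (thousands), not the full input (tens of millions), the overall cost is dominated by the single O(n) counting pass.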

Upvotes: 1
