Reputation: 421
I have a text file of about 20 million lines. Each line is 25 characters long. I estimate there are probably 200k-300k unique lines. What I want to find out is exactly how many unique lines there are, and how many occurrences there are of each (I expect the result to be power-law-esque).
I could do this:
sort bigfile|uniq -c |sort -nr > uniqcounts
wc -l uniqcounts
but that is horribly inefficient, both memory- and time-wise.
What is your best command line solution to this problem?
Upvotes: 3
Views: 1398
Reputation: 27990
With awk (use nawk or /usr/xpg4/bin/awk on Solaris):
awk 'END {
  for (k in _)
    print k, _[k]
}
{ _[$0]++ }
' infile
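To get output sorted by count, like the pipeline in the question, a variant that prints the count first can be fed straight to sort (a sketch; the same infile is assumed):
awk '{ _[$0]++ }
END { for (k in _) print _[k], k }' infile | sort -nr > uniqcounts
wc -l < uniqcounts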
Upvotes: 1
Reputation: 3051
I'm taking the risk of being considered off-topic and downvoted, but I must rant about this.
20 million * 25 chars = 500,000,000 bytes (assuming you don't mean Unicode)
That's about 500 MB of RAM, which is not a huge amount for a modern computer.
Please don't complain about this being horribly inefficient memory- and time-wise. The decision to store redundant data in a flat text file was inefficient and wrong.
Use a database (sqlite for example) instead of a flat file.
Use a table like
CREATE TABLE lines (line VARCHAR(25), occurrences INTEGER)
to store unique lines and their occurrence counts.
If it's not your app that generates this text file, complain to the developers about it!
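For example, with the sqlite3 command-line shell you could load the raw lines into a one-column table and let SQL do the counting (a sketch; the database, table, and file names are placeholders, and it assumes no line contains '|', sqlite3's default import separator):
sqlite3 lines.db <<'EOF'
CREATE TABLE raw (line VARCHAR(25));
.import bigfile raw
SELECT COUNT(DISTINCT line) FROM raw;
SELECT line, COUNT(*) AS occurrences FROM raw GROUP BY line ORDER BY occurrences DESC;
EOF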
Upvotes: 2
Reputation: 31708
Ensure you do this before testing your sort and uniq solution:
export LC_ALL=C
It would be good if you could compare this and the Perl solution, at least time-wise.
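One way to run that comparison (a sketch; count.pl stands in for the Perl script from the other answer):
export LC_ALL=C
time sh -c 'sort bigfile | uniq -c | sort -nr > uniqcounts'
time perl count.pl bigfile > uniqcounts.perl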
Upvotes: 1
Reputation: 33370
I tend to lean towards Perl when I have text-processing problems like this, especially since Perl is installed on most Unix systems. (You could do the same thing with awk, which is probably a bit more widely available.)
Something like this should do the trick:
#!/usr/bin/perl
while (<>) {
    chomp;
    $lines{$_}++;
}
print "Total unique lines: ", scalar(keys %lines), "\n";
foreach my $line (sort { $lines{$b} <=> $lines{$a} } keys %lines) {
    printf "%6d %s\n", $lines{$line}, $line;
}
(You could do this as a one-liner, but broken out makes it easier to read.)
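For the record, a rough one-liner version (a sketch; the output wording differs slightly from the script above):
perl -lne '$c{$_}++;
  END { print scalar(keys %c), " unique lines";
        printf "%6d %s\n", $c{$_}, $_ for sort { $c{$b} <=> $c{$a} } keys %c }' bigfile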
This requires O(u) memory for the hash, where u is the number of unique lines. Reading the file takes expected O(N) time for N input lines, since each line is a single hash lookup and increment (degenerate hash collisions would make it worse, but that's unlikely here). The final, optional sort runs over only the unique keys, so it adds roughly O(u log u) and should not dominate unless nearly every line is unique.
Upvotes: 6
Reputation: 11763
I'm not sure there's a better solution than the one you posted: O(n log(n) + n). The final "sort -nr" that you mention isn't strictly necessary given the problem statement, but it makes the output easier for humans to grok.
I'd be very interested if someone could come up with a solution that's faster than this (in complexity). Of course, writing a special-purpose program to do this same thing would likely be faster than using sort and uniq.
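If you stay with sort, GNU sort's buffer and parallelism options, combined with the LC_ALL=C tip from another answer, can cut the constant factor considerably (a sketch, assuming a reasonably recent GNU coreutils):
LC_ALL=C sort -S 1G --parallel=4 bigfile | uniq -c | sort -nr > uniqcounts
wc -l < uniqcounts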
Upvotes: 0