jedberg

Reputation: 421

Best way to determine uniqueness and repetition in a text file

I have a text file of about 20 million lines. Each line is 25 characters long. I estimate that there are probably about 200k-300k unique lines. What I want to find out is exactly how many unique lines there are, and how many occurrences of each line there are (I expect the result to be power-law-esque).

I could do this:

sort bigfile | uniq -c | sort -nr > uniqcounts
wc -l uniqcounts

but that is horribly inefficient memory and time-wise.

What is your best command line solution to this problem?

Upvotes: 3

Views: 1398

Answers (5)

Dimitre Radoulov

Reputation: 27990

With awk (use nawk or /usr/xpg4/bin/awk on Solaris):

awk '{ _[$0]++ }            # tally occurrences of each line
END {
  for (k in _)
    print k, _[k]           # unique line, then its count
}' infile
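
To get the same ranked-by-count view as the sort | uniq -c | sort -nr pipeline, plus the number of unique lines, you can pipe this through sort. A minimal sketch (it prints the count first so sort -nr can rank by it; file names are the same hypothetical ones as above):

awk '{ _[$0]++ } END { for (k in _) print _[k], k }' infile | sort -nr > uniqcounts
wc -l < uniqcounts    # number of unique lines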

Upvotes: 1

Anonymous

Reputation: 3051

I'm taking the risk of being considered off-topic and downvoted, but I must rant about this.

20 million lines * 25 chars = 500,000,000 bytes (assuming you don't mean Unicode)

That's about 500 MB of RAM, which is not a huge amount for a modern computer.

Please don't complain that this is horribly inefficient memory- and time-wise. The real inefficiency was the decision to store redundant data in a flat text file in the first place.

Use a database (sqlite for example) instead of a flat file.

Use a table like

CREATE TABLE lines (line VARCHAR(25), occurrences INTEGER)

to store the unique lines and their occurrence counts.
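
For what it's worth, here's a minimal sketch of doing this with the sqlite3 command-line shell. Database, table, and file names are illustrative, it assumes the 25-character lines contain no tab characters (tab is used as the import separator), and rather than maintaining a precomputed occurrences column it just loads the raw lines and counts them with GROUP BY:

sqlite3 counts.db <<'EOF'
CREATE TABLE lines (line VARCHAR(25));
.mode tabs
.import bigfile lines
SELECT COUNT(DISTINCT line) AS unique_lines FROM lines;
SELECT line, COUNT(*) AS occurrences
  FROM lines
  GROUP BY line
  ORDER BY occurrences DESC;
EOF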

If it's not your app that generates this text file, complain to the developers about it!

Upvotes: 2

pixelbeat

Reputation: 31708

Ensure you do this before testing your sort and uniq solution:

export LC_ALL=C

It would be good if you could at least compare this and the Perl solution time-wise.
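
For example, a rough comparison could look like this (illustrative; LC_ALL=C makes sort and uniq compare raw bytes instead of doing locale-aware collation, which is usually much faster):

# with the current locale
time sort bigfile | uniq -c | sort -nr > uniqcounts

# with byte-wise collation
export LC_ALL=C
time sort bigfile | uniq -c | sort -nr > uniqcounts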

Upvotes: 1

Commodore Jaeger

Reputation: 33370

I tend to lean towards Perl when I have text-processing problems like this, especially since Perl is installed on most Unix systems. (You could do much the same thing with awk, which is probably even more widely available.)

Something like this should do the trick:

#!/usr/bin/perl
use strict;
use warnings;

my %lines;
while (<>) {
    chomp;
    $lines{$_}++;                     # count occurrences of each distinct line
}

print "Total unique lines: ", scalar(keys %lines), "\n";
# print the counts in descending order (the power-law view)
foreach my $line (sort { $lines{$b} <=> $lines{$a} } keys %lines) {
    printf "%6d  %s\n", $lines{$line}, $line;
}

(You could do this as a one-liner, but broken out makes it easier to read.)
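
For example, it could be run like this (assuming the script is saved as count_lines.pl, a name chosen here for illustration):

perl count_lines.pl bigfile > uniqcounts
head -1 uniqcounts    # "Total unique lines: ..."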

This requires O(n) memory for the hash, where n is the number of unique lines. Building the hash takes expected O(N) time for N input lines, since each hash insertion is amortized constant time. The final sort of the unique keys is O(n log n) and is optional if you only need the counts; with a few hundred thousand unique lines it shouldn't dominate the runtime.

Upvotes: 6

slacy

Reputation: 11763

I'm not sure there's a better solution than the one you posted: O(n log n + n). The final "sort -nr" that you mention isn't strictly necessary given the problem statement, but it makes the output easier for humans to grok.

I'd be very interested if someone could come up with a solution that's faster than this (in complexity). Of course, writing a special-purpose program to do this same thing would likely be faster than using sort and uniq.
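
For reference, if the ranked output isn't needed, the pipeline from the question can simply drop the final sort; the number of unique lines is unchanged (same file names as the question):

sort bigfile | uniq -c > uniqcounts    # counts per unique line, unranked
wc -l < uniqcounts                     # number of unique lines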

Upvotes: 0
