oHo

Reputation: 54591

alternative for command line "sort | uniq -c | sort -n"

I have used sort | uniq -c | sort -n for years, but today it fails because my input file is 10 GB while my /tmp partition is only 1 GB:

sort: write failed: /tmp/sortmIGbL: No space left on device

Therefore I am looking for an efficient alternative for everyday use:

How to get the same result as sort | uniq -c | sort -nr | head on very large files?


input example

a
BB
ccccc
dddddddd
a
BB
a

one of the possible outputs

    3 a
    2 BB
    1 dddddddd
    1 ccccc

Upvotes: 3

Views: 3482

Answers (3)

Ed Morton

Reputation: 203807

With GNU awk for sorted associative arrays:

$ gawk '
    BEGIN{ PROCINFO["sorted_in"] = "@val_num_desc" }
    { a[$0]++ }
    END { for (i in a) print a[i], i }
' file
3 a
2 BB
1 dddddddd
1 ccccc

No idea if it'll work efficiently enough for your large data set; I'm just showing an awk sorted associative array as requested in the OP's comments below the question.
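
One possible tweak (an assumption on my part: GNU awk 4.0 or later, where PROCINFO["sorted_in"] appeared) is to also reproduce the | head part of the original pipeline by stopping after the top 10 entries, mirroring head's default of 10 lines:

gawk '
    BEGIN { PROCINFO["sorted_in"] = "@val_num_desc" }
    { a[$0]++ }
    # stop once the 10 most frequent lines have been printed
    END { for (i in a) { print a[i], i; if (++n == 10) break } }
' file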

Upvotes: 3

ThisSuitIsBlackNot

Reputation: 24073

The first sort in your chain of commands is the one using all the resources. Reduce the problem set by counting the unique lines first, then sorting only that much smaller output:

perl -ne '
    $count{$_}++;
    END {
        print "$count{$_} $_" for sort {
            $count{$b} <=> $count{$a} || $b cmp $a
        } keys %count
    }
' input.txt

You have 66,000 unique lines of 7 bytes each, so the memory taken up by the hash keys is going to be 66,000 * 56 bytes for each of those scalars = 3,696,000 bytes for the keys. That doesn't include the counts and the overhead of the hash, but there's no doubt this approach will easily do the trick.

Upvotes: 5

Steffen Ullrich

Reputation: 123415

Sorting is not a sequential operation; you cannot just read 10 records in, sort them, forward them, and then do the next 10 records. So if you want to sort 10 GB of data you either

  • need lots of memory, e.g. way more than 10 GB,
  • need lots of disk space (at least 10 GB) or sort in place, i.e. inside the file (this will work for fixed-size records but will be a nightmare for variable-sized records), or
  • need a smarter approach to your problem (e.g. if the record size is 1 MB but only 10 bytes of it are relevant for sorting, you can be faster and use less memory with a smart algorithm; see the sketch below).
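
For the problem at hand, the smarter approach is the one the other answers take: only the distinct lines matter, so count them in memory and sort just the counted output. A minimal sketch with plain awk and sort, assuming the set of distinct lines fits in RAM:

# Count duplicates in memory, then sort only the (much smaller) counts.
# Assumes the distinct lines fit in RAM; "file" is a placeholder name.
awk '{ count[$0]++ } END { for (line in count) print count[line], line }' file |
sort -rn | head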

BTW, did you try to set TMPDIR so that sort does not use /tmp but /var/tmp or any other directory with more disk space? Or maybe your sort has a -T option to specify the tempdir.
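
For example (a sketch; /var/tmp is just a stand-in for any filesystem with enough free space for sort's roughly input-sized temporary files):

# Let sort create its temporary files somewhere roomier than /tmp:
TMPDIR=/var/tmp sort file | uniq -c | sort -rn | head

# GNU sort also takes -T to set the temporary directory directly:
sort -T /var/tmp file | uniq -c | sort -rn | head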

Upvotes: 3
