jdowdell

Reputation: 1638

What standard commands can I use to print just the first few lines of sorted output on the command line efficiently?

I basically want the equivalent of

... | sort -arg1 -arg2 -... | head -n $k

but my understanding is that sort will do O(n log n) work over the whole input. In my case I'm dealing with lots of data, so runtime matters to me, and I also have a habit of overflowing my /tmp directory with sort temporary files.

I'd rather have it run in O(n log k) using e.g. a heap, which would presumably be faster, and which also reduces the working-set memory to roughly k lines.

Is there some combination of standard command-line tools that can do this efficiently, without me having to code something myself? Ideally it would support the full expressive power of the sort command. sort (on Ubuntu, at least) appears to have no man-page-documented switch to pull this off...
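
To make the shape of what I'm after concrete, here is a minimal illustrative Python sketch of the heap behaviour I mean - not an existing tool, and the whole-line comparison is just a stand-in for sort's key handling:

#!/usr/bin/env python
# illustrative only: bounded-heap top-k over stdin, O(n log k) time, O(k) memory
# usage: ... | python topk.py 10
import sys
import heapq

k = int(sys.argv[1])

# heapq.nsmallest streams the input and keeps at most k candidate lines,
# instead of materializing and sorting all n lines the way sort(1) does
for line in heapq.nsmallest(k, sys.stdin):
    sys.stdout.write(line)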

Upvotes: 7

Views: 2593

Answers (3)

jdowdell

Reputation: 1638

Based on the above, and some more poking, I'd say the official answer to my question is "there is no solution." You can use specialized tools, or you can use the tools you've got with their current performance, or you can write your own tool.

I'm debating tracking down the sort source code and offering a patch. In the meantime, in case this quick hack helps anybody doing something similar to what I was doing, here's what I wrote for myself. It's not the best Python, and the benchmark is very shady; I offer it to anybody who cares to produce something more rigorous:

  • 256 files, of about 1.6 Gigs total size, all sitting on an ssd, lines separated by \n, lines of format [^\t]*\t[0-9]+
  • Ubuntu 10.4, 6 cores, 8 gigs of ram, /tmp on ssd as well.
  • $ time sort -t^v<tab> -k2,2n foo* | tail -10000
    • real 7m26.444s
    • user 7m19.790s
    • sys 0m17.530s
  • $ time python test.py 10000 foo*
    • real 1m29.935s
    • user 1m28.640s
    • sys 0m1.220s
  • Using diff to compare the outputs, the two methods differ on tie-breaking, but otherwise the sort order is the same.

test.py:

#!/usr/bin/env python
# test.py

from sys import argv
import heapq
from itertools import chain

# parse N - the size of the heap, and confirm we can open all input files
N = int(argv[1])
streams = [open(f, "r") for f in argv[2:]]

def line_iterator_to_tuple_iterator(line_i):
    for line in line_i:
        s,c = line.split("\t")
        c = int(c)
        yield (c, s)

# use heap to process inputs
rez = heapq.nlargest(N,
               line_iterator_to_tuple_iterator(chain(*streams)),
               key=lambda x: x[0])

for r in rez:
    print "%s\t%s" % (r[1], r[0])

for s in streams:
    s.close()

Upvotes: 2

Keith Thompson

Reputation: 263277

Here's a crude partial solution:

#!/usr/bin/perl

use strict;
use warnings;

my @lines = ();

while (<>) {
    push @lines, $_;
    @lines = sort @lines;
    if (scalar @lines > 10) {
        pop @lines;    # drop the largest of the 11, keeping the 10 smallest
    }
}
print @lines;

It reads the input data only once, continuously maintaining a sorted array of the top 10 lines.

Sorting the whole array every time is inefficient, of course, but I'll guess that for a gigabyte input it will still be substantially faster than sort huge-file | head.
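
For what it's worth, here's the same incremental idea without the repeated full sort, sketched in Python rather than Perl purely for illustration (the hard-coded 10 and the plain lexicographic comparison are carried over from the Perl above):

#!/usr/bin/env python
# keep the 10 smallest lines seen so far in an always-sorted list
import sys
import bisect

TOP = 10
best = []

for line in sys.stdin:
    # O(log k) to find the slot, O(k) to shift - cheap when k is small
    bisect.insort(best, line)
    if len(best) > TOP:
        best.pop()    # drop the current largest line

sys.stdout.write("".join(best))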

Adding an option to vary the number of lines printed would be easy enough. Adding options to control how the sorting is done would be a bit more difficult, though I wouldn't be surprised if there's something in CPAN that would help with that.

More abstractly, one approach to getting just the first N sorted elements from a large array is to use a partial Quicksort, where you don't bother sorting the right partition unless you need to. That requires holding the entire array in memory, which is probably impractical in your case.
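
If it helps to see the shape of that, here's a rough partial-quicksort sketch (again in Python, purely for illustration); it still needs the whole list in memory, which is the caveat above:

#!/usr/bin/env python
# partial quicksort: only the prefix a[:n] ends up fully sorted
import random

def partial_qsort(a, n, lo=0, hi=None):
    """Rearrange a so that a[:n] holds the n smallest elements in
    sorted order; right-hand partitions are sorted only when they
    overlap those first n slots."""
    if hi is None:
        hi = len(a)
    if hi - lo <= 1:
        return
    pivot = a[random.randrange(lo, hi)]
    left  = [x for x in a[lo:hi] if x < pivot]
    mid   = [x for x in a[lo:hi] if x == pivot]
    right = [x for x in a[lo:hi] if x > pivot]
    a[lo:hi] = left + mid + right
    partial_qsort(a, n, lo, lo + len(left))
    if lo + len(left) + len(mid) < n:    # skip the right partition unless needed
        partial_qsort(a, n, lo + len(left) + len(mid), hi)

Usage would be something like partial_qsort(lines, 10) followed by printing lines[:10].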

You could split the input into medium-sized chunks, apply some clever algorithm to get the top N lines of each chunk, concatenate those per-chunk results, and then apply the same algorithm to the result. Depending on the size of the chunks, sort ... | head might be sufficiently clever. It shouldn't be difficult to throw together a shell script using split -l ... to do this.

(Insert more hand-waving as needed.)
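
A hedged Python rendering of that chunking idea (the chunk size is arbitrary, and sorted(chunk)[:n] plays the role of running sort ... | head on one split -l piece):

#!/usr/bin/env python
# chunked top-n: take the top n of each chunk, then the top n of the survivors
import sys

def top_n_chunked(lines, n, chunk_size=100000):
    survivors = []
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            survivors.extend(sorted(chunk)[:n])    # per-chunk "sort | head"
            chunk = []
    if chunk:
        survivors.extend(sorted(chunk)[:n])
    return sorted(survivors)[:n]                   # same trick on the survivors

if __name__ == "__main__":
    for line in top_n_chunked(sys.stdin, int(sys.argv[1])):
        sys.stdout.write(line)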

Disclaimer: I just tried this on a much smaller file than what you're working with (about 1.7 million lines), and my method was slower than sort ... | head.

Upvotes: 0

jim mcnamara

Reputation: 16389

UNIX/Linux provides a generalist's toolset. For large datasets it does loads of I/O. It will do everything you could want, but slowly. If we had an idea of the input data, it would help immensely.

IMO, you have some choices, none of which you will really like.

  1. Do a multipart "radix" pre-sort - for example, have awk write all of the lines whose keys start with 'A' to one file, 'B' to another, etc. This creates (up to) 26 files named A, B ... Z. Or, if you only want 'P', 'D' & 'Q', have awk just suck out what you want and then do a full sort on that small subset (there's a rough sketch of this after the list):

    awk '{print $0 > substr($0,1,1)}' bigfile; sort [options here] P D Q > result

  2. Spend $$: for example, buy CoSort from iri.com or any other commercial sort software. These sorts use all kinds of optimizations, but they are not free like bash. You could also buy an SSD, which speeds up sorting on disk by several orders of magnitude - on the order of 5000 IOPS going to 75000 IOPS. Use the TMPDIR variable to put your temp files on the SSD, and read and write only to the SSD, but keep using your existing UNIX toolset.

  3. Use software like R or Stata, or preferably a database; all of these are meant for large datasets.

  4. Do what you are doing now, but watch YouTube while the UNIX sort runs.
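
For option 1, here is a rough Python equivalent of the awk bucketing step (the file name bigfile and the P/D/Q buckets are just the example values from above; this is a sketch, not a drop-in tool):

#!/usr/bin/env python
# bucket lines of bigfile by their first character, like the awk one-liner above
buckets = {}
with open("bigfile") as f:
    for line in f:
        key = line[:1]
        if not key.strip():
            key = "_misc"           # guard for blank lines
        if key not in buckets:
            buckets[key] = open(key, "w")
        buckets[key].write(line)

for out in buckets.values():
    out.close()

# then the expensive full sort only touches the buckets you care about, e.g.:
#   sort [options here] P D Q > result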

IMO, you are using the wrong tools for large datasets when you want quick results.

Upvotes: 1
