mamcx
mamcx

Reputation: 16186

How can I sort an iterator without putting it all in a vector?

I'm building a generic interface similar to generators that stream data from a stream to another, to eventually do things like:

file |> toCsv |> filter |> sort |> filter...

I know how to sort a vector/slice, but how can I sort from an incoming stream/iterator without putting it all in a vector?

stream.iter().collect_sorted()

I need to fuse vectors, trees, files, databases, etc., so sometimes I don't know how large the incoming data is without consuming it all.

I'm not against storing the results. The problem is that sorting is tied to slices/vector. I need to be able to do:

datasource |> Algo.sort |> next...

instead of:

let data = datasource |> into_vec
data.sort()
data |> next...

Different sorting algorithms exist for different use cases, so eventually I wish to apply the best for the data at hand:

datasource |> Algo.MergeSort |> next...
datasource |> Algo.BubbleSort |> next...

Upvotes: 25

Views: 50943

Answers (1)

Shepmaster
Shepmaster

Reputation: 430506

It is literally impossible to sort a set of values without having all of the data. For example, if the iterator had 1 billion instances of 1 followed by a single 0, you simply wouldn't know that the zero needed to go first until you got there. You may wish to re-acquaint yourself with the concept of on- and offline algorithms.

without putting it all in a vector

That's easy: don't use a vector, use any type that implements FromIterator. For example, you could collect into a BinaryHeap:

use std::{collections::BinaryHeap, iter};

fn main() {
    let a_lot_of_numbers = iter::repeat(1).take(100).chain(iter::once(0));
    let data: BinaryHeap<_> = a_lot_of_numbers.collect();
}

Whether this is a good idea or not depends entirely on your case.

If you just don't want to see the vector or just wish to preserve the chaining, then I'd suggest using Itertools::sorted. This uses a Vec internally, meaning that all of the data is stored in memory before the first value is returned:

use itertools::Itertools; // 0.8.0
use std::iter;

fn main() {
    let a_lot_of_numbers = iter::repeat(1).take(100).chain(iter::once(0));

    for v in a_lot_of_numbers.sorted() {
        println!("{}", v);
    }
}

This is a common problem with databases, where is not wise to load all the data then sort

Databases are amazingly complex pieces of software with years of effort put into them considering carefully weighed tradeoffs. You won't find that grade of algorithm lying about in a package manager. Even if you could, databases don't always get it right, requiring skilled programmers to tweak queries to perform better. All you need to know about sorting in Postgres covers a good set of what Postgres can do.

It should be theoretically possible to write an iterator adapter that writes all of the data to disk, performs sorting there, then re-reads the data from the disk. This is called external sorting.

Upvotes: 44

Related Questions