richselian

Reputation: 731

faster string sorting with long common prefix?

I have a set of strings. 90% of them are URLs that start with "http://www.". I want to sort them alphabetically.

Currently I use C++ std::sort(), but std::sort is a comparison-based variant of quicksort, and comparing two strings with a long common prefix is not efficient. However (I think) a radix sort won't work either, since most strings land in the same bucket because of the long common prefix.

Is there any better algorithm than normal quick-sort/radix-sort for this problem?

Upvotes: 9

Views: 1243

Answers (6)

KWillets

Reputation: 115

The paper Engineering Parallel String Sorting has, surprisingly, benchmarks of a large number of single-threaded algorithms on a URL dataset (see page 29). Rantala's variant of multikey quicksort with caching comes out ahead; you can test multikey_cache8 in the repository cited in the paper.

I've tested the dataset used in that paper, and if it's any indication you're seeing barely one bit of entropy in the first ten characters, and distinguishing prefixes in the 100-character range. Doing 100 passes of radix sort will thrash the cache with little benefit; e.g. sorting a million URLs means you're looking for ~20 distinguishing bits in each key at a cost of ~100 cache misses.

While in general radix sort will not perform well on long strings, the optimizations that Kärkkäinen and Rantala describe in Engineering Radix Sort for Strings are sufficient for the URL dataset. In particular, they read ahead 8 characters and cache them alongside the string pointers; sorting on these cached values yields enough entropy to get past the cache-miss problem.
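As an illustration only (my sketch, not Rantala's multikey_cache8, which refills the cache as the recursion deepens rather than falling back to a full comparison): pack the next 8 bytes of each string into a 64-bit integer stored next to its pointer, sort on the integers, and break rare ties with an ordinary string comparison.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

/* Pack up to 8 bytes starting at 'depth' into a big-endian integer, so that
   comparing the integers matches comparing the bytes lexicographically. */
static uint64_t cache8(const std::string& s, size_t depth) {
    uint64_t key = 0;
    for(size_t i = 0; i < 8; i++) {
        unsigned char c = (depth + i < s.size()) ? s[depth + i] : 0;
        key = (key << 8) | c;
    }
    return key;
}

static void sort_with_cache(std::vector<std::string>& v, size_t depth = 0) {
    typedef std::pair<uint64_t, const std::string*> Entry;
    std::vector<Entry> keyed;
    keyed.reserve(v.size());
    for(size_t i = 0; i < v.size(); i++)
        keyed.push_back(Entry(cache8(v[i], depth), &v[i]));

    std::sort(keyed.begin(), keyed.end(), [](const Entry& a, const Entry& b) {
        if(a.first != b.first) return a.first < b.first; /* cached bytes decide */
        return *a.second < *b.second;                    /* rare tie: full comparison */
    });

    std::vector<std::string> out;
    out.reserve(v.size());
    for(size_t i = 0; i < keyed.size(); i++) out.push_back(*keyed[i].second);
    v.swap(out);
}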

For longer strings try some of the LCP-based algorithms in that repository; in my experience the URL dataset is about at the break-even point between highly-optimized radix-type sorts and LCP-based algorithms which do asymptotically better on longer strings.

Upvotes: 1

richselian

Reputation: 731

At last I found that a ternary quicksort works well. I found the algorithm at www.larsson.dogma.net/qsufsort.c.

Here is my modified implementation, with an interface similar to std::sort's. It's about 40% faster than std::sort on my machine and dataset.

#include <algorithm>
#include <cstddef>
#include <iterator>

template<class RandIt> static inline void multiway_qsort(RandIt beg, RandIt end, size_t depth = 0, size_t step_len = 6) {
    if(beg + 1 >= end) {
        return;
    }

    struct { /* bounded comparison: look at no more than step_len characters, starting at offset depth */
        inline int operator() (
                const typename std::iterator_traits<RandIt>::value_type& a,
                const typename std::iterator_traits<RandIt>::value_type& b, size_t depth, size_t step_len) {

            for(size_t i = 0; i < step_len; i++) {
                if(a[depth + i] == b[depth + i] && a[depth + i] == 0) return 0; /* both strings ended: equal */
                if(a[depth + i] <  b[depth + i]) return +1; /* note the inverted convention: +1 means a < b */
                if(a[depth + i] >  b[depth + i]) return -1;
            }
            return 0; /* equal within the window; may still differ beyond it */
        }
    } bounded_cmp;

    RandIt i = beg;
    RandIt j = beg + std::distance(beg, end) / 2;
    RandIt k = end - 1;

    typename std::iterator_traits<RandIt>::value_type key = ( /* median of first/middle/last */
            bounded_cmp(*i, *j, depth, step_len) > 0 ?
            (bounded_cmp(*i, *k, depth, step_len) > 0 ? (bounded_cmp(*j, *k, depth, step_len) > 0 ? *j : *k) : *i) :
            (bounded_cmp(*i, *k, depth, step_len) < 0 ? (bounded_cmp(*j, *k, depth, step_len) < 0 ? *j : *k) : *i));

    /* 3-way partition: [beg, i) < key, [i, k] == key, (k, end) > key */
    for(j = i; j <= k; ++j) {
        switch(bounded_cmp(*j, key, depth, step_len)) {
            case +1: std::iter_swap(i, j); ++i;      break;
            case -1: std::iter_swap(k, j); --k; --j; break;
        }
    }
    ++k;

    if(beg + 1 < i) multiway_qsort(beg, i, depth, step_len); /* recursively sort [x < pivot] subset */
    if(k + 1 < end) multiway_qsort(k, end, depth, step_len); /* recursively sort [x > pivot] subset */

    /* recursively sort [x == pivot] subset at a higher depth, but only if the
       pivot does not end inside the current window (if it does, every string
       in this subset is fully equal already, and stepping deeper would read
       past the terminator) */
    if(i < k) {
        size_t n = 0;
        while(n < step_len && key[depth + n] != 0) n++;
        if(n == step_len) multiway_qsort(i, k, depth + step_len, step_len);
    }
    return;
}
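For completeness, a minimal driver (my sketch, not part of the measurement above). It relies on elements reading as 0 when indexed just past their last character, which holds for std::string and for NUL-terminated const char*:

#include <cstdio>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> urls;
    urls.push_back("http://www.example.com/b");
    urls.push_back("http://www.example.com/a");
    urls.push_back("ftp://mirror.example.org/pub");

    multiway_qsort(urls.begin(), urls.end());

    for(size_t i = 0; i < urls.size(); i++) {
        printf("%s\n", urls[i].c_str());
    }
    return 0;
}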

Upvotes: 2

rici

Reputation: 241911

If you figure out the minimum and maximum values in the vector before you start quicksorting, then you always know the range of values for each call to partition(): the pivot value is (at least close to) the maximum of the left subrange and the minimum of the right subrange, and the containing partition's minimum and maximum supply the other endpoint of each subrange.

If the minimum and the maximum of a subrange share a common prefix, then you can do all of the partition comparisons starting from the character position following the common prefix. As the quicksort progresses, the ranges get smaller and smaller so their common prefixes should get longer and longer, and ignoring them for the comparisons will save more and more time. How much, I don't know; you'd have to benchmark this to see if it actually helps.

In any event, the additional overhead is fairly small: one pass through the vector to find the minimum and maximum strings, costing 1.5 comparisons per string (*), and then one check per partition to find its maximum shared prefix; the check is equivalent to a comparison, and it can start from the maximum shared prefix of the containing partition, so it's not even a full string comparison.


  • (*) The min/max algorithm: scan the vector two elements at a time. For each pair, first compare the two elements with each other, then compare the smaller one with the running minimum and the larger one with the running maximum. Result: three comparisons for two elements, or 1.5 comparisons per element (sketched below).
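A minimal sketch of that scan, plus the shared-prefix computation it enables (the helper names are illustrative, not prescribed by the answer):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

/* Pairwise min/max scan: ~1.5 comparisons per element instead of 2.
   Assumes v is non-empty. */
static void find_min_max(const std::vector<std::string>& v, size_t& lo, size_t& hi) {
    lo = hi = 0;
    for(size_t i = 1; i + 1 < v.size(); i += 2) {
        size_t small = i, big = i + 1;
        if(v[big] < v[small]) std::swap(small, big); /* 1st: order the pair */
        if(v[small] < v[lo]) lo = small;             /* 2nd: smaller vs running minimum */
        if(v[hi] < v[big])   hi = big;               /* 3rd: larger vs running maximum */
    }
    if(v.size() % 2 == 0) { /* an even count leaves one unpaired element at the end */
        size_t last = v.size() - 1;
        if(v[last] < v[lo]) lo = last;
        if(v[hi] < v[last]) hi = last;
    }
}

/* Length of the prefix shared by a range's minimum and maximum; every
   comparison within that range can safely start at this offset. */
static size_t shared_prefix(const std::string& mn, const std::string& mx) {
    size_t n = 0;
    while(n < mn.size() && n < mx.size() && mn[n] == mx[n]) n++;
    return n;
}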

Upvotes: 2

ElKamina

Reputation: 7817

Create two groups: the strings with the prefix and the strings without it. For the first group, remove the prefix, sort, and add the prefix back. For the second group, just sort. After that, divide the second (sorted) group into the strings that sort before the prefix and the strings that sort after it. Now concatenate the three lists (list_2_before, list_1, list_2_after).

Instead of removing and re-adding the prefix for the first list, you can write your own custom comparison that starts comparing after the prefix (i.e. ignores the prefix part while comparing); a sketch of the whole scheme follows.
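A minimal sketch (the helper names are mine, not the answer's):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

static void sort_with_known_prefix(std::vector<std::string>& v, const std::string& prefix) {
    /* Split into the strings that carry the prefix and the rest. */
    std::vector<std::string> with, without;
    for(size_t i = 0; i < v.size(); i++) {
        if(v[i].compare(0, prefix.size(), prefix) == 0) with.push_back(v[i]);
        else without.push_back(v[i]);
    }

    /* Sort the prefixed group comparing only the part after the prefix. */
    size_t off = prefix.size();
    std::sort(with.begin(), with.end(), [off](const std::string& a, const std::string& b) {
        return a.compare(off, std::string::npos, b, off, std::string::npos) < 0;
    });
    std::sort(without.begin(), without.end());

    /* A string without the prefix sorts entirely before or after every string
       with it, so splice: before-group, prefixed group, after-group. */
    std::vector<std::string>::iterator mid = std::lower_bound(without.begin(), without.end(), prefix);
    std::vector<std::string> out;
    out.reserve(v.size());
    out.insert(out.end(), without.begin(), mid);
    out.insert(out.end(), with.begin(), with.end());
    out.insert(out.end(), mid, without.end());
    v.swap(out);
}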

Addendum: If you have multiple common prefixes you can exploit them further: build a shallow tree out of the most common prefixes and join the sorted pieces.

Upvotes: 2

Timothy Shields

Reputation: 79581

I would suspect that the processing time you spend trying to exploit common prefixes on the order of 10 characters per URL doesn't even pay for itself when you consider the average length of URLs.

Just try a completely standard sort. If that's not fast enough, look at parallelizing or distributing a completely standard sort. It's a straightforward approach that will work.

Upvotes: 4

Anil Vaitla

Reputation: 2978

Common prefixes naturally suggest that a trie data structure could be useful. So the idea is to build a trie of all the words, keeping the children of each node in a sorted list; since at a particular node we only need to order its own children, a recursive solution reveals itself, and a depth-first traversal then emits the strings in sorted order. See this for more inspiration: http://goanna.cs.rmit.edu.au/~jz/fulltext/acsc03sz.pdf
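As an illustration (my sketch, not the algorithm from the linked paper): keep each node's children in an ordered map, insert every string, then walk the trie depth-first; visiting children in key order emits the strings sorted.

#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct TrieNode {
    std::map<char, TrieNode*> children; /* std::map keeps children ordered by character */
    size_t count;                       /* strings ending at this node (handles duplicates) */
    TrieNode() : count(0) {}
    ~TrieNode() {
        for(std::map<char, TrieNode*>::iterator it = children.begin(); it != children.end(); ++it)
            delete it->second;
    }
};

static void emit_sorted(const TrieNode& node, std::string& path, std::vector<std::string>& out) {
    for(size_t i = 0; i < node.count; i++) out.push_back(path);
    for(std::map<char, TrieNode*>::const_iterator it = node.children.begin();
            it != node.children.end(); ++it) {
        path.push_back(it->first);
        emit_sorted(*it->second, path, out); /* children visited in sorted order */
        path.pop_back();
    }
}

static void trie_sort(std::vector<std::string>& v) { /* fine for ASCII; use unsigned char keys for full byte order */
    TrieNode root;
    for(size_t i = 0; i < v.size(); i++) {
        TrieNode* node = &root;
        for(size_t j = 0; j < v[i].size(); j++) {
            TrieNode*& child = node->children[v[i][j]];
            if(child == NULL) child = new TrieNode();
            node = child;
        }
        node->count++;
    }
    std::vector<std::string> out;
    out.reserve(v.size());
    std::string path;
    emit_sorted(root, path, out);
    v.swap(out);
}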

Upvotes: 3
