Reputation: 58471

How to sort an array of integers faster than quicksort?

Sorting an array of integers with numpy's quicksort has become the bottleneck of my algorithm. Unfortunately, numpy does not have radix sort yet. Although counting sort would be a one-liner in numpy:

np.repeat(np.arange(1+x.max()), np.bincount(x))

see the accepted answer to the How can I vectorize this python count sort so it is absolutely as fast as it can be? question, the integers in my application can run from 0 to 2**32.

Am I stuck with quicksort?

_{This post was primarily motivated by the
[Numpy grouping using itertools.groupby performance](https://stackoverflow.com/q/4651683/341970)
question.
Also note that
[it is not merely OK to ask and answer your own question, it is explicitly encouraged.](https://blog.stackoverflow.com/2011/07/its-ok-to-ask-and-answer-your-own-questions/)}

Upvotes: 11

Answers (3)

B. M.

Reputation: 18638

A radix-sort with python/numba(0.23), according to @rcgldr C program, with multithread on a 2 cores processor.

first the radix sort on numba, with two global arrays for efficiency.

from threading import Thread
from pylab import *
from numba import jit
n=uint32(10**8)
m=n//2
if 'x1'  not in locals() : x1=array(randint(0,1<<16,2*n),uint16); #to avoid regeneration
x2=x1.copy()
x=frombuffer(x2,uint32) # randint doesn't work with 32 bits here :(
y=empty_like(x) 
nbits=8
buffsize=1<<nbits
mask=buffsize-1

@jit(nopython=True,nogil=True)
def radix(x,y):
    xs=x.size
    dec=0
    while dec < 32 :
        u=np.zeros(buffsize,uint32)
        k=0
        while k<xs:
            u[(x[k]>>dec)& mask]+=1
            k+=1
        j=t=0
        for j in range(buffsize):
            b=u[j]
            u[j]=t
            t+=b
            v=u.copy()
        k=0
        while k<xs:
            j=(x[k]>>dec)&mask
            y[u[j]]=x[k]
            u[j]+=1
            k+=1
        x,y=y,x
        dec+=nbits

then the parallélisation, possible with nogil option.

def para(nthreads=2):
        threads=[Thread(target=radix,
            args=(x[i*n//nthreads(i+1)*n//nthreads],
            y[i*n//nthreads:(i+1)*n//nthreads])) 
            for i in range(nthreads)]
        for t in  threads: t.start()
        for t in  threads: t.join()

@jit
def fuse(x,y):
    kl=0
    kr=n//2
    k=0
    while k<n:
        if y[kl]<x[kr] :
            x[k]=y[kl]
            kl+=1
            if kl==m : break
        else :
            x[k]=x[kr]
            kr+=1
        k+=1

def sort():
    para(2)
    y[:m]=x[:m]
    fuse(x,y)

benchmarks :

In [24]: %timeit x2=x1.copy();x=frombuffer(x2,uint32) # time offset
1 loops, best of 3: 431 ms per loop

In [25]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);x.sort()
1 loops, best of 3: 37.8 s per loop

In [26]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);para(1)
1 loops, best of 3: 5.7 s per loop

In [27]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);sort()
1 loops, best of 3: 4.02 s per loop

So a pure python solution with a 10x (37s->3.5s) gain on my poor 1GHz machine. Can be enhanced with more cores and multifusion.

Upvotes: 2

rcgldr

Reputation: 28826

A radix sort base 256 (1 byte) can generate a matrix of counts used to determine bucket size in 1 pass of the data, then takes 4 passes to do the sort. On my system, Intel 2600K 3.4ghz, using Visual Studio release build for a C++ program, it takes about 1.15 seconds to sort 10^8 psuedo random unsigned 32 bit integers.

Looking at the u4_sort code mentioned in Ali's answer, the programs are similar, but u4_sort uses field sizes of {10,11,11}, taking 3 passes to sort data and 1 pass to copy back, while this example uses field sizes of {8,8,8,8}, taking 4 passes to sort data. The process is probably memory bandwidth limited due to the random access writes, so the optimizations in u4_sort, like macros for shift, one loop with fixed shift per field, aren't helping much. My results are better probably due to system and/or compiler differences. (Note u8_sort is for 64 bit integers).

Example code:

//  a is input array, b is working array
void RadixSort(uint32_t * a, uint32_t *b, size_t count)
{
size_t mIndex[4][256] = {0};            // count / index matrix
size_t i,j,m,n;
uint32_t u;
    for(i = 0; i < count; i++){         // generate histograms
        u = a[i];
        for(j = 0; j < 4; j++){
            mIndex[j][(size_t)(u & 0xff)]++;
            u >>= 8;
        }       
    }
    for(j = 0; j < 4; j++){             // convert to indices
        m = 0;
        for(i = 0; i < 256; i++){
            n = mIndex[j][i];
            mIndex[j][i] = m;
            m += n;
        }       
    }
    for(j = 0; j < 4; j++){             // radix sort
        for(i = 0; i < count; i++){     //  sort by current lsb
            u = a[i];
            m = (size_t)(u>>(j<<3))&0xff;
            b[mIndex[j][m]++] = u;
        }
        std::swap(a, b);                //  swap ptrs
    }
}

Upvotes: 4

Ali

Reputation: 58471

No, you are not stuck with quicksort. You could use, for example, integer_sort from Boost.Sort or u4_sort from usort. When sorting this array:

array(randint(0, high=1<<32, size=10**8), uint32)

I get the following results:

NumPy quicksort:         8.636 s  1.0  (baseline)
Boost.Sort integer_sort: 4.327 s  2.0x speedup
usort u4_sort:           2.065 s  4.2x speedup

I would not jump to conclusions based on this single experiment and use usort blindly. I would test with my actual data and measure what happens. Your mileage will vary depending on your data and on your machine. The integer_sort in Boost.Sort has a rich set of options for tuning, see the documentation.

Below I describe two ways to call a native C or C++ function from Python. Despite the long description, it's fairly easy to do it.

Boost.Sort

Put these lines into the spreadsort.cpp file:

#include <cinttypes>
#include "boost/sort/spreadsort/spreadsort.hpp"
using namespace boost::sort::spreadsort;

extern "C" {
    void spreadsort(std::uint32_t* begin,  std::size_t len) {
        integer_sort(begin, begin + len);
    }
}

It basically instantiates the templated integer_sort for 32 bit unsigned integers; the extern "C" part ensures C linkage by disabling name mangling. Assuming you are using gcc and that the necessary include files of boost are under the /tmp/boost_1_60_0 directory, you can compile it:

g++ -O3 -std=c++11 -march=native -DNDEBUG -shared -fPIC -I/tmp/boost_1_60_0 spreadsort.cpp -o spreadsort.so

The key flags are -fPIC to generate position-independet code and -shared to generate a shared object .so file. (Read the docs of gcc for further details.)

Then, you wrap the spreadsort() C++ function in Python using ctypes:

from ctypes import cdll, c_size_t, c_uint32
from numpy import uint32
from numpy.ctypeslib import ndpointer

__all__ = ['integer_sort']

# In spreadsort.cpp: void spreadsort(std::uint32_t* begin,  std::size_t len)
lib = cdll.LoadLibrary('./spreadsort.so')
sort = lib.spreadsort
sort.restype = None
sort.argtypes = [ndpointer(c_uint32, flags='C_CONTIGUOUS'), c_size_t]

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    sort(arr, arr.size)

Alternatively, you can use cffi:

from cffi import FFI
from numpy import uint32

__all__ = ['integer_sort']

ffi = FFI()
ffi.cdef('void spreadsort(uint32_t* begin,  size_t len);')
C = ffi.dlopen('./spreadsort.so')

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    begin = ffi.cast('uint32_t*', arr.ctypes.data)
    C.spreadsort(begin, arr.size)

At the cdll.LoadLibrary() and ffi.dlopen() calls I assumed that the path to the spreadsort.so file is ./spreadsort.so. Alternatively, you can write

lib = cdll.LoadLibrary('spreadsort.so')

C = ffi.dlopen('spreadsort.so')

if you append the path to spreadsort.so to the LD_LIBRARY_PATH environment variable. See also Shared Libraries.

Usage. In both cases you simply call the above Python wrapper function integer_sort() with your numpy array of 32 bit unsigned integers.

usort

As for u4_sort, you can compile it as follows:

cc -DBUILDING_u4_sort -I/usr/include -I./ -I../ -I../../ -I../../../ -I../../../../ -std=c99 -fgnu89-inline -O3 -g -fPIC -shared -march=native u4_sort.c -o u4_sort.so

Issue this command in the directory where the u4_sort.c file is located. (Probably there is a less hackish way but I failed to figure that out. I just looked into the deps.mk file in the usort directory to find out the necessary compiler flags and include paths.)

Then, you can wrap the C function as follows:

from cffi import FFI
from numpy import uint32

__all__ = ['integer_sort']

ffi = FFI()
ffi.cdef('void u4_sort(unsigned* a, const long sz);')
C = ffi.dlopen('u4_sort.so')

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    begin = ffi.cast('unsigned*', arr.ctypes.data)
    C.u4_sort(begin, arr.size)

In the above code, I assumed that the path to u4_sort.so has been appended to the LD_LIBRARY_PATH environment variable.

Usage. As before with Boost.Sort, you simply call the above Python wrapper function integer_sort() with your numpy array of 32 bit unsigned integers.

Upvotes: 14

How to sort an array of integers faster than quicksort?

Answers (3)

Related Questions