Reputation: 58471
Sorting an array of integers with numpy's quicksort has become the bottleneck of my algorithm. Unfortunately, numpy does not have radix sort yet. Although counting sort would be a one-liner in numpy:
np.repeat(np.arange(1+x.max()), np.bincount(x))
see the accepted answer to the How can I vectorize this python count sort so it is absolutely as fast as it can be? question, the integers
in my application can run from 0
to 2**32
.
Am I stuck with quicksort?
Upvotes: 11
Views: 2366
Reputation: 18638
A radix-sort with python/numba(0.23)
, according to @rcgldr C program, with multithread on a 2 cores processor.
first the radix sort on numba, with two global arrays for efficiency.
from threading import Thread
from pylab import *
from numba import jit
n=uint32(10**8)
m=n//2
if 'x1' not in locals() : x1=array(randint(0,1<<16,2*n),uint16); #to avoid regeneration
x2=x1.copy()
x=frombuffer(x2,uint32) # randint doesn't work with 32 bits here :(
y=empty_like(x)
nbits=8
buffsize=1<<nbits
mask=buffsize-1
@jit(nopython=True,nogil=True)
def radix(x,y):
xs=x.size
dec=0
while dec < 32 :
u=np.zeros(buffsize,uint32)
k=0
while k<xs:
u[(x[k]>>dec)& mask]+=1
k+=1
j=t=0
for j in range(buffsize):
b=u[j]
u[j]=t
t+=b
v=u.copy()
k=0
while k<xs:
j=(x[k]>>dec)&mask
y[u[j]]=x[k]
u[j]+=1
k+=1
x,y=y,x
dec+=nbits
then the parallélisation, possible with nogil
option.
def para(nthreads=2):
threads=[Thread(target=radix,
args=(x[i*n//nthreads(i+1)*n//nthreads],
y[i*n//nthreads:(i+1)*n//nthreads]))
for i in range(nthreads)]
for t in threads: t.start()
for t in threads: t.join()
@jit
def fuse(x,y):
kl=0
kr=n//2
k=0
while k<n:
if y[kl]<x[kr] :
x[k]=y[kl]
kl+=1
if kl==m : break
else :
x[k]=x[kr]
kr+=1
k+=1
def sort():
para(2)
y[:m]=x[:m]
fuse(x,y)
benchmarks :
In [24]: %timeit x2=x1.copy();x=frombuffer(x2,uint32) # time offset
1 loops, best of 3: 431 ms per loop
In [25]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);x.sort()
1 loops, best of 3: 37.8 s per loop
In [26]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);para(1)
1 loops, best of 3: 5.7 s per loop
In [27]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);sort()
1 loops, best of 3: 4.02 s per loop
So a pure python solution with a 10x (37s->3.5s) gain on my poor 1GHz machine. Can be enhanced with more cores and multifusion.
Upvotes: 2
Reputation: 28826
A radix sort base 256 (1 byte) can generate a matrix of counts used to determine bucket size in 1 pass of the data, then takes 4 passes to do the sort. On my system, Intel 2600K 3.4ghz, using Visual Studio release build for a C++ program, it takes about 1.15 seconds to sort 10^8 psuedo random unsigned 32 bit integers.
Looking at the u4_sort code mentioned in Ali's answer, the programs are similar, but u4_sort uses field sizes of {10,11,11}, taking 3 passes to sort data and 1 pass to copy back, while this example uses field sizes of {8,8,8,8}, taking 4 passes to sort data. The process is probably memory bandwidth limited due to the random access writes, so the optimizations in u4_sort, like macros for shift, one loop with fixed shift per field, aren't helping much. My results are better probably due to system and/or compiler differences. (Note u8_sort is for 64 bit integers).
Example code:
// a is input array, b is working array
void RadixSort(uint32_t * a, uint32_t *b, size_t count)
{
size_t mIndex[4][256] = {0}; // count / index matrix
size_t i,j,m,n;
uint32_t u;
for(i = 0; i < count; i++){ // generate histograms
u = a[i];
for(j = 0; j < 4; j++){
mIndex[j][(size_t)(u & 0xff)]++;
u >>= 8;
}
}
for(j = 0; j < 4; j++){ // convert to indices
m = 0;
for(i = 0; i < 256; i++){
n = mIndex[j][i];
mIndex[j][i] = m;
m += n;
}
}
for(j = 0; j < 4; j++){ // radix sort
for(i = 0; i < count; i++){ // sort by current lsb
u = a[i];
m = (size_t)(u>>(j<<3))&0xff;
b[mIndex[j][m]++] = u;
}
std::swap(a, b); // swap ptrs
}
}
Upvotes: 4
Reputation: 58471
No, you are not stuck with quicksort. You could use, for example,
integer_sort
from
Boost.Sort
or u4_sort
from usort. When sorting this array:
array(randint(0, high=1<<32, size=10**8), uint32)
I get the following results:
NumPy quicksort: 8.636 s 1.0 (baseline) Boost.Sort integer_sort: 4.327 s 2.0x speedup usort u4_sort: 2.065 s 4.2x speedup
I would not jump to conclusions based on this single experiment and use
usort
blindly. I would test with my actual data and measure what happens.
Your mileage will vary depending on your data and on your machine. The
integer_sort
in Boost.Sort has a rich set of options for tuning, see the
documentation.
Below I describe two ways to call a native C or C++ function from Python. Despite the long description, it's fairly easy to do it.
Boost.Sort
Put these lines into the spreadsort.cpp file:
#include <cinttypes>
#include "boost/sort/spreadsort/spreadsort.hpp"
using namespace boost::sort::spreadsort;
extern "C" {
void spreadsort(std::uint32_t* begin, std::size_t len) {
integer_sort(begin, begin + len);
}
}
It basically instantiates the templated integer_sort
for 32 bit
unsigned integers; the extern "C"
part ensures C linkage by disabling
name mangling.
Assuming you are using gcc and that the necessary include files of boost
are under the /tmp/boost_1_60_0
directory, you can compile it:
g++ -O3 -std=c++11 -march=native -DNDEBUG -shared -fPIC -I/tmp/boost_1_60_0 spreadsort.cpp -o spreadsort.so
The key flags are -fPIC
to generate
position-independet code
and -shared
to generate a
shared object
.so file. (Read the docs of gcc for further details.)
Then, you wrap the spreadsort()
C++ function
in Python using ctypes
:
from ctypes import cdll, c_size_t, c_uint32
from numpy import uint32
from numpy.ctypeslib import ndpointer
__all__ = ['integer_sort']
# In spreadsort.cpp: void spreadsort(std::uint32_t* begin, std::size_t len)
lib = cdll.LoadLibrary('./spreadsort.so')
sort = lib.spreadsort
sort.restype = None
sort.argtypes = [ndpointer(c_uint32, flags='C_CONTIGUOUS'), c_size_t]
def integer_sort(arr):
assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
sort(arr, arr.size)
Alternatively, you can use cffi:
from cffi import FFI
from numpy import uint32
__all__ = ['integer_sort']
ffi = FFI()
ffi.cdef('void spreadsort(uint32_t* begin, size_t len);')
C = ffi.dlopen('./spreadsort.so')
def integer_sort(arr):
assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
begin = ffi.cast('uint32_t*', arr.ctypes.data)
C.spreadsort(begin, arr.size)
At the cdll.LoadLibrary()
and ffi.dlopen()
calls I assumed that the
path to the spreadsort.so
file is ./spreadsort.so
. Alternatively,
you can write
lib = cdll.LoadLibrary('spreadsort.so')
or
C = ffi.dlopen('spreadsort.so')
if you append the path to spreadsort.so
to the LD_LIBRARY_PATH
environment
variable. See also Shared Libraries.
Usage. In both cases you simply call the above Python wrapper function integer_sort()
with your numpy array of 32 bit unsigned integers.
usort
As for u4_sort
, you can compile it as follows:
cc -DBUILDING_u4_sort -I/usr/include -I./ -I../ -I../../ -I../../../ -I../../../../ -std=c99 -fgnu89-inline -O3 -g -fPIC -shared -march=native u4_sort.c -o u4_sort.so
Issue this command in the directory where the u4_sort.c
file is located.
(Probably there is a less hackish way but I failed to figure that out. I
just looked into the deps.mk file in the usort directory to find out
the necessary compiler flags and include paths.)
Then, you can wrap the C function as follows:
from cffi import FFI
from numpy import uint32
__all__ = ['integer_sort']
ffi = FFI()
ffi.cdef('void u4_sort(unsigned* a, const long sz);')
C = ffi.dlopen('u4_sort.so')
def integer_sort(arr):
assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
begin = ffi.cast('unsigned*', arr.ctypes.data)
C.u4_sort(begin, arr.size)
In the above code, I assumed that the path to u4_sort.so
has been
appended to the LD_LIBRARY_PATH
environment variable.
Usage. As before with Boost.Sort, you simply call the above Python wrapper function integer_sort()
with your numpy array of 32 bit unsigned integers.
Upvotes: 14