Makc

Reputation: 307

Improving Python + numpy array allocation/initialization performance

I'm writing a Python program that uses some external functionality from a DLL. My problem is passing matrices (numpy arrays in Python) in and out of the C code. Currently I'm using the following code to receive data from the DLL:

import ctypes as ct
import numpy as np

peak_count = ct.c_int16()
peak_wl_array = np.zeros(512, dtype=np.double)
peak_pwr_array = np.zeros(512, dtype=np.double)

res = __dll.DLL_Search_Peaks(ct.c_int(data.shape[0]),
                             ct.c_void_p(data.ctypes.data),
                             ct.c_void_p(peak_wl_array.ctypes.data),
                             ct.c_void_p(peak_pwr_array.ctypes.data),
                             ct.byref(peak_count))

It works like a charm, but my problem is numpy's allocation speed: even without calling the DLL (the call commented out), it takes 3.1 seconds per 100,000 calls.

All that's left then is allocating with np.zeros() and taking a writable pointer with ctypes.c_void_p(D.ctypes.data).

I need to process about 20,000 calls per second, so almost all of the time is spent just allocating memory.

I've thought about Cython, but it won't speed up the numpy allocation, so I'd gain nothing.

Is there a faster way to receive matrix-like data from a C-written DLL?
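
For reference, the same call can also be declared through numpy.ctypeslib.ndpointer, so the arrays are passed directly rather than through c_void_p casts. A minimal sketch, assuming the same DLL_Search_Peaks signature as above (this doesn't change the allocation cost, though):

import ctypes as ct
import numpy as np
from numpy.ctypeslib import ndpointer

# declare the argument types once; ndpointer also checks dtype/contiguity
__dll.DLL_Search_Peaks.argtypes = [
    ct.c_int,
    ndpointer(dtype=np.double, flags='C_CONTIGUOUS'),  # input data
    ndpointer(dtype=np.double, flags='C_CONTIGUOUS'),  # peak wavelengths (out)
    ndpointer(dtype=np.double, flags='C_CONTIGUOUS'),  # peak powers (out)
    ct.POINTER(ct.c_int16),                            # peak count (out)
]

res = __dll.DLL_Search_Peaks(int(data.shape[0]), data,
                             peak_wl_array, peak_pwr_array,
                             ct.byref(peak_count))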

Upvotes: 0

Views: 1824

Answers (1)

Jonathan Dursi

Reputation: 50957

Memory operations are expensive, numpy or otherwise.

If you're going to be allocating a lot of arrays, it's a good idea to see if you can do the allocation just once and then use views or subarrays to work with one part of the array at a time:

import numpy as np

niters=10000
asize=512

def forig():
    # allocate two fresh arrays on every iteration
    for i in xrange(niters):
        peak_wl_array  = np.empty((asize,), dtype=np.double)
        peak_pwr_array = np.empty((asize,), dtype=np.double)

    return peak_pwr_array


def fviews():
    # allocate once, then slice out a view per iteration
    peak_wl_arrays  = np.empty((asize*niters,), dtype=np.double)
    peak_pwr_arrays = np.empty((asize*niters,), dtype=np.double)

    for i in xrange(niters):
        # create views
        peak_wl_array  = peak_wl_arrays[i*asize:(i+1)*asize]
        peak_pwr_array = peak_pwr_arrays[i*asize:(i+1)*asize]
        # then do something

    return peak_pwr_arrays


def fsubemptys():
    # allocate once as a 2D array; row i serves as this iteration's subarray
    peak_wl_arrays  = np.empty((niters,asize), dtype=np.double)
    peak_pwr_arrays = np.empty((niters,asize), dtype=np.double)

    for i in xrange(niters):
        pass  # do something with peak_wl_arrays[i,:]

    return peak_pwr_arrays


import timeit

print timeit.timeit(forig,number=100)
print timeit.timeit(fviews,number=100)
print timeit.timeit(fsubemptys,number=100)

Running this gives:

3.41996979713
0.844147920609
0.00169682502747

Note that if you're using (say) np.zeros instead, you're spending most of your time initializing memory, not allocating it. That always takes substantially longer and erases most of the difference between these approaches. (The reason fsubemptys looks almost free above is that np.empty doesn't touch the memory it returns; large allocations are typically just mapped in lazily by the OS.) With np.zeros, the timings become:

4.20200014114
5.43090081215
4.58127593994
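
Those numbers presumably come from the same three functions with np.zeros substituted for np.empty; a sketch of two of the variants, with hypothetical names:

def forig_zeros():
    # two fresh zeroed arrays per iteration: allocate AND touch every byte
    for i in xrange(niters):
        peak_wl_array  = np.zeros((asize,), dtype=np.double)
        peak_pwr_array = np.zeros((asize,), dtype=np.double)

    return peak_pwr_array

def fsubzeros():
    # one big zeroed block up front; the zeroing still has to touch
    # every page, so most of the per-call saving disappears
    peak_wl_arrays  = np.zeros((niters,asize), dtype=np.double)
    peak_pwr_arrays = np.zeros((niters,asize), dtype=np.double)

    for i in xrange(niters):
        pass  # do something with peak_wl_arrays[i,:]

    return peak_pwr_arrays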

Good single-threaded bandwidth to main memory on newer systems is something like ~10 GB/s (roughly a billion doubles per second), so zeroing the two 512-element arrays will always take about

1024 doubles/call / (10^9 doubles/sec) ≈ 1 microsecond/call

which is already a significant chunk of the time you're seeing. Still, if you zero one large array before making the calls rather than zeroing per call, the total time of execution will be about the same, but the latency of each individual call will be lower.
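
Applied to the original ctypes call, a minimal sketch of that idea (assuming the DLL_Search_Peaks signature from the question, C-contiguous double buffers, at most 512 peaks per call, and a hypothetical wrapper name search_peaks):

import ctypes as ct
import numpy as np

ASIZE  = 512      # max peaks per call, from the question
NCALLS = 20000    # hypothetical pool size: one row per call

# allocate (and zero) the output buffers once, up front
peak_wl_pool  = np.zeros((NCALLS, ASIZE), dtype=np.double)
peak_pwr_pool = np.zeros((NCALLS, ASIZE), dtype=np.double)
peak_count    = ct.c_int16()

def search_peaks(data, i):
    # row i of a C-contiguous 2D array is itself contiguous, so
    # .ctypes.data of the row view points straight at that row's storage
    res = __dll.DLL_Search_Peaks(ct.c_int(data.shape[0]),
                                 ct.c_void_p(data.ctypes.data),
                                 ct.c_void_p(peak_wl_pool[i].ctypes.data),
                                 ct.c_void_p(peak_pwr_pool[i].ctypes.data),
                                 ct.byref(peak_count))
    return res, peak_count.value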

Upvotes: 2
