Reputation: 33
First of all, sorry for my imperfect English.
My problem is simple to explain, I think.
result = {}
list_tuple = [(float, float, float), (float, float, float), (float, float, float), ...]  # ~200k tuples
threshold = [float, float, float, ...]  # max 1k values

for tuple in list_tuple:
    for value in threshold:
        if max(tuple) > value and min(tuple) < value:
            if value in result:
                result[value].append(tuple)
            else:
                result[value] = []
                result[value].append(tuple)
list_tuple contains around 200k tuples, and I have to do this operation very fast (2-3 seconds max on a normal PC).
My first attempt was to do this in Cython with prange() (so I could benefit from the Cython optimizations and from parallel execution), but the problem is, as always, the GIL: inside prange() I can manage lists and tuples using Cython memoryviews, but I can't insert my results into a dict.
In Cython I also tried using unordered_map from the C++ standard library, but now the problem is that I can't make a vector of arrays in C++ (that would be the value of my dict).
The second problem is similar:
list_tuple = [((float, float), (float, float)), ((float, float), (float, float)), ...]  # 200k tuples of tuples
result = {list_tuple[0][0]: []}

for tuple in list_tuple:
    if tuple[0] in result:
        result[tuple[0]].append(tuple)
    else:
        result[tuple[0]] = []
        result[tuple[0]].append(tuple)
Here I also have another problem: if I want to use prange() I have to use a custom hash function in order to use an array as the key of a C++ unordered_map.
As you can see, my snippets are very simple to run in parallel.
I thought about trying Numba, but it would probably be the same because of the GIL, and I prefer Cython because I need binary code (this library could be part of a commercial software product, so only binary libraries are allowed).
In general I would like to avoid C/C++ functions; what I hope to find is a way to manage something like dicts/lists in parallel, with Cython-level performance, while remaining as much as possible in the Python domain. But I'm open to any advice.
Thanks
Upvotes: 3
Views: 616
Reputation: 36289
Since this approach basically performs an outer product between data samples and threshold values, it increases the required memory significantly, which might be undesired. An improved approach can be found here. I keep this answer nevertheless for future reference, since it was referred to in this answer.
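To put a rough number on that (an illustration I am adding, not from the original answer), the boolean condition matrix alone for the sizes in the question is already sizeable:
# Back-of-the-envelope memory estimate for the (samples x thresholds) mask;
# numpy stores one byte per bool, and the per-threshold lists in `result`
# can take considerably more on top of that.
n_samples, n_thresh = 200_000, 1_000
mask_bytes = n_samples * n_thresh   # 200,000,000 bool entries
print(mask_bytes / 1e6, "MB")       # ~200 MB for the mask alone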
I found the performance increase as compared to the OP's code to be a factor of ~20.
This is an example using numpy. The data is vectorized and so are the operations. Note that the resulting dict contains empty lists, as opposed to the OP's example, and hence might require an additional cleaning step, if appropriate (a possible cleanup is sketched after the code below).
import numpy as np

# Data setup
data = np.random.uniform(size=(200000, 3))
thresh = np.random.uniform(size=1000)

# Boolean mask of shape (200000, 1000): True where the threshold lies
# strictly between a sample's min and max.
condition = (
    (data.min(axis=1)[:, None] < thresh)
    & (data.max(axis=1)[:, None] > thresh)
)
# One dict entry per threshold, containing the matching samples as lists.
result = {v: data[c].tolist() for c, v in zip(condition.T, thresh)}
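If the empty lists are unwanted, a small follow-up cleaning step (my sketch, not part of the original answer) could filter them out of result afterwards:
# Optional cleanup: drop thresholds that matched no samples,
# since the dict comprehension above keeps an empty list for them.
result = {v: samples for v, samples in result.items() if samples}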
Upvotes: 0
Reputation: 36289
Several performance improvements can be achieved, also by using numpy's vectorization features:
- The min and max values are currently computed anew for each threshold. Instead they can be precomputed and then reused for each threshold.
- The loop over the data samples (list_tuple) is performed in pure Python. This loop can be vectorized using numpy.
In the following tests I used data.shape == (200000, 3); thresh.shape == (1000,), as indicated in the OP. I also omitted modifications to the result dict since, depending on the data, this can quickly overflow memory.
# Precompute the per-sample min and max once instead of once per threshold.
v_min = [min(t) for t in data]
v_max = [max(t) for t in data]
for mi, ma in zip(v_min, v_max):
    for value in thresh:
        if ma > value and mi < value:
            pass
This yields a performance increase of a factor of ~5 compared to the OP's code.
# Vectorize the work over the samples; only the loop over the thresholds
# remains in pure Python.
v_min = data.min(axis=1)
v_max = data.max(axis=1)
mask = np.empty(shape=(data.shape[0],), dtype=bool)
for t in thresh:
    mask[:] = (v_min < t) & (v_max > t)
    samples = data[mask]
    if samples.size > 0:
        pass
This yields a performance increase of a factor of ~30 compared to the OP's code. This approach has the additional benefit that it doesn't contain incremental appends to the lists, which can slow down the program since memory reallocation might be required. Instead it creates each list (per threshold) in a single step.
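For completeness, if the result dict is actually wanted despite the memory caveat above, it could be filled inside the same loop. A minimal sketch, reusing data, thresh, v_min and v_max from the snippet above:
# Sketch only: build one list per threshold in a single step from the mask.
result = {}
for t in thresh:
    mask = (v_min < t) & (v_max > t)
    samples = data[mask]
    if samples.size > 0:
        result[t] = samples.tolist()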
Upvotes: 1
Reputation: 231425
@a_guest's code:
def foo1(data, thresh):
    data = np.asarray(data)
    thresh = np.asarray(thresh)
    condition = (
        (data.min(axis=1)[:, None] < thresh)
        & (data.max(axis=1)[:, None] > thresh)
    )
    result = {v: data[c].tolist() for c, v in zip(condition.T, thresh)}
    return result
This code creates a dictionary entry once for each item in thresh.
The OP's code, simplified a bit with defaultdict (from collections):
from collections import defaultdict

def foo3(list_tuple, threshold):
    result = defaultdict(list)
    for tuple in list_tuple:
        for value in threshold:
            if max(tuple) > value and min(tuple) < value:
                result[value].append(tuple)
    return result
This one updates a dictionary entry once for each item that meets the criteria.
And with his sample data:
In [27]: foo1(data,thresh)
Out[27]: {0: [], 1: [[0, 1, 2]], 2: [], 3: [], 4: [[3, 4, 5]]}
In [28]: foo3(data.tolist(), thresh.tolist())
Out[28]: defaultdict(list, {1: [[0, 1, 2]], 4: [[3, 4, 5]]})
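The small sample used in these calls isn't shown in the transcript; a setup consistent with the outputs above (my assumption, not taken from the original answer) would be:
# Hypothetical reconstruction of the small sample, matching Out[27]/Out[28]:
# two rows of consecutive integers and five integer thresholds.
data = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
thresh = np.arange(5)               # [0, 1, 2, 3, 4]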
time tests:
In [29]: timeit foo1(data,thresh)
66.1 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# In [30]: timeit foo3(data,thresh)
# 161 µs ± 242 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [31]: timeit foo3(data.tolist(),thresh.tolist())
30.8 µs ± 56.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Iteration on arrays is slower than with lists. The time for tolist() is minimal; np.asarray on lists takes longer.
With a larger data sample, the array version is faster:
In [42]: data = np.random.randint(0,50,(3000,3))
...: thresh = np.arange(50)
In [43]: timeit foo1(data,thresh)
16 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [44]: %%timeit x,y = data.tolist(), thresh.tolist()
...: foo3(x,y)
...:
83.6 ms ± 68.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 0