How should I improve Python/Cython performance? Parallelization/memoryviews/numpy?

Question

My task: take 3 lists of ints, each with some multiplier, and see if the elements can be rearranged to make two lists (with larger multipliers).

I have code that does this - looped over my whole data set, it takes about 15 seconds: (EDIT: fixed errors)

%%cython
cdef bint my_check(
    list pattern1, 
    list pattern2, 
    list pattern3, 
    int amount1,
    int amount2,
    int amount3
):

    cdef dict all_items = dict()
    cdef int i, total_amount = amount1 + amount2 + amount3, m1, m2
    cdef bint bad_split = False

    # Pool the items together.

    for i in range(len(pattern1)):
        all_items[pattern1[i]] = all_items.get(pattern1[i],0) + amount1
    for i in range(len(pattern2)):
        all_items[pattern2[i]] = all_items.get(pattern2[i],0) + amount2
    for i in range(len(pattern3)):
        all_items[pattern3[i]] = all_items.get(pattern3[i],0) + amount3

    # Iterate through possible split points:
    for m1 in range(total_amount//2, total_amount):
        m2 = total_amount - m1

        # Split items into those with quantities divisible at this point and those without
        divisible = {i:all_items[i] for i in all_items if all_items[i]%m1 == 0}
        not_divisible = {i:all_items[i] for i in all_items if all_items[i]%m1 != 0}

        # Check that all of the element amounts that are not divisible by m1 are divisible by m2.
        for i in not_divisible:
            if not_divisible[i]%m2 != 0:
                bad_split = True
                break
        # If there is an element that doesn't divide by either, try the next split value.
        if bad_split:
            continue
        
        items1 = {i:divisible[i]//m1 for i in divisible}
        items2 = {i:not_divisible[i]//m2 for i in not_divisible}

        if :
            return True

    # Tried all of the split points
    return False

Then if this returns True, I run another function to do the combination. On my data set, the my_check() function is being called > 150,000 times (and taking the bulk of the time) and the other function < 500 times, so I'm not too concerned with optimizing that one.

I'd like to parallelize this to improve the performance, but what I've found:

my first thought was to use numpy functions to take advantage of vectorization, by converting all_items to a numpy array, using np.mod() and np.logical_not() to split the items, and other numpy functions in the last if clause, but that blows the time up by 3-4x compared to using the dict comprehension
if I switch the m1 range to a Cython prange, the compiler complained about using Python objects without the GIL. I switched the dicts to cdef'd numpy arrays, but that was even slower. I tried using memoryviews, but they don't seem to be easily manipulated? I read in another question here that slices can't be assigned to variables, so I don't know how I'd work with them. It won't let me cdef new variables inside the for loop.

Since I'm running at different values of m1, and terminating as soon as any of them return True, it should be parallelizable without worrying about race conditions.

What should my approach be here? Numpy? Cython? Something else?

I'm happy to post more detailed errors from any of my attempts, but figured that posting them all would get overwhelming. I haven't been able to get profiling or line profiling working for this - I've added the relevant # cython: statements to the top of the Jupyter notebook cell, but it doesn't find anything when I run it.

EDIT: Per @DavidW's answer I've replaced the middle chunk of code with the following, which cuts the time in half:

items1 = dict()
items2 = dict()
bad_split = False

for k,v in all_items.items():
    if v % m1 == 0:
        items1[k] = v//m1
    elif v % m2 == 0:
        items2[k] = v//m2
    else:
        bad_split = True
        break

I'd still like to find some way of taking advantage of my multi-core processor if that's possible.

DavidW · Accepted Answer

There's definitely some improvements you can make to the loops that doesn't change the fundamental approach but may be faster. I haven't timed these so it's worth doing that rather than taking my word for it.

for i in range(len(pattern1)):
    all_items[pattern1[i] = all_items.get(pattern1[i],0) + amount1

(Ignoring the syntax error). It's generally more ideomatic to iterate by item rather than over a range, and it avoids two lookups (sometimes that isn't true in Cython, for example iterating over numpy arrays, but for a list it's probably true):

for pattern1_i in pattern1:
    all_items[pattern1_i] = all_items.get(pattern1_i,0) + amount1

More significantly you have two loops:

divisible = {i:all_items[i] for i in all_items if all_items[i]//m1 == 0}
not_divisible = {i:all_items[i] for i in all_items if all_items[i]//m1 != 0}

You're wasting a lot of time doing dict-lookups when you could iterate directly over both keys and values. For example

divisible = {k: v for k, v in all_items.items() if v//m1 == 0}

But you're also looping over the dictionary twice and performing the same test twice.

divisible = {}
not_divisible = {}
for k, v in all_items.items():
   if v//m1 == 0:
     divisible[k] = v
   else:
     not_divisible[k] = v

It might well be possible to translate your algorithm to something involving Numpy arrays, but it's a fairly significant change and beyond my interest here.

Addendum: I'm increasingly reluctant to recommend people use C++ classes in Cython these days. Mainly because a) it can often lead to quite awkward code, b) people tend to use it in a cargo-culty way because "it's C++ so it must be faster than Python objects, and c) people tend to forgot about the cost of converting their objects to/from C++ at the start and end of every function.

However, in this case it might actually be a good choice, since your dict objects are uniformly typed, and entirely contained with a single function. The key substitution is dict -> unordered_map.

What you want to do (rough outline) is

from libcpp.unordered_map cimport unordered_map

Then type all_items, items1 and items2 as cdef unordered_map[int, int (I think...). You do this typing outside the loop. The rest of your code then remains largely the same (you may need to find a substitute for dict.get...).

Once you've got it working as a serial calculation, you should be able to turn your for m1 in range(total_amount//2, total_amount): into a prange loop, and assuming everything is correctly typed then this should work in parallel. Obviously if is a big unknown.

You must treat all_items as strictly read-only during the loop to avoid race-conditions. However, items1 and items2 should be correctly identified as loop-local variables by Cython I hope.

Here's a fairly similar answer to use as a starting point. For future readers: please think twice about whether you really need to convert all your Python objects to C++ ones; you probably don't

How should I improve Python/Cython performance? Parallelization/memoryviews/numpy?

Answers (1)

Related Questions