Sebastian H

Reputation: 131

Faster iteration over 2D Numpy/CuPy arrays based on unique values

I am currently looping over a NumPy array to slice it and do some ndarray calculations. The time needed is currently much too long, due to the size of the array of 2001*2001 elements. Therefore I hope that someone can suggest a hint on how to speed up the code:

import cupy as cp
from time import time

height, width = 187, 746
org_sized = cp.random.rand(2001, 2001) * 60

height_mat = cp.random.rand(height, width) * 100 # originally the values grow from (0, width//2) outwards with the distance squared

indices = cp.indices((height, width))
y_offsets = indices[0]
x_offsets = indices[1] - (width + 1)/2
angle_mat = cp.round_(2*(90 - cp.rad2deg(cp.arctan2(y_offsets, x_offsets))) + 180).astype(int)

weights = cp.random.rand(361) / 10  # weights originally larger in the middle

# pad the org_sized matrix with zeros to fit a size of (2001+height, 2001+width)
west = cp.zeros((org_sized.shape[0], width // 2))
east = cp.zeros((org_sized.shape[0], round(width // 2)))

enlarged_size = cp.hstack((west, org_sized))
enlarged_size = cp.hstack((enlarged_size, east))

south = cp.zeros((height, enlarged_size.shape[1]))

enlarged_size = cp.vstack((enlarged_size, south))

shadow_time_hrs = cp.zeros_like(org_sized)


for y in range(org_sized.shape[0]):
    start_time = time()
    for x in range(org_sized.shape[1]):
        # slice the enlarged elevations so that they match height_mat in size and are correctly aligned
        short_elevations = enlarged_size[y:y+height, x:x+width]

        overshadowed = (short_elevations - org_sized[y, x]) > height_mat
        shadowed_angles = angle_mat * overshadowed
        shadowed_segments = cp.unique(shadowed_angles)
        angle_segments = shadowed_segments

        sum_hours = cp.sum(weights[angle_segments])
        shadow_time_hrs[y, x] = sum_hours
    if (y % 100) == 0:
        print(f"Computation for line {y} took: {time() - start_time}.")

Firstly I used Numba's @njit on the function calc_shadow_point, but it turned out that it was 2 times slower than without it. Therefore I switched from NumPy arrays to CuPy arrays, which gave a speed-up of only about 50 %, probably because the arrays are so small.

Are there other approaches than iterating for this kind of problem, or is there a way to do the iteration with multi-threading?

Edit: I changed the code to a minimal example with the same runtime (1.1 s per line of org_sized). Somehow I have to increase the computation speed; anything below 10 % of the current computation time would make the code usable. Following the remarks I changed np.unique to cp.unique, but as noted it didn't result in a large speed-up, barely 6 %. I am currently using a GTX 1060, but if it would help I could manage to use a 1660 Ti.

Upvotes: 1

Views: 1009

Answers (1)

Jérôme Richard

Reputation: 50836

unique is slow (both on CPUs and GPUs) because it generally uses either a hash-map or a sort internally. Moreover, as you said, the arrays are too small to be used efficiently on the GPU, resulting in huge kernel overheads. Fortunately, you do not need it: you can use bincount (with minlength=361 and a flattened array) because you know the values are small positive integers in the bounded range 0:361. Actually, you do not even need to count the values like bincount does; you only want to know which values of the range 0:361 exist in shadowed_angles. Thus, a faster variant of bincount can be written using Numba. Moreover, the array computations can be done in one pass, reducing the number of allocations and the memory pressure. Finally, parallelism can be used to speed up the computation (using prange and parallel=True of Numba).
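
For illustration, the cp.unique call in the question's inner loop could be replaced by a bincount roughly as follows (a sketch, not part of the original answer, reusing the question's variable names):

# count how often every angle bin 0..360 occurs in the current window
counts = cp.bincount(shadowed_angles.ravel(), minlength=361)
# only membership in a bin matters, not the counts themselves
sum_hours = cp.sum(weights[counts > 0])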

Here is the resulting CPU-based implementation:

import numba as nb
import numpy as np

@nb.njit
def computeSumHours(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, y, x):
    height, width = height_mat.shape
    short_elevations = enlarged_size[y:y+height, x:x+width]
    shadowed_segments = np.zeros(361)  # one slot per angle bin (0..360)

    for y2 in range(height):
        for x2 in range(width):
            overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            shadowed_angle = angle_mat[y2, x2] * overshadowed
            # only membership in a bin matters, so store the bin's weight (weights is read as a global)
            shadowed_segments[shadowed_angle] = weights[shadowed_angle]

    return shadowed_segments.sum()

@nb.njit(parallel=True)
def computeLine(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, y):
    height, width = height_mat.shape

    for x in nb.prange(org_sized.shape[1]):
        shadow_time_hrs[y, x] = computeSumHours(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, y, x)

def computeAllLines(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs):
    height, width = height_mat.shape

    for y in range(org_sized.shape[0]):
        start_time = time()
        computeLine(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, y)
        if (y % 100) == 0:
            print("Computation for line %d took: %f." % (y, time() - start_time))

computeAllLines(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs)
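
One practical note: this implementation expects NumPy arrays, including the weights array that computeSumHours reads as a global. If the inputs were created with CuPy as in the question, they would first have to be copied back to the host, roughly like this (a sketch, not part of the original answer):

import numpy as np
import cupy as cp

# move the CuPy arrays back to host memory so Numba can operate on them
org_sized = cp.asnumpy(org_sized)
enlarged_size = cp.asnumpy(enlarged_size)
angle_mat = cp.asnumpy(angle_mat)
height_mat = cp.asnumpy(height_mat)
weights = cp.asnumpy(weights)  # read as a global inside computeSumHours
shadow_time_hrs = np.zeros_like(org_sized)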

Here are the timing results per iteration on my machine (using a i7-9600K and a GTX-1660-Super):

Reference implementation (CPU): 2.015 s
Reference implementation (GPU): 0.882 s
Optimized implementation (CPU): 0.082 s

This is 10 times faster than the reference GPU-based implementation and 25 times faster than the reference CPU-based one.

Note that the same technique can be used on a GPU, but not with CuPy: one needs to write a GPU kernel doing that (e.g. using CUDA). However, doing this efficiently is quite complex.
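
For reference, a deliberately naive (and untuned) sketch of such a kernel, written with Numba's CUDA support instead of raw CUDA, could look as follows; the kernel name, the launch configuration and the assumption that the inputs are device (or CuPy) arrays are mine, not part of the answer:

from numba import cuda, float64

@cuda.jit
def shadowKernel(org_sized, enlarged_size, angle_mat, height_mat, weights, shadow_time_hrs):
    # one thread per output pixel (y, x)
    y, x = cuda.grid(2)
    if y >= org_sized.shape[0] or x >= org_sized.shape[1]:
        return
    height, width = height_mat.shape
    # per-thread scratch array holding the weight of every shadowed angle bin
    segments = cuda.local.array(361, float64)
    for i in range(361):
        segments[i] = 0.0
    for y2 in range(height):
        for x2 in range(width):
            overshadowed = (enlarged_size[y + y2, x + x2] - org_sized[y, x]) > height_mat[y2, x2]
            angle = angle_mat[y2, x2] if overshadowed else 0
            segments[angle] = weights[angle]
    total = 0.0
    for i in range(361):
        total += segments[i]
    shadow_time_hrs[y, x] = total

# hypothetical launch configuration: cover the 2001x2001 output with 16x16 thread blocks
threads = (16, 16)
blocks = ((org_sized.shape[0] + threads[0] - 1) // threads[0],
          (org_sized.shape[1] + threads[1] - 1) // threads[1])
shadowKernel[blocks, threads](org_sized, enlarged_size, angle_mat, height_mat, weights, shadow_time_hrs)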

Upvotes: 3
