Reputation: 131
I am currently looping over a numpy array to slice it and do some ndarray operations. The time needed is currently much too long, due to the size of the array of 2001*2001 elements. Therefore I hope that someone might suggest a hint on how to speed up the code:
import cupy as cp
from time import time
height, width = 187, 746
org_sized = cp.random.rand(2001, 2001) * 60
height_mat = cp.random.rand(height, width) * 100 # originally values getting larger from (0, width//2) to the outside with the distance squared
indices = cp.indices((height, width))
y_offsets = indices[0]
x_offsets = indices[1] - (width + 1)/2
angle_mat = cp.round_(2*(90 - cp.rad2deg(cp.arctan2(y_offsets, x_offsets))) + 180).astype(int)
weights = cp.random.rand(361)/ 10 # weights originally larger in the middle
# pad the org_sized matrix with zeros to fit a size of (2001+height, 2001+width)
west = cp.zeros((org_sized.shape[0], width // 2))
east = cp.zeros((org_sized.shape[0], round(width // 2)))
enlarged_size = cp.hstack((west, org_sized))
enlarged_size = cp.hstack((enlarged_size, east))
south = cp.zeros((height, enlarged_size.shape[1]))
enlarged_size = cp.vstack((enlarged_size, south))
shadow_time_hrs = cp.zeros_like(org_sized)
for y in range(org_sized.shape[0]):
    start_time = time()
    for x in range(org_sized.shape[1]):
        # shift the enlarged matrix and the angles so that they match in size and are correctly aligned
        short_elevations = enlarged_size[y:y+height, x:x+width]
        overshadowed = (short_elevations - org_sized[y, x]) > height_mat
        shadowed_angles = angle_mat * overshadowed
        shadowed_segments = cp.unique(shadowed_angles)
        angle_segments = shadowed_segments
        sum_hours = cp.sum(weights[angle_segments])
        shadow_time_hrs[y, x] = sum_hours
    if (y % 100) == 0:
        print(f"Computation for line {y} took: {time() - start_time}.")
At first I used Numba's @njit on the function calc_shadow_point, but it turned out that it was 2 times slower than without. Therefore I switched from numpy arrays to cupy arrays, which gave a speed-up of about 50 %. Probably the gain is limited because the arrays are so small.
Are there other ways than iterating for this kind of problem, or is there a way to iterate with multi-threading over the iterators?
Edit: I changed the code to a minimal example with the same runtime (1.1 s per line of org_sized). Somehow I have to increase the computation speed. Everything below 10 % of the current computation time would make the code usable. Due to the remarks I changed np.unique to cp.unique, but as remarked, it didn't result in a large speed-up, barely 6 %. I am currently using a GTX 1060, but if it would help I could manage to use a 1660 Ti.
Upvotes: 1
Views: 1009
Reputation: 50836
unique is slow (both on CPUs and GPUs) because it generally uses either a hash-map or a sort internally. Moreover, as you said, the arrays are too small to be efficient on the GPU, resulting in huge kernel overheads. Hopefully, you do not need it: you can use bincount (with minlength=361 and a flattened array) because you know the values are small positive integers in the bounded range 0:361. Actually, you do not even need to count the values like bincount does; you just want to know which values of the range 0:361 exist in shadowed_angles. Thus, a faster variant of bincount can be written using Numba. Moreover, the array computations can be done in one go, reducing the amount of allocations and the memory pressure. Finally, parallelism can be used to speed up the computation (using prange and parallel=True of Numba).
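For reference, a minimal sketch (not benchmarked) of the bincount-based variant inside the question's CuPy loop, reusing the variable names from the question; since only the presence of each angle matters, the counts are reduced to a 0/1 mask before weighting:
# sketch: replacement for the cp.unique call in the question's inner loop
counts = cp.bincount(shadowed_angles.ravel(), minlength=361)  # occurrences per angle bin 0..360
sum_hours = cp.sum(weights * (counts > 0))                    # each present angle weighted once
This avoids the sort/hash work inside unique, although the overhead of launching many small GPU kernels remains, which is why the Numba CPU version below pays off.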
Here is the resulting CPU-based implementation:
import numpy as np
import numba as nb

# Note: the input arrays (org_sized, enlarged_size, angle_mat, height_mat, weights,
# shadow_time_hrs) are assumed to be NumPy arrays here (convert CuPy arrays with
# cp.asnumpy if needed); weights is read as a global and captured by Numba.

@nb.njit
def computeSumHours(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, y, x):
    height, width = height_mat.shape
    short_elevations = enlarged_size[y:y+height, x:x+width]
    shadowed_segments = np.zeros(361)
    for y2 in range(height):
        for x2 in range(width):
            overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            shadowed_angle = angle_mat[y2, x2] * overshadowed
            # mark the angle once with its weight instead of counting occurrences
            shadowed_segments[shadowed_angle] = weights[shadowed_angle]
    return shadowed_segments.sum()

@nb.njit(parallel=True)
def computeLine(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, y):
    height, width = height_mat.shape
    for x in nb.prange(org_sized.shape[1]):
        shadow_time_hrs[y, x] = computeSumHours(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, y, x)

def computeAllLines(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs):
    height, width = height_mat.shape
    for y in range(org_sized.shape[0]):
        start_time = time()
        computeLine(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, y)
        if (y % 100) == 0:
            print("Computation for line %d took: %f." % (y, time() - start_time))

computeAllLines(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs)
Here are the timing results per iteration on my machine (using an i7-9600K and a GTX-1660-Super):
Reference implementation (CPU): 2.015 s
Reference implementation (GPU): 0.882 s
Optimized implementation (CPU): 0.082 s
This is 10 times faster than the reference GPU-based implementation and 25 times faster than the reference CPU-based one.
Note that the same technique can be used on the GPU, but not with CuPy's array operations: one needs to write a GPU kernel doing that (e.g. using CUDA). However, it is quite complex to do that efficiently.
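For illustration only, here is an untested sketch of what such a custom kernel could look like, launched through CuPy's RawKernel interface; the kernel name, the launch configuration and the int32 cast of angle_mat are assumptions, not part of the measured results:
shadow_kernel = cp.RawKernel(r'''
extern "C" __global__
void shadow_hours(const double* org_sized, const double* enlarged, const int* angle_mat,
                  const double* height_mat, const double* weights, double* out,
                  const int rows, const int cols, const int win_h, const int win_w,
                  const int enl_cols)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= cols || y >= rows) return;

    // per-thread flags marking which of the 361 angle bins are shadowed for this pixel
    bool seen[361];
    for (int i = 0; i < 361; ++i) seen[i] = false;

    const double ref = org_sized[y * cols + x];
    for (int y2 = 0; y2 < win_h; ++y2)
        for (int x2 = 0; x2 < win_w; ++x2) {
            const double elev = enlarged[(y + y2) * enl_cols + (x + x2)];
            const bool overshadowed = (elev - ref) > height_mat[y2 * win_w + x2];
            seen[overshadowed ? angle_mat[y2 * win_w + x2] : 0] = true;
        }

    // each shadowed angle contributes its weight exactly once, like unique/bincount
    double total = 0.0;
    for (int i = 0; i < 361; ++i)
        if (seen[i]) total += weights[i];
    out[y * cols + x] = total;
}
''', 'shadow_hours')

# hypothetical launch: one thread per output pixel of the 2001x2001 grid
# threads = (16, 16)
# blocks = ((2001 + 15) // 16, (2001 + 15) // 16)
# shadow_kernel(blocks, threads,
#               (org_sized, enlarged_size, angle_mat.astype(cp.int32), height_mat, weights,
#                shadow_time_hrs, cp.int32(2001), cp.int32(2001), cp.int32(height),
#                cp.int32(width), cp.int32(enlarged_size.shape[1])))
Getting such a kernel to actually run fast (memory access patterns, reusing the window between neighbouring pixels) is where the real complexity lies, as noted above.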
Upvotes: 3