Reputation: 2729
I have recently started looking into using CUDA to optimise searches over numeric arrays. I have a simplified piece of code below which demonstrates the issue.
import numpy as np
import time
from numba import cuda
@cuda.jit
def count_array4(device_array, pivot_point, device_output_array):
    for i in range(len(device_array)):
        if (pivot_point - 0.05) < device_array[i] < (pivot_point + 0.05):
            device_output_array[i] = True
        else:
            device_output_array[i] = False
width = 512
height = 512
size = width * height
print(f'Number of records {size}')
array_of_random = np.random.rand(size)
device_array = cuda.to_device(array_of_random)
start = time.perf_counter()
device_output_array = cuda.device_array(size)
print(f'Copy Host to Device: {time.perf_counter() - start}')
for x in range(10):
    start = time.perf_counter()
    count_array4[512, 512](device_array, .5, device_output_array)
    print(f'Run: {x} Time: {time.perf_counter() - start}')
start = time.perf_counter()
output_array = device_output_array.copy_to_host()
print(f'Copy Device to Host: {time.perf_counter() - start}')
print(np.sum(output_array))
This gives me the expected optimisation in processing; however, the time it takes to return the data to the host seems extremely high.
Number of records 262144
Copy Host to Device: 0.00031610000000004135
Run: 0 Time: 0.0958601
Run: 1 Time: 0.0001626999999999601
Run: 2 Time: 0.00012100000000003774
Run: 3 Time: 0.00011590000000005762
Run: 4 Time: 0.00011419999999995323
Run: 5 Time: 0.0001126999999999656
Run: 6 Time: 0.00011289999999997136
Run: 7 Time: 0.0001122999999999541
Run: 8 Time: 0.00011490000000002887
Run: 9 Time: 0.00011200000000000099
Copy Device to Host: 13.0583358
26110.0
I'm fairly sure that I'm missing something basic here, or a technique that I don't know the correct term to search for. If anyone can point me in the right direction, I'd be very grateful.
Upvotes: 3
Views: 1588
Reputation: 72348
Kernel launches are asynchronous and the driver can queue multiple launches. As a result, inside the loop you are only measuring kernel launch overhead, and the data transfer, which is a blocking call, then absorbs all of the kernel execution time. You can change this behaviour by modifying your code like this:
for x in range(10):
    start = time.perf_counter()
    count_array4[512, 512](device_array, .5, device_output_array)
    cuda.synchronize()
    print(f'Run: {x} Time: {time.perf_counter() - start}')
The synchronize call ensures that each kernel has completed and the device is idle before the next kernel is launched. The effect should be that each reported kernel run time increases, while the indicated transfer time decreases.
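If you would rather keep the kernel launches asynchronous and only get an honest number for the copy itself, you can synchronize once after the last launch, just before starting the transfer timer. Here is a minimal sketch of that idea, reusing the kernel and device arrays already defined in your question:

from numba import cuda
import time

# Launch the kernels as before; without a synchronize inside the loop,
# the per-iteration timings still mostly reflect launch overhead.
for x in range(10):
    count_array4[512, 512](device_array, .5, device_output_array)

# Wait for all queued kernels to finish, so the transfer timer below
# measures only the device-to-host copy and not leftover kernel work.
cuda.synchronize()

start = time.perf_counter()
output_array = device_output_array.copy_to_host()
print(f'Copy Device to Host: {time.perf_counter() - start}')

Numba also exposes CUDA events (cuda.event()) if you prefer device-side timing over host wall-clock measurements, but the wall-clock-plus-synchronize pattern above is usually sufficient for this kind of check.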
Upvotes: 4