Reputation: 192
I am comparing several Python modules/extensions or methods for achieving the following:
import numpy as np

def fdtd(input_grid, steps):
    grid = input_grid.copy()
    old_grid = np.zeros_like(input_grid)
    previous_grid = np.zeros_like(input_grid)
    l_x = grid.shape[0]
    l_y = grid.shape[1]
    for i in range(steps):
        np.copyto(previous_grid, old_grid)
        np.copyto(old_grid, grid)
        for x in range(l_x):
            for y in range(l_y):
                grid[x,y] = 0.0
                if 0 < x+1 < l_x:
                    grid[x,y] += old_grid[x+1,y]
                if 0 < x-1 < l_x:
                    grid[x,y] += old_grid[x-1,y]
                if 0 < y+1 < l_y:
                    grid[x,y] += old_grid[x,y+1]
                if 0 < y-1 < l_y:
                    grid[x,y] += old_grid[x,y-1]
                grid[x,y] /= 2.0
                grid[x,y] -= previous_grid[x,y]
    return grid
This function is a very basic implementation of the finite-difference time-domain (FDTD) method. I have implemented this function in several ways.
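As a point of comparison, the same update can also be written with NumPy slicing instead of explicit loops. This is only a sketch and not one of the implementations being benchmarked here: the name `fdtd_vectorized` is hypothetical, a floating-point input array is assumed, and the slices are chosen to reproduce the boundary tests of the loop version exactly (including the fact that `0 < x-1` skips the neighbour at index 0).

```python
import numpy as np

def fdtd_vectorized(input_grid, steps):
    """Slicing-based sketch of the loop version above (hypothetical name)."""
    grid = input_grid.copy()
    old_grid = np.zeros_like(grid)
    previous_grid = np.zeros_like(grid)
    for _ in range(steps):
        # Rotate the three buffers: previous <- old, old <- grid.
        # The former previous buffer is reused as the new output.
        previous_grid, old_grid, grid = old_grid, grid, previous_grid
        grid[...] = 0.0
        grid[:-1, :] += old_grid[1:, :]    # x+1 neighbour, valid for x <= l_x-2
        grid[2:, :] += old_grid[1:-1, :]   # x-1 neighbour, only where x-1 >= 1
        grid[:, :-1] += old_grid[:, 1:]    # y+1 neighbour, valid for y <= l_y-2
        grid[:, 2:] += old_grid[:, 1:-1]   # y-1 neighbour, only where y-1 >= 1
        grid /= 2.0
        grid -= previous_grid
    return grid

# Small usage example: a single unit impulse in the middle of a 5x5 grid.
impulse = np.zeros((5, 5))
impulse[2, 2] = 1.0
result = fdtd_vectorized(impulse, 3)
```

Swapping the buffer references avoids the two full-array copies per step; the arithmetic per cell is unchanged.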
Now I would like to compare the performance with NumbaPro CUDA.
This is the first time I am writing code for CUDA and I came up with the code below.
from numbapro import cuda, float32, int16
import numpy as np

@cuda.jit(argtypes=(float32[:,:], float32[:,:], float32[:,:], int16, int16, int16))
def kernel(grid, old_grid, previous_grid, steps, l_x, l_y):
    x,y = cuda.grid(2)
    for i in range(steps):
        previous_grid[x,y] = old_grid[x,y]
        old_grid[x,y] = grid[x,y]
    for i in range(steps):
        grid[x,y] = 0.0
        if 0 < x+1 and x+1 < l_x:
            grid[x,y] += old_grid[x+1,y]
        if 0 < x-1 and x-1 < l_x:
            grid[x,y] += old_grid[x-1,y]
        if 0 < y+1 and y+1 < l_x:
            grid[x,y] += old_grid[x,y+1]
        if 0 < y-1 and y-1 < l_x:
            grid[x,y] += old_grid[x,y-1]
        grid[x,y] /= 2.0
        grid[x,y] -= previous_grid[x,y]

def fdtd(input_grid, steps):
    grid = cuda.to_device(input_grid)
    old_grid = cuda.to_device(np.zeros_like(input_grid))
    previous_grid = cuda.to_device(np.zeros_like(input_grid))
    l_x = input_grid.shape[0]
    l_y = input_grid.shape[1]
    kernel[(16,16),(32,8)](grid, old_grid, previous_grid, steps, l_x, l_y)
    return grid.copy_to_host()
Unfortunately I get the following error:
File ".../fdtd_numbapro.py", line 98, in fdtd
return grid.copy_to_host()
File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/devicearray.py", line 142, in copy_to_host
File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/driver.py", line 1702, in device_to_host
File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/driver.py", line 772, in check_error
numbapro.cudadrv.error.CudaDriverError: CUDA_ERROR_LAUNCH_FAILED
Failed to copy memory D->H
I also tried grid.to_host(), and that did not work either. CUDA itself definitely works with NumbaPro on this system.
Upvotes: 1
Views: 995
Reputation: 420
I made some minor modifications to your original code to get it running in Parakeet:
1) Split compound comparisons such as "0 < x-1 < l_x" into "0 < x-1 and x-1 < l_x".
2) Replaced np.copyto with explicit indexed assignment (previous_grid[:,:] = old_grid).
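Applied to the function from the question, those two changes might look like the sketch below. It is a reconstruction from the description above rather than the code that was actually benchmarked, and the Parakeet decorator is left out so that it runs as plain Python with NumPy; the name `fdtd_modified` is hypothetical.

```python
import numpy as np

def fdtd_modified(input_grid, steps):
    """Loop version with the two changes listed above (hypothetical name)."""
    grid = input_grid.copy()
    old_grid = np.zeros_like(input_grid)
    previous_grid = np.zeros_like(input_grid)
    l_x = grid.shape[0]
    l_y = grid.shape[1]
    for i in range(steps):
        # Change 2: indexed assignment instead of np.copyto
        previous_grid[:, :] = old_grid
        old_grid[:, :] = grid
        for x in range(l_x):
            for y in range(l_y):
                grid[x, y] = 0.0
                # Change 1: each chained comparison split into two tests
                if 0 < x + 1 and x + 1 < l_x:
                    grid[x, y] += old_grid[x + 1, y]
                if 0 < x - 1 and x - 1 < l_x:
                    grid[x, y] += old_grid[x - 1, y]
                if 0 < y + 1 and y + 1 < l_y:
                    grid[x, y] += old_grid[x, y + 1]
                if 0 < y - 1 and y - 1 < l_y:
                    grid[x, y] += old_grid[x, y - 1]
                grid[x, y] /= 2.0
                grid[x, y] -= previous_grid[x, y]
    return grid

# Small usage example: one step on a 3x3 grid with a unit impulse in the centre.
start = np.zeros((3, 3))
start[1, 1] = 1.0
after_one_step = fdtd_modified(start, 1)
```

Neither change alters the arithmetic; both only avoid constructs that the compiler did not accept.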
I then compared the Parakeet runtimes for the C, OpenMP and CUDA backends against the original Python version and Numba's autojit, on a 1000x1000 grid with steps = 20:
Parakeet (backend = c) cold: fdtd : 0.5590s
Parakeet (backend = c) warm: fdtd : 0.1260s
Parakeet (backend = openmp) cold: fdtd : 0.4317s
Parakeet (backend = openmp) warm: fdtd : 0.1693s
Parakeet (backend = cuda) cold: fdtd : 2.6357s
Parakeet (backend = cuda) warm: fdtd : 0.2455s
Numba (autojit) cold: 672.3666s
Numba (autojit) warm: 657.8858s
Python: 203.3907s
Since there is little readily available parallelism in your code, the parallel backends actually do worse than the sequential one. This is largely due to differences in which loop optimizations Parakeet runs for each backend, along with some extra overhead from CUDA memory transfers and from starting OpenMP thread groups. I'm not sure why Numba's autojit is so slow here; I expect it would be faster with type annotations or with NumbaPro.
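The cold and warm figures above presumably differ because the first call to a JIT-compiled function also pays for compilation; that reading is an assumption, since the measurement method is not spelled out here. A minimal sketch of how such a pair of timings could be collected, using only the standard library and a hypothetical helper named `time_cold_warm` (Python 3 is assumed for `time.perf_counter`):

```python
import time

def time_cold_warm(func, *args):
    """Time the first (cold) and second (warm) call of func(*args), in seconds."""
    start = time.perf_counter()
    func(*args)
    cold = time.perf_counter() - start

    start = time.perf_counter()
    func(*args)
    warm = time.perf_counter() - start
    return cold, warm

# Stand-in workload; in practice this would be the compiled fdtd function.
cold, warm = time_cold_warm(sum, range(1000))
print("cold: %.4fs  warm: %.4fs" % (cold, warm))
```

For an uncompiled function the two numbers should be close to each other, which is why only one figure is meaningful for plain Python.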
Upvotes: 1
Reputation: 146
The user has since resolved the problem. For reference, here is the discussion of it on the Anaconda mailing list: https://groups.google.com/a/continuum.io/forum/#!searchin/anaconda/fdtd/anaconda/VgiN4h37UrA/18tAc60EIkcJ
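One thing worth checking in the question's launch line: kernel[(16,16),(32,8)] starts 16*32 by 16*8 = 512 by 128 threads regardless of the array shape, and the kernel indexes grid[x,y] without comparing x and y against l_x and l_y. Whether that is what caused the error is not confirmed here. The sketch below only shows how a grid of blocks that covers an arbitrary array shape could be computed; the helper name `blocks_to_cover` is hypothetical and is not part of NumbaPro.

```python
def blocks_to_cover(shape, threads_per_block=(32, 8)):
    """Return the number of blocks per dimension needed to cover `shape`.

    Uses ceiling division so that partially filled blocks are included.
    """
    return tuple((n + t - 1) // t for n, t in zip(shape, threads_per_block))

# Threads spanned by the fixed (16, 16) grid of (32, 8) blocks in the question.
threads_launched = (16 * 32, 16 * 8)

# Blocks needed for the 1000x1000 grid used in the benchmark above.
blocks = blocks_to_cover((1000, 1000))
```

Because the block count is rounded up, some threads fall outside the array, so the kernel would still need a guard along the lines of "if x < l_x and y < l_y" before touching any element. Note also that the y-direction tests in the question compare against l_x rather than l_y, which only matters for non-square grids.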
Upvotes: 3