
Converting function to NumbaPro CUDA

I am comparing several Python modules/extensions or methods for achieving the following:

import numpy as np

def fdtd(input_grid, steps):
    grid = input_grid.copy()
    old_grid = np.zeros_like(input_grid)
    previous_grid = np.zeros_like(input_grid)

    l_x = grid.shape[0]
    l_y = grid.shape[1]

    for i in range(steps):
        np.copyto(previous_grid, old_grid)
        np.copyto(old_grid, grid)

        for x in range(l_x):
            for y in range(l_y):
                grid[x,y] = 0.0
                if 0 < x+1 < l_x:
                    grid[x,y] += old_grid[x+1,y]
                if 0 < x-1 < l_x:
                    grid[x,y] += old_grid[x-1,y]
                if 0 < y+1 < l_y:
                    grid[x,y] += old_grid[x,y+1]
                if 0 < y-1 < l_y:
                    grid[x,y] += old_grid[x,y-1]

                grid[x,y] /= 2.0
                grid[x,y] -= previous_grid[x,y]

    return grid

This function is a very basic implementation of the finite-difference time-domain (FDTD) method. I have implemented it in several ways:

with more NumPy routines
in Cython
using Numba (auto)jit.
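For illustration, the first variant in the list (more NumPy routines) could look roughly like the sketch below. It is not the code actually used in the comparison: the name fdtd_numpy is hypothetical, and it assumes a floating-point input grid. The slices deliberately mirror the bounds tests of the loop version, including `0 < x-1`, which skips the neighbour at index 0.

```python
import numpy as np

def fdtd_numpy(input_grid, steps):
    # Hypothetical vectorised counterpart of the loop-based fdtd above.
    # Assumes input_grid is a 2-D floating-point array.
    grid = input_grid.copy()
    old_grid = np.zeros_like(input_grid)
    previous_grid = np.zeros_like(input_grid)

    for _ in range(steps):
        previous_grid[...] = old_grid
        old_grid[...] = grid

        grid[...] = 0.0
        # 0 < x+1 < l_x  ->  x in [0, l_x-2]
        grid[:-1, :] += old_grid[1:, :]
        # 0 < x-1 < l_x  ->  x in [2, l_x-1] (neighbour index 0 is skipped)
        grid[2:, :] += old_grid[1:-1, :]
        # 0 < y+1 < l_y  ->  y in [0, l_y-2]
        grid[:, :-1] += old_grid[:, 1:]
        # 0 < y-1 < l_y  ->  y in [2, l_y-1] (neighbour index 0 is skipped)
        grid[:, 2:] += old_grid[:, 1:-1]

        grid /= 2.0
        grid -= previous_grid

    return grid
```

Because every update reads only old_grid and previous_grid, replacing the two inner loops with whole-array slices does not change the result.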

Now I would like to compare the performance with NumbaPro CUDA.

This is the first time I am writing code for CUDA and I came up with the code below.

from numbapro import cuda, float32, int16
import numpy as np

@cuda.jit(argtypes=(float32[:,:], float32[:,:], float32[:,:], int16, int16, int16))
def kernel(grid, old_grid, previous_grid, steps, l_x, l_y):

    x,y = cuda.grid(2)

    for i in range(steps):
        previous_grid[x,y] = old_grid[x,y]
        old_grid[x,y] = grid[x,y]  

    for i in range(steps):

        grid[x,y] = 0.0

        if 0 < x+1 and x+1 < l_x:
            grid[x,y] += old_grid[x+1,y]
        if 0 < x-1 and x-1 < l_x:
            grid[x,y] += old_grid[x-1,y]
        if 0 < y+1 and y+1 < l_x:
            grid[x,y] += old_grid[x,y+1]
        if 0 < y-1 and y-1 < l_x:
            grid[x,y] += old_grid[x,y-1]

        grid[x,y] /= 2.0
        grid[x,y] -= previous_grid[x,y]


def fdtd(input_grid, steps):

    grid = cuda.to_device(input_grid)
    old_grid = cuda.to_device(np.zeros_like(input_grid))
    previous_grid = cuda.to_device(np.zeros_like(input_grid))

    l_x = input_grid.shape[0]
    l_y = input_grid.shape[1]

    kernel[(16,16),(32,8)](grid, old_grid, previous_grid, steps, l_x, l_y)

    return grid.copy_to_host()

Unfortunately I get the following error:

  File ".../fdtd_numbapro.py", line 98, in fdtd
    return grid.copy_to_host()
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/devicearray.py", line 142, in copy_to_host
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/driver.py", line 1702, in device_to_host
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/driver.py", line 772, in check_error
numbapro.cudadrv.error.CudaDriverError: CUDA_ERROR_LAUNCH_FAILED
Failed to copy memory D->H

I also tried grid.to_host(), and that did not work either. CUDA itself is definitely working with NumbaPro on this system.

Upvotes: 1

Answers (2)

Alex Rubinsteyn

Reputation: 420

I made some minor modifications to your original code to get it running in Parakeet:

1) Split compound comparisons such as "0 < x-1 < l_x" into "0 < x-1 and x-1 < l_x".

2) Replaced np.copyto with explicit indexed assignment (previous_grid[:,:] = old_grid).
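Applied to the function from the question, the two changes might look like the sketch below. The name fdtd_split is hypothetical, and the Parakeet decorator is left out so that the snippet runs as plain Python with NumPy.

```python
import numpy as np

def fdtd_split(input_grid, steps):
    # Hypothetical name; same algorithm as the question's fdtd.
    grid = input_grid.copy()
    old_grid = np.zeros_like(input_grid)
    previous_grid = np.zeros_like(input_grid)

    l_x = grid.shape[0]
    l_y = grid.shape[1]

    for i in range(steps):
        # Change 2: indexed assignment instead of np.copyto
        previous_grid[:, :] = old_grid
        old_grid[:, :] = grid

        for x in range(l_x):
            for y in range(l_y):
                grid[x, y] = 0.0
                # Change 1: each compound comparison split into two tests
                if 0 < x + 1 and x + 1 < l_x:
                    grid[x, y] += old_grid[x + 1, y]
                if 0 < x - 1 and x - 1 < l_x:
                    grid[x, y] += old_grid[x - 1, y]
                if 0 < y + 1 and y + 1 < l_y:
                    grid[x, y] += old_grid[x, y + 1]
                if 0 < y - 1 and y - 1 < l_y:
                    grid[x, y] += old_grid[x, y - 1]

                grid[x, y] /= 2.0
                grid[x, y] -= previous_grid[x, y]

    return grid
```

Both changes are purely syntactic, so the output should match the original function.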

I then compared the Parakeet runtimes for the C, OpenMP, and CUDA backends against the original Python version and Numba's autojit, on a 1000x1000 grid with steps = 20.

Parakeet (backend = c) cold: fdtd : 0.5590s
Parakeet (backend = c) warm: fdtd : 0.1260s

Parakeet (backend = openmp) cold: fdtd : 0.4317s
Parakeet (backend = openmp) warm: fdtd : 0.1693s

Parakeet (backend = cuda) cold: fdtd : 2.6357s
Parakeet (backend = cuda) warm: fdtd : 0.2455s

Numba (autojit) cold: 672.3666s
Numba (autojit) warm: 657.8858s

Python: 203.3907s

Since there is little readily available parallelism in your code, the parallel backends actually do worse than the sequential one. This is largely due to differences in which loop optimizations Parakeet runs for each backend, along with the extra overhead of CUDA memory transfers and of starting OpenMP thread groups. I'm not sure why Numba's autojit is so slow here, but I expect it would be faster with type annotations or with NumbaPro.
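For anyone who wants to reproduce this kind of cold/warm comparison, here is a minimal timing sketch. It assumes that "cold" means the first call (including any JIT compilation) and "warm" means an immediate second call with the same arguments. The helper name time_cold_warm is hypothetical and is not the harness used for the numbers above; time.perf_counter requires Python 3.

```python
import time

def time_cold_warm(func, *args):
    # Hypothetical helper: time the first ("cold") and second ("warm") call.
    start = time.perf_counter()
    func(*args)
    cold = time.perf_counter() - start

    start = time.perf_counter()
    func(*args)
    warm = time.perf_counter() - start

    return cold, warm
```

With a JIT-compiled function, the gap between the two values gives a rough idea of the compilation cost.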

Upvotes: 1

sklam

Reputation: 146

The user has since resolved the problem. For reference, here is the discussion on the Anaconda mailing list: https://groups.google.com/a/continuum.io/forum/#!searchin/anaconda/fdtd/anaconda/VgiN4h37UrA/18tAc60EIkcJ

Upvotes: 3
