user2983638

Reputation: 941

Solving 1D heat equation on GPU in Numba

I am new to GPU programming, and I am trying to write a Numba kernel to numerically solve the 1D heat equation. I also wrote a NumPy version of the PDE solver, and it turned out that the GPU kernel does not produce the correct result. Below is a comparison of the state vectors computed by the two scripts:
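For reference, the equation being solved is the 1D heat equation with diffusion coefficient D (0.005 in the code below), and both scripts implement the same explicit forward-time, centred-space update:

$$\frac{\partial u}{\partial t} = D\,\frac{\partial^2 u}{\partial x^2},
\qquad
u_i^{\,n+1} = u_i^{\,n} + \frac{D\,\Delta t}{\Delta x^2}\left(u_{i+1}^{\,n} - 2u_i^{\,n} + u_{i-1}^{\,n}\right)$$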

[Plot: final state vectors computed by the GPU kernel and by the NumPy solver]

Moreover, the kernel produces slightly different results on each run. It is probably a problem related to thread management, even though I synchronize the threads at every iteration. Any help would be really appreciated.

from numba import cuda, void, float32
import numpy as np
import scipy.stats as stats
import time
import matplotlib.pyplot as plt

    
##################### Numba GPU Version


@cuda.jit(void(float32[::1], float32[::1]))
def solve_pde(u, parameters):

    # Space and time parameters
    dx = parameters[0]
    dt = parameters[1]
    t = parameters[2]
    t_end = parameters[3]
    u_size = u.size

    # Index of thread on GPU
    i = cuda.grid(1)

    # Condition to avoid threads accessing indices out of array
    if i < u_size:

        while t < t_end:
            
            if(i in [0, 1, u_size-2, u_size-1]):  # Natural boundary conditions
                
                u[i] = np.float32(0.)
            
            else:
                
                # Compute second order derivatives
                RHS = np.float32(0.005)*(u[i + 1] - 2*u[i] + u[i - 1])/(dx*dx)

                # Update state vector
                u[i] += RHS*dt

            # Update time
            t += dt
            
            # Wait until all threads finish computing
            cuda.syncthreads()


# Space and time parameters
dx = 0.01
dt = 0.01
t0 = 0
t_end = 200
parameters = np.array([dx, dt, t0, t_end], dtype="float32")

# Initial state vector
x = np.linspace(0, 6, int(6/dx), dtype="float32")
u = np.array(stats.norm.pdf(x, 3, 0.3), dtype="float32")

# Manage the number of threads
threads_per_block = 32
blocks_per_grid = (u.size + (threads_per_block - 1)) \
        // threads_per_block

# Send the state vector and the parameters to the device
d_u = cuda.to_device(u)
d_parameters = cuda.to_device(parameters)

# Start timer
start = time.perf_counter()

# Start parallel simulations
solve_pde[blocks_per_grid, threads_per_block](d_u, d_parameters)

# Move the final state vector to the host
u_end = d_u.copy_to_host()

# Measure the time elapsed and print the result
end = time.perf_counter()
print(end - start)

# Plot the final state vector
plt.figure(figsize=(14, 10))
plt.plot(x, u_end, 'b-')



##################### Numpy Version



u = np.array(stats.norm.pdf(x, 3, 0.3), dtype="float32")
u_size = u.size

t = t0


while t < t_end:
    
    for i in range(u_size):
        
        if(i in [0, 1, u_size-2, u_size-1]):
            
            u[i] = 0
        
        else:
               
            RHS = 0.005*(u[i + 1] - 2*u[i] + u[i - 1])/(dx*dx)
            u[i] += RHS*dt
    
    t += dt


plt.figure(figsize=(14, 10))
plt.plot(x, u, 'r-')  

Upvotes: 1

Views: 946

Answers (2)

user2983638

Reputation: 941

I am posting the code I wrote following the suggestions in the other answer. I compared the results generated by the Numba and NumPy scripts with the analytical solution of the 1D heat equation with a Gaussian initial condition, and now everything works smoothly.

[Plot: GPU and NumPy numerical solutions overlaid on the analytical solution]

Here is the new code:

from numba import cuda, void, float32
import numpy as np
import scipy.stats as stats
import time
import matplotlib.pyplot as plt

    
##################### Numba GPU Version


@cuda.jit(void(float32[::1], float32[::1], float32[::1]))
def solve_pde(u, v, parameters):

    # Equation parameters
    dx = parameters[0]
    dt = parameters[1]
    D =  parameters[2]
    u_size = u.size

    # Index of thread on GPU
    i = cuda.grid(1)

    # Condition to avoid threads accessing indices out of array
    if i < u_size:
            
        if(i in [0, 1, u_size-2, u_size-1]):  # Natural boundary conditions
                
            v[i] = np.float32(0.)
            
        else:
                
            # Compute second order derivatives
            RHS = D*(u[i + 1] - 2*u[i] + u[i - 1])/(dx*dx)

            # Update state vector
            v[i] = u[i] + RHS*dt


# Equation parameters
dx = 0.01
dt = 0.01
t0 = 0
t_end = 100
D = 0.005  # Diffusion coefficient
parameters = np.array([dx, dt, D], dtype="float32")

# Initial state vector
mu = 3
std = 0.3
x = np.linspace(0, 6, int(6/dx), dtype="float32")
u = np.array(stats.norm.pdf(x, mu, std), dtype="float32")

# Manage the number of threads
threads_per_block = 32
blocks_per_grid = (u.size + (threads_per_block - 1)) \
        // threads_per_block

# Start timer
start = time.perf_counter()

# Initialize the state vector u and the buffer v on the GPU
d_u = cuda.to_device(u)
d_v = cuda.device_array(u.shape, dtype="float32")
d_parameters = cuda.to_device(parameters)

# Loop across time
for i in range(int((t_end-t0)/dt)):
    
    if (i % 2) == 0:
        
        solve_pde[blocks_per_grid, threads_per_block](d_u, d_v, d_parameters)
    
    else:
        
        solve_pde[blocks_per_grid, threads_per_block](d_v, d_u, d_parameters)
    
    
# Move the final state vector to the host
u = d_u.copy_to_host()

# Measure the time elapsed and print the result
end = time.perf_counter()
print(end - start)

# Plot the final state vector
plt.figure(figsize=(14, 10))
plt.plot(x, u, 'b-')



##################### Numpy Version



u = np.array(stats.norm.pdf(x, mu, std), dtype="float32")
v = np.empty_like(u)
u_size = u.size

start = time.perf_counter()


t = t0


while t < t_end:

    for i in range(u_size):

        if i in [0, 1, u_size - 2, u_size - 1]:  # Natural boundary conditions

            v[i] = 0

        else:

            # Compute second order derivatives
            RHS = D*(u[i + 1] - 2*u[i] + u[i - 1])/(dx*dx)
            v[i] = RHS*dt

    # Apply the increments (the boundary increments are zero)
    u += v
    t += dt


end = time.perf_counter()
print(end-start)
plt.figure(figsize=(14, 10))
plt.plot(x, u, 'r-')



##################### Analytical Solution



plt.figure(figsize=(14, 10))

# Gaussian solution whose variance has grown by 2*D*t_end
var = 2*std**2 + 4*D*t_end
plt.plot(x, np.exp(-((x - mu)**2)/var)/np.sqrt(np.pi*var), 'g-')
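The analytical solution is simply the initial Gaussian with its variance increased by 2Dt:

$$u(x, t) = \frac{1}{\sqrt{2\pi\left(\sigma^2 + 2Dt\right)}}\exp\!\left(-\frac{(x-\mu)^2}{2\left(\sigma^2 + 2Dt\right)}\right)$$

which is the expression plotted above with $\mu = 3$, $\sigma = 0.3$ and $t = t_\text{end}$.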

Upvotes: 1

Jérôme Richard

Reputation: 50358

The issue certainly comes from u being both read and written by GPU threads at the same time, causing a race condition. You need to work on two different buffers to prevent this problem. Note that you can simply swap the buffers at the end of each computation step (see the sketch below).

Moreover, note that cuda.syncthreads does not "wait until all threads finish computing". It is a block-level synchronization barrier. AFAIK, if you want to wait for all threads to finish a given time step, the only way is to run the CUDA kernel once per time step.
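A minimal sketch of the host-side loop this implies, reusing the names from the question and assuming a two-buffer kernel like the one in the other answer (solve_pde reads u and writes v):

# One kernel launch per time step: launches on the same CUDA stream run in
# order, so each step only starts once the previous one has finished.
n_steps = int((t_end - t0)/dt)

for _ in range(n_steps):
    solve_pde[blocks_per_grid, threads_per_block](d_u, d_v, d_parameters)
    d_u, d_v = d_v, d_u  # swap the input and output buffers

u_end = d_u.copy_to_host()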

Note that launching a kernel is quite expensive, so running such a computation on a GPU is only useful compared to CPUs if the array to be computed is huge (e.g. certainly at least 100_000 items in your case). Besides this, note that 1.0/(dx*dx) can be precomputed to avoid a slow division inside the kernel.
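For instance (a sketch against the question's code; inv_dx2 would still need to reach the kernel, e.g. as an extra entry in the parameters array):

# Host side: compute the inverse of dx*dx once, up front
inv_dx2 = np.float32(1.0/(dx*dx))

# Device side (inside the kernel): multiply instead of dividing
# RHS = D*(u[i + 1] - 2*u[i] + u[i - 1])*inv_dx2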

Upvotes: 2
