cirobr

Reputation: 31

CUDA example in Julia doesn't use GPU

I'm taking my first steps running Julia 1.6.5 code on a GPU. For some reason, the GPU does not seem to be used at all. These are the steps:

First of all, my GPU passed on the test recommended at CUDA Julia Docs:

# install the package
using Pkg
Pkg.add("CUDA")
  
# smoke test (this will download the CUDA toolkit)
using CUDA
CUDA.versioninfo()

using Pkg
Pkg.test("CUDA")    # takes ~40 minutes if using 1 thread

Secondly, the code below took around 8 minutes (real time), despite supposedly running on my GPU. It generates and multiplies two 10000 x 10000 matrices, ten times:

using CUDA
using Random
N = 10000

a_d = CuArray{Float32}(undef, (N, N))
b_d = CuArray{Float32}(undef, (N, N))
c_d = CuArray{Float32}(undef, (N, N))

for i in 1:10
    global a_d = randn(N, N)
    global b_d = randn(N, N)

    global c_d = a_d * b_d
end

global a_d = nothing
global b_d = nothing
global c_d = nothing
GC.gc()

Terminal output was as follows:

(base) ciro@ciro-G3-3500:~/projects/julia/cuda$ time julia cuda-gpu.jl

real    8m13,016s
user    50m39,146s
sys 13m16,766s

Then an equivalent CPU-only version was run. Its execution time was essentially the same:

using Random
N = 10000

for i in 1:10
    a = randn(N, N)
    b = randn(N, N)

    c = a * b
end

Execution:

(base) ciro@ciro-G3-3500:~/projects/julia/cuda$ time julia cuda-cpu.jl

real    8m2,689s 
user    50m9,567s 
sys 13m3,738s

Moreover, watching the NVTOP screen command, it is strange to see the GPU memory and cores being loaded and unloaded while the process still uses about 800% CPU (eight cores), the same usage as the CPU-only version.

Any hint is greatly appreciated.

Upvotes: 1

Views: 1206

Answers (2)

cirobr

Reputation: 31

After playing around a little, the following code also works. It is interesting to note the "global" declaration on the c_d variable: without it, the system complains about ambiguity between the global CuArray c_d and an unintended new local variable c_d.

using CUDA
using Random
N = 10000

a_d = CuArray{Float32}(undef, (N, N))
b_d = CuArray{Float32}(undef, (N, N))
c_d = CuArray{Float32}(undef, (N, N))

for i in 1:10
    randn!(a_d)
    randn!(b_d)

    global c_d = a_d * b_d
end

global a_d = nothing
global b_d = nothing
global c_d = nothing
GC.gc()

The outcome on a relatively modest GPU confirms the speedup:

(base) ciro@ciro-Inspiron-7460:~/projects/julia/cuda$ time julia cuda-gpu.jl

real    0m38,243s
user    0m36,810s
sys 0m1,413s
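The need for the global annotation comes from Julia's soft-scope rule, which can be demonstrated without CUDA at all (a minimal sketch, assuming Julia 1.5 or later run as a script):

```julia
# Without `global`, assigning to y inside a top-level for loop in a script
# triggers an ambiguity warning and creates a new local variable, leaving
# the global y untouched (here `y += i` would even throw UndefVarError,
# since the new local y is read before it is assigned).
y = 0
for i in 1:3
    global y += i   # `global` makes the assignment target the global y
end
@assert y == 6      # the loop really updated the global variable
```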

Upvotes: 1

S.Surace

Reputation: 206

There are a few things that prevent your code from working properly and fast.

First, you are overwriting your allocated CuArrays with ordinary CPU arrays by calling randn, which means the matrix multiplication runs on the CPU. You should use CUDA.randn instead. Better yet, CUDA.randn! fills the existing arrays in place, so no memory is allocated beyond what was already allocated.
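The allocating-versus-in-place distinction also exists for ordinary CPU arrays, which may make it easier to see (a plain-Julia sketch, no GPU required):

```julia
using Random

a = zeros(Float32, 3, 3)
randn!(a)        # fills the existing array in place; a stays a Matrix{Float32}
b = randn(3, 3)  # allocates a brand-new Matrix{Float64}

@assert a isa Matrix{Float32}
@assert b isa Matrix{Float64}
```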

Secondly, you are using global variables and the global scope, which is bad for performance.

Thirdly, you are using C = A * B, which allocates a new array for the result on every iteration. You should use the in-place mul! from LinearAlgebra instead.
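The same in-place pattern works with ordinary CPU arrays (a minimal sketch):

```julia
using LinearAlgebra

A = rand(Float32, 4, 4)
B = rand(Float32, 4, 4)
C = similar(A)   # preallocate the output once

mul!(C, A, B)    # writes A * B into C without allocating a new result
@assert C ≈ A * B
```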

I would propose the following solution:

using CUDA
using LinearAlgebra
N = 10000

a_d = CuArray{Float32}(undef, (N, N))
b_d = CuArray{Float32}(undef, (N, N))
c_d = CuArray{Float32}(undef, (N, N))

# wrap your code in a function
# `!` is a convention to indicate that the arguments will be modified
function randn_mul!(A, B, C)
    CUDA.randn!(A)
    CUDA.randn!(B)
    mul!(C, A, B)
end

# use CUDA.@time to time the GPU execution time and memory usage:
for i in 1:10
    CUDA.@time randn_mul!(a_d, b_d, c_d)
end

which runs pretty fast on my machine:

$ time julia --project=. cuda-gpu.jl
  2.392889 seconds (4.69 M CPU allocations: 263.799 MiB, 6.74% gc time) (2 GPU allocations: 1024.000 MiB, 0.05% memmgmt time)
  0.267868 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
  0.274376 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
  0.268574 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
  0.274514 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
  0.272016 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
  0.272668 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
  0.273441 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
  0.274318 seconds (59 CPU allocations: 1.672 KiB) (2 GPU allocations: 1024.000 MiB, 0.01% memmgmt time)
  0.272389 seconds (60 CPU allocations: 2.000 KiB) (2 GPU allocations: 1024.000 MiB, 0.00% memmgmt time)

real    0m8.726s
user    0m6.030s
sys     0m0.554s

Note that the first time the function was called, the execution time and memory usage were higher, because the measurement includes compilation: Julia compiles a function the first time it is called with a given type signature.
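This first-call compilation overhead can be observed with any Julia function (a plain-CPU sketch):

```julia
f(x) = sum(x .* 2)

@time f(rand(1_000_000))   # first call: time includes JIT compilation
@time f(rand(1_000_000))   # second call: execution only, far fewer allocations
```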

Upvotes: 5
