Reputation: 33
I have some familiarity with Halide and am starting to learn to use CUDA with it. To start, I ran the cuda_mat_mul app that comes with the Halide source code.
I got some reasonable if unimpressive timings:
CPU, auto-scheduled (Adams2019): 4.2 ms
GPU, auto-scheduled (Anderson2021): 3.0 ms
GPU, manual schedule: 1.2 ms
cuBLAS: 0.42 ms
Does this seem right? I have an Nvidia RTX 3050 Ti laptop GPU and a Core i5-11400H CPU.
I then tried to get another sample app, camera_pipe, running on the GPU. It comes with schedules for both CPU and GPU, but its CMake file builds for CPU only. I modified it to do a CUDA build by setting FEATURES cuda cuda_capability_50 and giving it CUDA_INCLUDE_DIRS and CUDA_LIBRARIES, just like in the cuda_mat_mul app.
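For reference, the change looked roughly like this (a sketch modeled on the cuda_mat_mul CMakeLists; the generator and executable target names here are my guesses and may not match the camera_pipe app exactly):

    # Sketch only, modeled on apps/cuda_mat_mul/CMakeLists.txt.
    # Target names are assumptions; check the camera_pipe CMakeLists.
    find_package(CUDA)  # provides CUDA_INCLUDE_DIRS / CUDA_LIBRARIES

    add_halide_library(camera_pipe FROM camera_pipe.generator
                       FEATURES cuda cuda_capability_50)

    add_executable(camera_pipe_process process.cpp)
    target_include_directories(camera_pipe_process PRIVATE ${CUDA_INCLUDE_DIRS})
    target_link_libraries(camera_pipe_process PRIVATE camera_pipe ${CUDA_LIBRARIES})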
I also added output.copy_to_host(); in process.cpp.
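That call sits right after the pipeline invocation, roughly like this (a sketch; the argument list is abbreviated):

    // Sketch of the relevant part of process.cpp (arguments abbreviated).
    // A GPU schedule leaves the result in device memory, so it has to be
    // copied back before the host can read or save it.
    camera_pipe(input, /* ...other parameters... */ output);
    output.copy_to_host();  // no-op for the CPU schedules, required for GPU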
I recorded the following run times:
CUDA, manual schedule: 1270 ms
CPU, auto-scheduled (Adams2019): 10.5 ms
CPU, manual schedule: 9.9 ms
So CUDA was far slower than the CPU. That was with a single timing iteration, so I then tried 100 iterations:
CUDA, manual schedule, 1st iteration: 1270 ms
next 100 iterations: 5.7 ms
I then tried 100 iterations on the CPU:
CPU, manual schedule, 1st iteration: 10.8 ms
next 100 iterations: 5.8 ms
CPU, auto-scheduled (Adams2019), 100 iterations: 7 ms
Why is the first GPU iteration so slow? Why are subsequent runs almost the same speed on CPU and GPU?
I verified that it was generating the correct output image. I also tried setting input.set_host_dirty(), but it made no difference.
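For reference, the timing harness looks roughly like this (a sketch using the halide_benchmark.h helper that ships with Halide; arguments abbreviated, and the device_sync is something I added so asynchronous GPU work cannot escape the timer):

    #include "halide_benchmark.h"
    #include <cstdio>

    // Sketch of the timing loop. The first call pays the one-time CUDA
    // context creation and kernel compilation cost; the benchmark loop
    // then measures steady-state runs.
    camera_pipe(input, /* ... */ output);  // warm-up / 1st iteration
    double best = Halide::Tools::benchmark(10, 10, [&]() {
        camera_pipe(input, /* ... */ output);
        output.device_sync();  // wait for the GPU to actually finish
    });
    printf("steady state: %g ms\n", best * 1e3);

(benchmark returns the best per-iteration time in seconds, so 10 samples of 10 iterations matches the 100-iteration numbers above.)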
I tried auto-scheduling on the GPU using Anderson2021 but got the following error:
C:\Users\cordo\source\repos\camera_pipe17\out\build\x64-Debug\camera_pipe_auto_schedule.runtime.lib(camera_pipe_auto_schedule.runtime.obj) : error LNK2005: .weak._ZN6Halide7Runtime8Internal13custom_mallocE.default.halide_internal_aligned_alloc already defined in camera_pipe.runtime.lib(camera_pipe.runtime.obj)
There were several more similar-looking errors. Thanks.
Upvotes: 1
Views: 121
Reputation: 1436
The camera_pipe app is written to exploit the kinds of fixed-point instructions that exist on CPUs but not on GPUs, so GPUs are going to be pretty bad at it compared to most other apps. On my machine it's 1.5 ms on my CPU (i9-9960X) and 1.15 ms on my GPU (RTX 2060). As others have said, the first iteration is slow because that's when Halide initializes the CUDA library and compiles all the shaders.
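To illustrate the kind of pattern involved (this example is mine, not code from camera_pipe): an 8-bit rounding average like the one below gets pattern-matched to a single CPU instruction (pavgb on x86, vrhadd on ARM), while a GPU backend has no equivalent and executes the widened arithmetic literally.

    #include "Halide.h"
    using namespace Halide;

    // Illustrative only, not from camera_pipe: an 8-bit rounding average.
    // CPU backends match this to one fixed-point instruction; GPU backends
    // emit the widen/add/shift/narrow sequence as written.
    Func rounding_average(Func a, Func b) {
        Var x;
        Func avg("avg");
        avg(x) = cast<uint8_t>((cast<uint16_t>(a(x)) + cast<uint16_t>(b(x)) + 1) / 2);
        return avg;
    }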
For cuda_mat_mul, Nvidia has spent a lot of engineering hours writing optimized matrix multiplies for every GPU. Halide's schedule is tuned for an RTX 2060; on that GPU my timings are 0.69 ms for Halide and 0.51 ms for cuBLAS.
Make sure you run with the environment variable HL_CUDA_JIT_MAX_REGISTERS=256 set. Matrix multiply is very sensitive to the number of registers available, to the point where it was worth hurting occupancy to get more. In fact, it's so sensitive to register availability that I think doing better would require writing SASS directly, or at least convincing LLVM not to reorder loads so far ahead of their uses, to shrink the live ranges a bit.
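For example, you can set it from the harness itself, before the first pipeline call (a sketch; this assumes the Halide CUDA runtime reads the variable when the driver JIT-compiles the generated PTX on the first run, which is my understanding, and _putenv is the Windows CRT call, setenv on POSIX):

    #include <cstdlib>

    int main() {
        // Must happen before the first pipeline invocation, which is when
        // the generated PTX gets JIT-compiled and this limit is applied.
        _putenv("HL_CUDA_JIT_MAX_REGISTERS=256");
        // ... run and time the pipeline as usual ...
        return 0;
    }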
Upvotes: 1