Reputation: 11
I'm new to OpenCL/PyOpenCL and I'm trying to understand how OpenCL on the Raspberry Pi GPU (VideoCore IV) compares to Numpy (on the CPU) for vector and matrix multiplication on my hardware.
I'm using VC4CL as implementation of OpenCL 1.2 for the VideoCore IV GPU. (https://github.com/doe300/VC4CL)
I'm getting terrible results and I can't understand why. I don't know whether the problem is configuration related or code related, or whether I simply cannot beat Numpy with OpenCL for this kind of workload.
This is my clinfo output:
Number of platforms 1
Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU
Platform Vendor doe300
Platform Version OpenCL 1.2 VC4CL 0.4
Platform Profile EMBEDDED_PROFILE
Platform Extensions cl_khr_il_program cl_khr_spir cl_khr_create_command_queue cl_altera_device_temperature cl_altera_live_object_tracking cl_khr_icd cl_vc4cl_performance_counters
Platform Extensions function suffix VC4CL
Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU
Number of devices 1
Device Name VideoCore IV GPU
Device Vendor Broadcom
Device Vendor ID 0xa5c
Device Version OpenCL 1.2 VC4CL 0.4
Driver Version 0.4
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Profile EMBEDDED_PROFILE
Max compute units 1
Max clock frequency 300MHz
Device Partition (core)
Max number of sub-devices 0
Supported partition types None
Max work item dimensions 3
Max work item sizes 12x12x12
Max work group size 12
Preferred work group size multiple 1
Preferred / native vector sizes
char 16 / 16
short 16 / 16
int 16 / 16
long 0 / 0
half 0 / 0 (n/a)
float 16 / 16
double 0 / 0 (n/a)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs No
Round to nearest No
Round to zero Yes
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 32, Little-Endian
Global memory size 79691776 (76MiB)
Error Correction support No
Max memory allocation 79691776 (76MiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 64 bytes
Alignment of base address 512 bits (64 bytes)
Global Memory cache type Read/Write
Global Memory cache size <printDeviceInfo:89: get CL_DEVICE_GLOBAL_MEM_CACHE_SIZE : error -30>
Global Memory cache line 64 bytes
Image support No
Local memory type Global
Local memory size 79691776 (76MiB)
Max constant buffer size 79691776 (76MiB)
Max number of constant args 64
Max size of kernel argument 256
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 1ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
printf() buffer size 0
Built-in kernels
Device Available Yes
Compiler Available Yes
Linker Available Yes
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_nv_pragma_unroll cl_arm_core_id cl_ext_atomic_counters_32 cl_khr_initialize_memory cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_int16
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) OpenCL for the Raspberry Pi VideoCore IV GPU
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [VC4CL]
clCreateContext(NULL, ...) [default] Success [VC4CL]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU
Device Name VideoCore IV GPU
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU
Device Name VideoCore IV GPU
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.11
ICD loader Profile OpenCL 2.1
This is an example of what I'm testing:
import pyopencl as cl   # Import the OpenCL GPU computing API
import numpy            # Import tools to work with numbers
from time import time   # Import access to the current time

a = numpy.random.rand(243, 243).astype(numpy.float32)  # Create random matrices to multiply
b = numpy.random.rand(243, 243).astype(numpy.float32)
c = numpy.empty_like(a)

def gpu_matrix_mul(a, b):
    gpu_context_time = time()
    context = cl.create_some_context()  # Initialize the context (one per computer)
    gpu_context_end_time = time()
    print("GPU context Time: {0} s".format(gpu_context_end_time - gpu_context_time))
    queue = cl.CommandQueue(context, properties=cl.command_queue_properties.PROFILING_ENABLE)  # Instantiate a queue (one per device) with profiling (timing) enabled
    gpu_queue_time = time()
    print("GPU queue Time: {0} s".format(gpu_queue_time - gpu_context_end_time))
    a_buffer = cl.Buffer(context, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=a)
    b_buffer = cl.Buffer(context, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=b)
    c_buffer = cl.Buffer(context, cl.mem_flags.WRITE_ONLY, c.nbytes)
    gpu_buffer_end_time = time()
    print("GPU buffer Time: {0} s".format(gpu_buffer_end_time - gpu_queue_time))
    prg = cl.Program(context, """
    __kernel void multiplymatrices(const unsigned int size,
                                   __global float *matrix1,
                                   __global float *matrix2,
                                   __global float *res) {
        int i = get_global_id(1);
        int j = get_global_id(0);
        res[i + size * j] = 0;
        for (int k = 0; k < size; k++)
        {
            res[i + size * j] += matrix1[i + size * k] * matrix2[k + size * j];
        }
    }
    """).build()
    gpu_build_end_time = time()
    print("GPU build Time: {0} s".format(gpu_build_end_time - gpu_buffer_end_time))
    gpu_start_time = time()  # Get the GPU start time
    event = prg.multiplymatrices(queue, a.shape, (3, 3), numpy.int32(len(a)), a_buffer, b_buffer, c_buffer)
    gpu_end_time = time()
    elapsed = gpu_end_time - gpu_start_time  # Time it took to enqueue the kernel
    print("GPU Kernel 1 Time: {0} s".format(elapsed))
    event.wait()  # Wait until the kernel finishes
    gpu_end_time = time()
    elapsed = gpu_end_time - gpu_start_time  # Time until the kernel actually finished
    print("GPU Kernel 2 Time: {0} s".format(elapsed))
    cl.enqueue_copy(queue, c, c_buffer)  # Read the result back from GPU memory into c
    gpu_end_time = time()  # Get the GPU end time
    print("GPU Time: {0} s".format(gpu_end_time - gpu_start_time))  # Print the time the GPU program took, including the read-back
    return c  # Return the product of the two matrices

gpu_matrix_mul(a, b)  # Call the function that multiplies two matrices on the GPU
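One thing worth noting about these timings: enqueueing an OpenCL kernel is asynchronous, so the gap between "Kernel 1 Time" and "Kernel 2 Time" is expected. The first measures only the enqueue call; event.wait() blocks until the kernel has actually run. The same effect can be sketched with nothing but the Python standard library (a thread pool stands in for the device queue, and the 0.5 s sleep is a made-up workload, not a real measurement of the GPU):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_kernel():
    time.sleep(0.5)   # stand-in for the actual device work
    return 42

with ThreadPoolExecutor(max_workers=1) as pool:
    t0 = time.perf_counter()
    future = pool.submit(fake_kernel)    # "enqueue": returns immediately
    enqueue_elapsed = time.perf_counter() - t0
    result = future.result()             # like event.wait(): blocks until done
    total_elapsed = time.perf_counter() - t0

print(enqueue_elapsed)  # small: only the cost of submitting the work
print(total_elapsed)    # at least 0.5 s: includes the work itself
```

Since the queue is created with PROFILING_ENABLE, the device-side execution time can also be read from `event.profile.end - event.profile.start` (in nanoseconds) after `event.wait()`, which separates kernel execution from host-side overhead.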
This is the output:
GPU context Time: 1.4038372039794922 s
GPU queue Time: 0.0018715858459472656 s
GPU buffer Time: 0.005632877349853516 s
GPU build Time: 0.0621495246887207 s
GPU Kernel 1 Time: 0.016644001007080078 s
GPU Kernel 2 Time: 3.7788493633270264 s
GPU Time: 3.7811059951782227 s
array([[ 60.82688141, 63.95470428, 62.64150238, ..., 63.28399658,
56.93241882, 61.31788254],
[ 59.43152237, 56.5719986 , 58.83155823, ..., 61.03038788,
52.9797554 , 55.83972931],
[ 63.17213821, 60.47645187, 65.15206146, ..., 65.67092896,
58.11833954, 59.12028885],
...,
[ 62.8201561 , 67.18665314, 67.1701889 , ..., 69.14107513,
58.58791733, 64.60624695],
[ 65.78559875, 65.23566437, 68.32820129, ..., 68.67667389,
60.40095901, 62.51589203],
[ 58.60590744, 59.03076172, 60.83581543, ..., 62.88612747,
57.20410156, 59.33882904]], dtype=float32)
The same product, computed with Numpy on the CPU (numpy.matmul(a, b)):
Numpy Time: 0.18232202529907227 s
[[ 60.82718277 63.95497513 62.64178467 ..., 63.2842865 56.93268204
61.31820297]
[ 59.43178177 56.57228088 58.83188629 ..., 61.03063965 52.98002625
55.84001541]
[ 63.17245483 60.47675323 65.15237427 ..., 65.67124176 58.11859894
59.12057877]
...,
[ 62.8204689 67.18702698 67.17053223 ..., 69.14141846 58.58823013
64.60652924]
[ 65.78593445 65.23597717 68.32857513 ..., 68.67705536 60.40128708
62.51620865]
[ 58.60623169 59.03105164 60.83609009 ..., 62.88640976 57.20439148
59.33910751]]
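Part of the gap has nothing to do with the GPU: numpy.matmul dispatches to an optimized BLAS routine, while the kernel above is the textbook triple loop. The difference between those two algorithms is already dramatic on the CPU alone; a minimal sketch (the size 64 is arbitrary, chosen just to keep the naive loop fast enough to run):

```python
import time
import numpy as np

n = 64
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

# Naive triple loop: the same algorithm as the OpenCL kernel above
t0 = time.perf_counter()
c_naive = np.zeros((n, n), dtype=np.float32)
for i in range(n):
    for j in range(n):
        acc = 0.0
        for k in range(n):
            acc += a[i, k] * b[k, j]
        c_naive[i, j] = acc
t_naive = time.perf_counter() - t0

# Optimized BLAS path used by numpy.matmul
t0 = time.perf_counter()
c_blas = np.matmul(a, b)
t_blas = time.perf_counter() - t0

print(t_naive / t_blas)  # typically several orders of magnitude
```

Both produce the same result (up to float32 rounding), so any fair GPU-vs-Numpy comparison is really naive kernel vs. tuned BLAS.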
I can't understand why I'm getting this performance with OpenCL. Any suggestions or help are welcome.
Upvotes: 1
Views: 868
Reputation: 1129
Wildly guessing, it's any of the following: 1) the memory access pattern is suboptimal for the GPU; 2) your matrix multiplication kernel is unvectorized and the compiler is unable to vectorize it (as far as I know, the RPi GPU needs vectorized code to perform reasonably); 3) numpy.matmul is heavily optimized code, while your matrix multiplication kernel is very unoptimized. If you want to see some reasonable GPU code for matrix multiplication, this is a good starting point.
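To illustrate point 2: the VideoCore IV QPUs operate on 16-wide vectors (the clinfo above reports preferred/native float vector size 16), so an inner loop that consumes one float per iteration leaves most of each vector lane idle. The reshape trick below only illustrates the idea in plain NumPy on the CPU, with made-up sizes; it is not VC4CL code:

```python
import numpy as np

x = np.random.rand(256).astype(np.float32)
y = np.random.rand(256).astype(np.float32)

# Scalar-style dot product: one multiply-add per step,
# which is what an unvectorized kernel loop does
acc = np.float32(0.0)
for k in range(x.size):
    acc += x[k] * y[k]

# 16-wide style: one 16-element multiply-add per step,
# then a final reduction of the 16 partial sums
xv = x.reshape(-1, 16)
yv = y.reshape(-1, 16)
partial = np.zeros(16, dtype=np.float32)
for row in range(xv.shape[0]):
    partial += xv[row] * yv[row]   # one 16-wide multiply-add
vec_result = partial.sum()
```

Both loops compute the same dot product, but the second needs 16x fewer iterations; on the VideoCore IV the analogous rewrite would use float16 loads and operations in the kernel.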
Upvotes: 1