Reputation: 11
I'm new to OpenCL/PyOpenCL and I'm trying to understand how OpenCL on the Raspberry Pi GPU (VideoCore IV) compares to Numpy (on the CPU) for vector and matrix multiplication on my hardware.
I'm using VC4CL as implementation of OpenCL 1.2 for the VideoCore IV GPU. (https://github.com/doe300/VC4CL)
I'm getting terrible results and I can't understand why. I don't know whether the problem is configuration related or code related, or whether I simply cannot beat Numpy with OpenCL for this kind of workload.
This is my clinfo output:
Number of platforms 1
Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU
Platform Vendor doe300
Platform Version OpenCL 1.2 VC4CL 0.4
Platform Profile EMBEDDED_PROFILE
Platform Extensions cl_khr_il_program cl_khr_spir cl_khr_create_command_queue cl_altera_device_temperature cl_altera_live_object_tracking cl_khr_icd cl_vc4cl_performance_counters
Platform Extensions function suffix VC4CL
Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU
Number of devices 1
Device Name VideoCore IV GPU
Device Vendor Broadcom
Device Vendor ID 0xa5c
Device Version OpenCL 1.2 VC4CL 0.4
Driver Version 0.4
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Profile EMBEDDED_PROFILE
Max compute units 1
Max clock frequency 300MHz
Device Partition (core)
Max number of sub-devices 0
Supported partition types None
Max work item dimensions 3
Max work item sizes 12x12x12
Max work group size 12
Preferred work group size multiple 1
Preferred / native vector sizes
char 16 / 16
short 16 / 16
int 16 / 16
long 0 / 0
half 0 / 0 (n/a)
float 16 / 16
double 0 / 0 (n/a)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs No
Round to nearest No
Round to zero Yes
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 32, Little-Endian
Global memory size 79691776 (76MiB)
Error Correction support No
Max memory allocation 79691776 (76MiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 64 bytes
Alignment of base address 512 bits (64 bytes)
Global Memory cache type Read/Write
Global Memory cache size <printDeviceInfo:89: get CL_DEVICE_GLOBAL_MEM_CACHE_SIZE : error -30>
Global Memory cache line 64 bytes
Image support No
Local memory type Global
Local memory size 79691776 (76MiB)
Max constant buffer size 79691776 (76MiB)
Max number of constant args 64
Max size of kernel argument 256
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 1ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
printf() buffer size 0
Built-in kernels
Device Available Yes
Compiler Available Yes
Linker Available Yes
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_nv_pragma_unroll cl_arm_core_id cl_ext_atomic_counters_32 cl_khr_initialize_memory cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_int16
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) OpenCL for the Raspberry Pi VideoCore IV GPU
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [VC4CL]
clCreateContext(NULL, ...) [default] Success [VC4CL]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU
Device Name VideoCore IV GPU
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU
Device Name VideoCore IV GPU
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.11
ICD loader Profile OpenCL 2.1
This is an example of what I'm testing:
import pyopencl as cl   # Import the OpenCL GPU computing API
import numpy            # Import tools to work with numbers
from time import time   # Import access to the current time

a = numpy.random.rand(243, 243).astype(numpy.float32)  # Create random matrices to multiply
b = numpy.random.rand(243, 243).astype(numpy.float32)
c = numpy.empty_like(a)

def gpu_matrix_mul(a, b):
    gpu_context_time = time()
    context = cl.create_some_context()  # Initialize the context (one per computer)
    gpu_context_end_time = time()
    print("GPU context Time: {0} s".format(gpu_context_end_time - gpu_context_time))
    queue = cl.CommandQueue(context, properties=cl.command_queue_properties.PROFILING_ENABLE)  # Instantiate a queue (one per device) with profiling (timing) enabled
    gpu_queue_time = time()
    print("GPU queue Time: {0} s".format(gpu_queue_time - gpu_context_end_time))
    a_buffer = cl.Buffer(context, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=a)
    b_buffer = cl.Buffer(context, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=b)
    c_buffer = cl.Buffer(context, cl.mem_flags.WRITE_ONLY, c.nbytes)
    gpu_buffer_end_time = time()
    print("GPU buffer Time: {0} s".format(gpu_buffer_end_time - gpu_queue_time))
    prg = cl.Program(context, """
    __kernel void multiplymatrices(const unsigned int size,
                                   __global float *matrix1,
                                   __global float *matrix2,
                                   __global float *res) {
        int i = get_global_id(1);
        int j = get_global_id(0);
        res[i + size * j] = 0;
        for (int k = 0; k < size; k++)
        {
            res[i + size * j] += matrix1[i + size * k] * matrix2[k + size * j];
        }
    }
    """).build()
    gpu_build_end_time = time()
    print("GPU build Time: {0} s".format(gpu_build_end_time - gpu_buffer_end_time))
    gpu_start_time = time()  # Get the GPU start time
    event = prg.multiplymatrices(queue, a.shape, (3, 3), numpy.int32(len(a)), a_buffer, b_buffer, c_buffer)
    gpu_end_time = time()
    elapsed = gpu_end_time - gpu_start_time  # Time it took to enqueue the kernel
    print("GPU Kernel 1 Time: {0} s".format(elapsed))
    event.wait()  # Wait until the kernel finishes
    gpu_end_time = time()
    elapsed = gpu_end_time - gpu_start_time  # Time until the kernel actually finished
    print("GPU Kernel 2 Time: {0} s".format(elapsed))
    cl.enqueue_copy(queue, c, c_buffer)  # Read the result back from GPU memory into c
    gpu_end_time = time()  # Get the GPU end time
    print("GPU Time: {0} s".format(gpu_end_time - gpu_start_time))  # Print the time the GPU program took, including the read-back
    return c  # Return the product of the two matrices

gpu_matrix_mul(a, b)  # Call the function that multiplies two matrices on the GPU
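One thing worth noting about these timings: enqueueing an OpenCL kernel is asynchronous, so the gap between "Kernel 1 Time" and "Kernel 2 Time" is expected. The first measures only the enqueue call; event.wait() blocks until the kernel has actually run. The same effect can be sketched with nothing but the Python standard library (a thread pool stands in for the device queue, and the 0.5 s sleep is a made-up workload, not a real measurement of the GPU):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_kernel():
    time.sleep(0.5)   # stand-in for the actual device work
    return 42

with ThreadPoolExecutor(max_workers=1) as pool:
    t0 = time.perf_counter()
    future = pool.submit(fake_kernel)    # "enqueue": returns immediately
    enqueue_elapsed = time.perf_counter() - t0
    result = future.result()             # like event.wait(): blocks until done
    total_elapsed = time.perf_counter() - t0

print(enqueue_elapsed)  # small: only the cost of submitting the work
print(total_elapsed)    # at least 0.5 s: includes the work itself
```

Since the queue is created with PROFILING_ENABLE, the device-side execution time can also be read from `event.profile.end - event.profile.start` (in nanoseconds) after `event.wait()`, which separates kernel execution from host-side overhead.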
This is the output:
GPU context Time: 1.4038372039794922 s
GPU queue Time: 0.0018715858459472656 s
GPU buffer Time: 0.005632877349853516 s
GPU build Time: 0.0621495246887207 s
GPU Kernel 1 Time: 0.016644001007080078 s
GPU Kernel 2 Time: 3.7788493633270264 s
GPU Time: 3.7811059951782227 s
array([[ 60.82688141, 63.95470428, 62.64150238, ..., 63.28399658,
56.93241882, 61.31788254],
[ 59.43152237, 56.5719986 , 58.83155823, ..., 61.03038788,
52.9797554 , 55.83972931],
[ 63.17213821, 60.47645187, 65.15206146, ..., 65.67092896,
58.11833954, 59.12028885],
...,
[ 62.8201561 , 67.18665314, 67.1701889 , ..., 69.14107513,
58.58791733, 64.60624695],
[ 65.78559875, 65.23566437, 68.32820129, ..., 68.67667389,
60.40095901, 62.51589203],
[ 58.60590744, 59.03076172, 60.83581543, ..., 62.88612747,
57.20410156, 59.33882904]], dtype=float32)
The same product, computed with Numpy on the CPU (numpy.matmul(a, b)):
Numpy Time: 0.18232202529907227 s
[[ 60.82718277 63.95497513 62.64178467 ..., 63.2842865 56.93268204
61.31820297]
[ 59.43178177 56.57228088 58.83188629 ..., 61.03063965 52.98002625
55.84001541]
[ 63.17245483 60.47675323 65.15237427 ..., 65.67124176 58.11859894
59.12057877]
...,
[ 62.8204689 67.18702698 67.17053223 ..., 69.14141846 58.58823013
64.60652924]
[ 65.78593445 65.23597717 68.32857513 ..., 68.67705536 60.40128708
62.51620865]
[ 58.60623169 59.03105164 60.83609009 ..., 62.88640976 57.20439148
59.33910751]]
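Part of the gap has nothing to do with the GPU: numpy.matmul dispatches to an optimized BLAS routine, while the kernel above is the textbook triple loop. The difference between those two algorithms is already dramatic on the CPU alone; a minimal sketch (the size 64 is arbitrary, chosen just to keep the naive loop fast enough to run):

```python
import time
import numpy as np

n = 64
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

# Naive triple loop: the same algorithm as the OpenCL kernel above
t0 = time.perf_counter()
c_naive = np.zeros((n, n), dtype=np.float32)
for i in range(n):
    for j in range(n):
        acc = 0.0
        for k in range(n):
            acc += a[i, k] * b[k, j]
        c_naive[i, j] = acc
t_naive = time.perf_counter() - t0

# Optimized BLAS path used by numpy.matmul
t0 = time.perf_counter()
c_blas = np.matmul(a, b)
t_blas = time.perf_counter() - t0

print(t_naive / t_blas)  # typically several orders of magnitude
```

Both produce the same result (up to float32 rounding), so any fair GPU-vs-Numpy comparison is really naive kernel vs. tuned BLAS.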
I can't understand why I'm getting this performance with OpenCL. Any suggestions or help are welcome.
Upvotes: 1
Views: 868
Reputation: 1129
Wildly guessing, it's any of the following: 1) the memory access pattern is suboptimal for the GPU; 2) your matrix multiplication kernel is unvectorized and the compiler is unable to vectorize it (as far as I know, the RPi GPU needs vectorized code to perform reasonably); 3) numpy.matmul is heavily optimized code, while your matrix multiplication kernel is very unoptimized. If you want to see some reasonable GPU code for matrix multiplication, this is a good starting point.
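To illustrate point 2: the VideoCore IV QPUs operate on 16-wide vectors (the clinfo above reports preferred/native float vector size 16), so an inner loop that consumes one float per iteration leaves most of each vector lane idle. The reshape trick below only illustrates the idea in plain NumPy on the CPU, with made-up sizes; it is not VC4CL code:

```python
import numpy as np

x = np.random.rand(256).astype(np.float32)
y = np.random.rand(256).astype(np.float32)

# Scalar-style dot product: one multiply-add per step,
# which is what an unvectorized kernel loop does
acc = np.float32(0.0)
for k in range(x.size):
    acc += x[k] * y[k]

# 16-wide style: one 16-element multiply-add per step,
# then a final reduction of the 16 partial sums
xv = x.reshape(-1, 16)
yv = y.reshape(-1, 16)
partial = np.zeros(16, dtype=np.float32)
for row in range(xv.shape[0]):
    partial += xv[row] * yv[row]   # one 16-wide multiply-add
vec_result = partial.sum()
```

Both loops compute the same dot product, but the second needs 16x fewer iterations; on the VideoCore IV the analogous rewrite would use float16 loads and operations in the kernel.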
Upvotes: 1