ractiv

Reputation: 752

Time required for OpenCL kernel deletion

I'm encountering unexpected performance with my OpenCL code (more precisely, with boost::compute 1.67.0). For now, I just want to add the elements of two buffers element-wise: c[i] = a[i] + b[i]. I noticed a slowdown compared to an existing SIMD implementation, so I isolated each step to find out which one is time-consuming. Here is my code sample:

    Chrono chrono2;
    chrono2.start();
    Chrono chrono;
    ipReal64 elapsed;

    // creating the OpenCL context and other stuff
    // ...
    
    std::string kernel_src = BOOST_COMPUTE_STRINGIZE_SOURCE(
        __kernel void add_knl(__global const uchar* in1, __global const uchar* in2, __global uchar* out)
        {
            size_t idx = get_global_id(0);
            out[idx] = in1[idx] + in2[idx];
        }
    );

    boost::compute::program* program = new boost::compute::program;
    try {
        chrono.start();
        *program = boost::compute::program::create_with_source(kernel_src, context);
        elapsed = chrono.elapsed();
        std::cout << "Create program : " << elapsed << "s" << std::endl;
        chrono.start();
        program->build();
        elapsed = chrono.elapsed();
        std::cout << "Build program : " << elapsed << "s" << std::endl;
    }
    catch (boost::compute::opencl_error& e) {
        std::cout << "Error building program : " << std::endl << program->build_log() << std::endl << e.what() << std::endl;
        return;
    }

    boost::compute::kernel* kernel = new boost::compute::kernel;
    try {
        chrono.start();
        *kernel = program->create_kernel("add_knl");
        elapsed = chrono.elapsed();
        std::cout << "Create kernel : " << elapsed << "s" << std::endl;
    }
    catch (const boost::compute::opencl_error& e) {
        std::cout << "Error creating kernel : " << std::endl << e.what() << std::endl;
        return;
    }

    try {
        chrono.start();
        // Pass the argument to the kernel
        kernel->set_arg(0, bufIn1);
        kernel->set_arg(1, bufIn2);
        kernel->set_arg(2, bufOut);
        elapsed = chrono.elapsed();
        std::cout << "Set args : " << elapsed << "s" << std::endl;
    }
    catch (const boost::compute::opencl_error& e) {
        std::cout << "Error setting kernel arguments: " << std::endl << e.what() << std::endl;
        return;
    }

    try {

        chrono.start();
        queue.enqueue_1d_range_kernel(*kernel, 0, sizeX*sizeY, 0);
        elapsed = chrono.elapsed();
        std::cout << "Kernel calculation : " << elapsed << "s" << std::endl;
    }
    catch (const boost::compute::opencl_error& e) {
        std::cout << "Error executing kernel : " << std::endl << e.what() << std::endl;
        return;
    }
    
    std::cout << "[Function] Full duration " << chrono2.elapsed() << std::endl;

    chrono.start();
    delete program;
    elapsed = chrono.elapsed();
    std::cout << "Delete program : " << elapsed << "s" << std::endl;

    chrono.start();
    delete kernel;
    elapsed = chrono.elapsed();
    std::cout << "Delete kernel  : " << elapsed << "s" << std::endl;

And here is a sample of the results (I run my program on an NVIDIA GeForce GT 630, with the NVIDIA SDK Toolkit):

Create program           : 0.0013123s
Build program            : 0.0015421s
Create kernel            : 6.6e-06s
Set args                 : 1.7e-06s
Kernel calculation       : 0.0001639s
[Function] Full duration : 0.0077794
Delete program           : 4.1e-06s
Delete kernel            : 0.0879901s

I know my program is simple, and I don't expect the kernel execution to be the most time-consuming step. However, I thought deleting the kernel would take at most a few ms, like creating or building the program does.

Is this normal behaviour?

Thanks

Upvotes: 0

Views: 60

Answers (1)

pmdj

Reputation: 23428

I'll point out that I've never used boost::compute, but it looks like a fairly thin wrapper over OpenCL, so the following should be correct:

Enqueueing the kernel does not wait for it to complete. The enqueue function returns an event, which you can then wait for, or you can wait for all tasks enqueued onto the queue to complete. You are timing neither of those things. What is likely happening is that when you destroy your kernel, it waits for all queued instances which are still pending to complete before returning from the destructor.

Upvotes: 1
