Shashwat

Reputation: 45

Time measurement for getting speedup of OpenCL code on Intel HD Graphics vs C host code


I'm new to OpenCL and want to compare the performance gain of OpenCL kernels over C code. Can someone please explain which of these two methods is better/correct for profiling OpenCL code when comparing performance against a C reference implementation:

  1. Using QueryPerformanceCounter()/__rdtsc() cycles (called inside a getTime() function):

    ret |= clFinish(command_queue); //Empty the queue
    getTime(&begin);
    ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, NULL);  //Profiling Disabled.
    ret |= clFinish(command_queue);
    getTime(&end);
    g_NDRangePureExecTimeSec = elapsed_time(&begin, &end);      //Performs: (end-begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE)
    
  2. Using event profiling:

    ret  = clFinish(command_queue);  //Empty the queue
    //Note: the queue must be created with CL_QUEUE_PROFILING_ENABLE
    ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, &evt1);
    ret |= clWaitForEvents(1, &evt1);
    ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &begin, NULL);
    ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
    g_NDRangePureExecTimeSec = (cl_double)(end - begin)/1e9;  //profiling counters are in ns
    ret |= clReleaseEvent(evt1);
    

Furthermore, I'm not using a dedicated graphics card; I'm running the following piece of OpenCL code on Intel HD 4600 integrated graphics:

    __kernel void filter_rows(__global float *ip_img,\
                              __global float *op_img, \
                              int width, int height, \
                              int pitch,int N, \
                              __constant float *W)
    {
        __private int i=get_global_id(0); 
        __private int j=get_global_id(1); 
        __private int k;
        __private float a;
        __private int image_offset = N*pitch +N;
        __private int curr_pix = j*pitch + i +image_offset;

        // apply filter
        a  = ip_img[curr_pix-8] * W[0 ];    
        a += ip_img[curr_pix-7] * W[1 ];    
        a += ip_img[curr_pix-6] * W[2 ];    
        a += ip_img[curr_pix-5] * W[3 ];    
        a += ip_img[curr_pix-4] * W[4 ];    
        a += ip_img[curr_pix-3] * W[5 ];    
        a += ip_img[curr_pix-2] * W[6 ];    
        a += ip_img[curr_pix-1] * W[7 ];    
        a += ip_img[curr_pix-0] * W[8 ];    
        a += ip_img[curr_pix+1] * W[9 ];    
        a += ip_img[curr_pix+2] * W[10];    
        a += ip_img[curr_pix+3] * W[11];    
        a += ip_img[curr_pix+4] * W[12];    
        a += ip_img[curr_pix+5] * W[13];    
        a += ip_img[curr_pix+6] * W[14];    
        a += ip_img[curr_pix+7] * W[15];    
        a += ip_img[curr_pix+8] * W[16];
        // write output
        op_img[curr_pix] = (float)a;
    }

And similar code is used for column-wise processing. I'm observing a gain (OpenCL vs. optimized vectorized C reference) of around 11x using method 1 and around 16x using method 2. However, I've seen people claiming gains on the order of 200-300x when using dedicated graphics cards.

So my questions are:

  1. What magnitude of gain can I expect if I run the same code on a dedicated graphics card? Will it be of a similar order, or will the dedicated card outperform Intel HD Graphics?
  2. Can I map the warp and thread concepts from CUDA to Intel HD Graphics (i.e. the number of threads executing in parallel)?

Upvotes: 2

Views: 803

Answers (2)

vijayky88

Reputation: 134

You can't directly compare performance across different vendors; a basic comparison and expectation can be made from the number of parallel threads running multiplied by their frequency.

You have a processor with Intel HD 4600 graphics: it should have 20 Execution Units (EUs), each EU running 7 hardware threads, each thread capable of executing SIMD8, SIMD16 or SIMD32 instructions, with each SIMD lane corresponding to one work item (WI) in OpenCL speak.

SIMD16 is typical for simple kernels, like the one you are trying to optimize, so we are talking about 20*7*16=2240 work items executing in parallel. Keep in mind that each work item is capable of processing vector data types, e.g. float4, so you should definitely try rewriting your kernel to take advantage of them. I hope this also helps you compare with NVidia's offerings.

Upvotes: 1

VAndrei

Reputation: 5570

I'm observing gain around 11x using method 1 and around 16x using method 2.

This looks suspicious. You are using high-resolution counters in both cases, so I suspect your input size is too small and generates high run-to-run variation. Event-based measurement is slightly more accurate because it excludes some OS and application overhead from the measurement. Normally the difference is very small, but when your kernel duration is very short, the choice of measurement methodology matters.

What magnitude of gain can I expect if I run the same code on a dedicated graphics card? Will it be of a similar order, or will the dedicated card outperform Intel HD Graphics?

That depends very much on the card's capabilities. While Intel HD Graphics is a good GPU for office work, movies and some games, it cannot compare to a high-end dedicated graphics card. Consider that a dedicated card has a much higher power envelope, a much larger die area and far more computing resources, so dedicated cards are expected to show greater speedups. Your GPU has around 600 GFLOPS peak performance, while a discrete card can reach 3000 GFLOPS, so roughly you could expect yours to be about 5 times slower than a discrete one. However, pay attention to what people are comparing when they claim 300x speedups: if they compare against an old-generation CPU, they might be right, but a new-generation i7 CPU can really close the gap.

Can I map the warp and thread concepts from CUDA to Intel HD Graphics (i.e. the number of threads executing in parallel)?

Intel HD Graphics does not have warps; warps are closely tied to CUDA hardware. Basically, a warp is the same instruction dispatched by a warp scheduler to execute on 32 CUDA cores. However, OpenCL is very similar to CUDA, so you can launch a high number of threads that will execute in parallel on your graphics card's compute units. When programming your integrated GPU, it's best to forget about warps and instead know how many compute units your card has: your code will run on several threads in parallel across those compute units, and each compute unit can further parallelize execution in a SIMD fashion, for example. In other words, your code will look very similar to CUDA code, but it will be parallelized according to the compute units available in the integrated card, and the optimization techniques for CUDA differ from those for Intel HD Graphics.

Upvotes: 3
