cubiclewar

Reputation: 1579

OpenCL - Global Memory reads performing better than local

I have a kernel running on an NVidia GTX 680 whose execution time increased when I switched from using global memory to local memory.

My kernel, which is part of a finite element ray tracer, now loads each element into local memory before processing it. The data for each element is stored in a struct fastTriangle with the following definition:

typedef struct fastTriangle {
    float cx, cy, cz, cw;
    float nx, ny, nz, nd;
    float ux, uy, uz, ud;
    float vx, vy, vz, vd;
} fastTriangle;

I pass an array of these objects to the kernel, which is written as follows (I have removed the irrelevant code for brevity):

__kernel void testGPU(int n_samples, const int n_objects, global const fastTriangle *objects, __local int *x_res, __global int *hits) {
    // Get gid, lid, and lsize

    // Set up random number generator and thread variables

    // Local storage for the two triangles being processed
    __local fastTriangle triangles[2]; 

    for(int i = 0; i < n_objects; i++) {    // Fire ray from each object
        event_t evt = async_work_group_copy((local float*)&triangles[0], (global float*)&objects[i],sizeof(fastTriangle)/sizeof(float),0);

        //Initialise local memory x_res to 0's

        barrier(CLK_LOCAL_MEM_FENCE);
        wait_group_events(1, &evt);      


        Vector wsNormal = { triangles[0].cw*triangles[0].nx, triangles[0].cw*triangles[0].ny, triangles[0].cw*triangles[0].nz};

        for(int j = 0; j < n_samples; j+= 4) {
            // Generate a float4 of random numbers here (rands)

            for(int v = 0; v < 4; v++) {    // For each ray in ray packet
                // Load the first object to be intersected
                evt = async_work_group_copy((local float*)&triangles[1], (global float*)&objects[0],sizeof(fastTriangle)/sizeof(float),0);

                // Some initialising code and calculate ray here
                // Should have ray fully specified at this point;


                for(int w = 0; w < n_objects; w++) {        // Check for intersection against each ray

                    wait_group_events(1, &evt);

                    // Check for intersection against object w


                    float det = wsDir.x*triangles[1].nx + wsDir.y*triangles[1].ny + wsDir.z*triangles[1].nz;
                    float dett = triangles[1].nd - (triangles[0].cx*triangles[1].nx + triangles[0].cy*triangles[1].ny + triangles[0].cz*triangles[1].nz);


                    float detpx = det*triangles[0].cx + dett*wsDir.x;
                    float detpy = det*triangles[0].cy + dett*wsDir.y;
                    float detpz = det*triangles[0].cz + dett*wsDir.z;


                    float detu = detpx*triangles[1].ux + detpy*triangles[1].uy + detpz*triangles[1].uz + det*triangles[1].ud;
                    float detv = detpx*triangles[1].vx + detpy*triangles[1].vy + detpz*triangles[1].vz + det*triangles[1].vd;


                    // Interleaving the copy of the next triangle
                    evt = async_work_group_copy((local float*)&triangles[1], (global float*)&objects[w+1],sizeof(fastTriangle)/sizeof(float),0);

                    // Complete intersection calculations

                } // end for each object intersected

                if(objectNo != -1) atomic_inc(&x_res[objectNo]);
            } // end for sub rays
        } // end for each ray
        barrier(CLK_LOCAL_MEM_FENCE);

        // Add all the local x_res to global array hits


        barrier(CLK_GLOBAL_MEM_FENCE);
    } // end for each object
}

When I first wrote this kernel I did not buffer each object in local memory, and instead just accessed it from global memory, i.e. instead of triangles[0].cx I would use objects[i].cx.
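
For reference, the determinant part of the intersection test in that original global-memory version would have looked roughly like this (a sketch reconstructed from the substitution described above, reusing wsDir and the loop variables i and w from the kernel, not copied from the original code):

    // Sketch of the original global-memory version: the triangle fields are
    // read directly from the __global objects array instead of a local buffer.
    float det  = wsDir.x*objects[w].nx + wsDir.y*objects[w].ny + wsDir.z*objects[w].nz;
    float dett = objects[w].nd - (objects[i].cx*objects[w].nx
                                + objects[i].cy*objects[w].ny
                                + objects[i].cz*objects[w].nz);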

When setting out to optimise, I switched to using local memory as listed above, but then observed an execution time increase of around 25%.

Why would performance be worse when using local memory to buffer the objects instead of directly accessing them in global memory?

Upvotes: 1

Views: 1207

Answers (1)

Tomas

Reputation: 235

Whether local memory makes your kernel run faster really depends on your program. There are two things to consider when using local memory:

  1. You do additional work when copying the data from global to local memory, and from local back to global again.

  2. I see that you call barrier(...) three times; these barriers are performance killers. All work-items in a work-group have to wait at the barrier for all the others, so parallelism is hindered and the work-items no longer run independently.

Local memory is great when you read the same data many times during your computation, but the fast reads and writes have to gain you more performance than the copying and synchronizing cost.
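
For example, local memory tends to pay off when a whole tile of triangles is copied once per work-group and then reused by every work-item for many rays, so the copy and the barrier are amortized over many reads. A minimal sketch of that pattern, reusing the fastTriangle struct and objects array from the question (the tile size of 16 and the loop structure are illustrative assumptions, not taken from the posted kernel):

    __local fastTriangle tile[16];          // illustrative tile size

    for (int base = 0; base < n_objects; base += 16) {
        int count = min(16, n_objects - base);

        // The whole work-group copies one tile of triangles, then waits for the copy.
        event_t e = async_work_group_copy((__local float *)tile,
                                          (__global const float *)&objects[base],
                                          count * sizeof(fastTriangle) / sizeof(float), 0);
        wait_group_events(1, &e);

        // Each work-item tests all of its rays against the whole tile, so every
        // local triangle is read many times for a single copy.
        for (int t = 0; t < count; t++) {
            // ... intersect this work-item's rays with tile[t] ...
        }

        // Make sure all work-items are done reading before the next copy overwrites the tile.
        barrier(CLK_LOCAL_MEM_FENCE);
    }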

Upvotes: 2
