Leherenn

Reputation: 540

OpenCL: Move data between __global memory

I am trying to move some data between two __global memory buffers before running a kernel on it.
Here buffer contains data that needs to be written into array, but sadly not contiguously:

__kernel void exchange_2_halo_write(
    __global float2 *array,
    __global float *buffer,
    const unsigned int im,
    const unsigned int jm,
    const unsigned int km
) {
    const unsigned int v_dim = 2;
    unsigned int i, j, k, v, i_buf = 0;

    // Which vector component, ie along v_dim
    for (v = 0; v < v_dim; v++) {
        // top halo
        for (k = 0; k < km; k++) {
            for (i = 0; i < im; i++) {
                ((__global float*)&array[i + k*im*jm])[v] = buffer[i_buf];
                i_buf++;
            }
        }
        // bottom halo
        for (k = 0; k < km; k++) {
            for (i = 0; i < im; i++) {
                ((__global float*)&array[i + k*im*jm + im*(jm-1)])[v] = buffer[i_buf];
                i_buf++;
            }
        }
        // left halo
        for (k = 0; k < km; k++) {
            for (j = 1; j < jm-1; j++) {
                ((__global float*)&array[j*im + k*im*jm])[v] = buffer[i_buf];
                i_buf++;
            }
        }
        // right halo
        for (k = 0; k < km; k++) {
            for (j = 1; j < jm-1; j++) {
                ((__global float*)&array[j*im + k*im*jm + (im-1)])[v] = buffer[i_buf];
                i_buf++;
            }
        }
    }
}

This works really well in C (with a few minor changes), and for the data size I need (im = 150, jm = 150, km = 90, buf_sz = 107280) it runs in about 0.02 s.
I had expected the same code to be slower on the GPU, but not by this much: it actually takes about 90 minutes to do the same thing (that's about 250,000x slower!).

Even a simple straight element-by-element copy takes about 15 minutes, which clearly shows it is not the way to go:

for (i = 0; i < buf_sz; i++) {
    array[i] = buffer[i];
}

In that case, I have seen that I can do something like this:

int xid = get_global_id(0);
array[xid] = buffer[xid];

which seems to work fine/quickly.
However, I do not know how to adapt this to use the conditions I have in the first code.

The top and bottom halo parts have im contiguous elements to transfer to array, so I think those could be transferred easily. Sadly, the left and right halos are not contiguous.

Also, with better code, can I expect to get somewhat close to the CPU time? If it is impossible to do it in, say, under 1 s, it is probably going to be a waste.

Thank you.

Upvotes: 0

Views: 216

Answers (1)

DarkZeros

Reputation: 8410

Before the answer, one remark. When you do a for loop inside a kernel, like this:

for (i = 0; i < buf_sz; i++) {
    array[i] = buffer[i];
}

and you launch, say, 512 work-items, you are doing the whole copy 512 times, not splitting it across 512 threads in parallel. So obviously it is going to be even slower, likely more than 512x slower, since the redundant work-items also contend for the same memory!


That said, you can split it in this way:

2D Global size: km x max(im,jm)

__kernel void exchange_2_halo_write(
    __global float2 *array,
    __global float *buffer,
    const unsigned int im,
    const unsigned int jm
) {
    const unsigned int v_dim = 2;
    const unsigned int k = get_global_id(0);
    const unsigned int i = get_global_id(1);
    const unsigned int km = get_global_size(0);
    // Elements per vector component in the buffer
    const unsigned int blk = 2*km*im + 2*km*(jm-2);

    // Which vector component, ie along v_dim
    for (unsigned int v = 0; v < v_dim; v++) {
        if (i < im) {
            // top halo
            ((__global float*)&array[i + k*im*jm])[v] = buffer[v*blk + k*im + i];
            // bottom halo
            ((__global float*)&array[i + k*im*jm + im*(jm-1)])[v] = buffer[v*blk + km*im + k*im + i];
        }

        if (i > 0 && i < jm-1) {
            // left halo
            ((__global float*)&array[i*im + k*im*jm])[v] = buffer[v*blk + 2*km*im + k*(jm-2) + (i-1)];
            // right halo
            ((__global float*)&array[i*im + k*im*jm + (im-1)])[v] = buffer[v*blk + 2*km*im + km*(jm-2) + k*(jm-2) + (i-1)];
        }
    }
}

Other options are possible, like using local memory, but that is tedious work.

Upvotes: 1
