Reputation: 95
My question is:
I have this code:
#pragma acc parallel loop
for(i=0; i<bands; i++)
{
    #pragma acc loop seq
    for(j=0; j<lines_samples; j++)
        r_m[i] += image_vector[i*lines_samples+j];
    r_m[i] /= lines_samples;
    #pragma acc loop
    for(j=0; j<lines_samples; j++)
        R_o[i*lines_samples+j] = image_vector[i*lines_samples+j] - r_m[i];
}
I'm trying to translate it to SYCL. My idea was to replace the outer parallel loop with a kernel (the typical "queue.submit(...)") parallelised over "i". But then I realized that the first loop inside the big one must be executed serially. Is there a way to tell SYCL to execute a loop inside a kernel in serial?
I can't think of another way to solve this, as I need both the outer loop and the last inner loop to run in parallel.
Thank you in advance.
Upvotes: 1
Views: 237
Reputation: 897
You have a couple of options here. The first one, as you suggest, is to create a kernel with a 1D range over i:
q.submit([&](sycl::handler &cgh){
    // The kernel lambda must capture by value; image_vector, r_m and R_o
    // are assumed to be USM pointers (or replace them with accessors).
    cgh.parallel_for(sycl::range<1>(bands), [=](sycl::item<1> i){
        for(int j = 0; j < lines_samples; j++)
            r_m[i] += image_vector[i*lines_samples+j];
        r_m[i] /= lines_samples;
        for(int j = 0; j < lines_samples; j++)
            R_o[i*lines_samples+j] = image_vector[i*lines_samples+j] - r_m[i];
    });
});
Note that for the inner loops, the kernel will just iterate serially over j in both cases. SYCL doesn't apply any magic to your loops like a #pragma would - loops are loops.
This is fine, but you're missing out on a higher degree of parallelism, which could be achieved by writing a kernel with a 2D range over i and j: sycl::range<2>(bands, lines_samples). This can be made to work relatively easily, assuming your first loop is doing what I think it's doing, which is computing the average of a line of an image. In that case you don't really need a serial loop - you can achieve the same thing using work-groups.
Work-groups in SYCL have access to fast on-chip shared memory and are able to synchronise. This means you can have a work-group load all the pixels of a line of your image, collaboratively compute the average of that line, synchronise, and then have each member of the work-group use the computed average to produce a single value of R_o, your output. This approach maximises the available parallelism.
The collaborative reduction operation to find the average of a given line is probably best achieved through tree reduction. This guide walks through the work-group reduction approach:
https://developer.codeplay.com/products/computecpp/ce/guides/sycl-for-cuda-developers/examples
Upvotes: 2