Reputation: 95
My question is:
I have this code:
#pragma acc parallel loop
for(i=0; i<bands; i++)
{
    #pragma acc loop seq
    for(j=0; j<lines_samples; j++)
        r_m[i] += image_vector[i*lines_samples+j];
    r_m[i] /= lines_samples;
    #pragma acc loop
    for(j=0; j<lines_samples; j++)
        R_o[i*lines_samples+j] = image_vector[i*lines_samples+j] - r_m[i];
}
I'm trying to translate it to SYCL. My idea was to replace the outer parallel loop with a kernel (the typical "queue.submit(...)") parallelised over "i". But then I realized that the first loop inside the big one must be executed serially. Is there a way to tell SYCL to execute a loop inside a kernel in serial?
I can't think of another way to solve this, as I need both the outer loop and the last inner loop to run in parallel.
Thank you in advance.
Upvotes: 1
Views: 237
Reputation: 897
You have a couple of options here. The first one, as you suggest, is to create a kernel with a 1D range over i:
q.submit([&](sycl::handler &cgh){
    // The kernel lambda must capture by value; image_vector, r_m and R_o
    // are assumed to be USM pointers (or replace them with accessors).
    cgh.parallel_for(sycl::range<1>(bands), [=](sycl::item<1> i){
        for(int j = 0; j < lines_samples; j++)
            r_m[i] += image_vector[i*lines_samples+j];
        r_m[i] /= lines_samples;
        for(int j = 0; j < lines_samples; j++)
            R_o[i*lines_samples+j] = image_vector[i*lines_samples+j] - r_m[i];
    });
});
Note that for the inner loops, the kernel will just iterate serially over j in both cases. SYCL doesn't apply any magic to your loops like a #pragma would - loops are loops.
This is fine, but you're missing out on a higher degree of parallelism, which could be achieved by writing a kernel with a 2D range over i and j: sycl::range<2>(bands, lines_samples). This can be made to work relatively easily, assuming your first loop is doing what I think it's doing, which is computing the average of a line of an image. In that case you don't really need a serial loop - you can achieve the same thing using work-groups.
Work-groups in SYCL have access to fast on-chip shared memory and are able to synchronise. This means you can have a work-group load all the pixels of a line of your image, collaboratively compute the average of that line, synchronise, and then have each member of the work-group use the computed average to produce a single value of R_o, your output. This approach maximises the available parallelism.
The collaborative reduction operation to find the average of a given line is probably best achieved through tree reduction. This guide walks through the work-group reduction approach:
https://developer.codeplay.com/products/computecpp/ce/guides/sycl-for-cuda-developers/examples
Upvotes: 2