Reputation: 105
clBuildProgram seems to get stuck, without any error message, when trying to build this kind of .cl file:
__local int bar(int a, int b, int c, int d, int e)
{
    return a*b*c*d; // 'e' not used
}

__kernel void foobar(__global int * notusedvariable)
{
    int foo=1;
    for (int a=1; a<=10; a++)
        for (int b=1; b<=10; b++)
            for (int c=1; c<=10; c++)
                for (int d=1; d<=10; d++)
                    for (int e=1; e<=10; e++)
                        foo *= bar(a,b,c,d,e);
}
When I remove the innermost loop and change foo *= bar(a,b,c,d,e); to foo *= bar(a,b,c,d,1); it compiles, so there seems to be some kind of over-optimization or over-precalculation going on. This also happens if I have more loops and some of the variables are taken from get_global_id(...).
What can I do?
I use Fedora Linux 20 and have installed:
opencl-utils-0-12.svn16.fc20.x86_64
opencl-1.2-intel-cpu-3.2.1.16712-1.x86_64
opencl-utils-devel-0-12.svn16.fc20.x86_64
opencl-1.2-base-3.2.1.16712-1.x86_64
The GPU is a GeForce 210, i.e. the cheapest one I could find.
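For completeness, the build call that never returns is essentially just this (a rough sketch; context and program creation are omitted, and program and device are set up as usual):
cl_int err = clBuildProgram(program,   /* program created from the .cl source above */
                            1, &device,
                            NULL,      /* no build options */
                            NULL, NULL);
/* execution never gets past this call, and no error is reported */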
Upvotes: 0
Views: 760
Reputation: 54679
It is not really "stuck". It is just trapped in a hell of attempts to optimize the kernel, primarily by unrolling the fixed-size loops (and, by the way, by finding out that the foo variable is not used at all!).
For example, when the loops a...d are enabled (and e is switched off), the binary that is created for the kernel looks like this:
.entry foobar(
.param .u32 .ptr .global .align 4 foobar_param_0
)
{
.reg .pred %p<4>;
.reg .s32 %r<13>;
mov.u32 %r10, 0;
BB0_1:
add.s32 %r10, %r10, 1;
mov.u32 %r11, 0;
BB0_2:
mov.u32 %r12, 10;
BB0_3:
add.s32 %r12, %r12, -2;
setp.ne.s32 %p1, %r12, 0;
@%p1 bra BB0_3;
add.s32 %r11, %r11, 1;
setp.ne.s32 %p2, %r11, 10;
@%p2 bra BB0_2;
setp.ne.s32 %p3, %r10, 10;
@%p3 bra BB0_1;
ret;
}
You can see that it is not really computing anything - and the compiler already has a hard time finding out that there is actually nothing to do.
Compare this to the output that is generated when you add the line notusedvariable[0]=foo; as the last line of the kernel: now the computations cannot be skipped and optimized away. After quite a while of compiling, it produces this result:
.entry foobar(
.param .u32 .ptr .global .align 4 foobar_param_0
)
{
.reg .pred %p<4>;
.reg .s32 %r<80>;
mov.u32 %r79, 1;
mov.u32 %r73, 0;
mov.u32 %r72, %r73;
BB0_1:
add.s32 %r7, %r73, 1;
add.s32 %r72, %r72, 2;
mov.u32 %r76, 0;
mov.u32 %r74, %r76;
mov.u32 %r73, %r7;
mov.u32 %r75, %r7;
BB0_2:
mov.u32 %r9, %r75;
add.s32 %r74, %r74, %r72;
mov.u32 %r78, 10;
mov.u32 %r77, 0;
BB0_3:
add.s32 %r40, %r9, %r77;
mul.lo.s32 %r41, %r40, %r79;
mul.lo.s32 %r42, %r40, %r41;
add.s32 %r43, %r74, %r77;
mul.lo.s32 %r53, %r42, %r40;
mul.lo.s32 %r54, %r53, %r40;
mul.lo.s32 %r55, %r54, %r40;
mul.lo.s32 %r56, %r55, %r40;
mul.lo.s32 %r57, %r56, %r40;
mul.lo.s32 %r58, %r57, %r40;
mul.lo.s32 %r59, %r58, %r40;
mul.lo.s32 %r60, %r59, %r40;
mul.lo.s32 %r61, %r60, %r43;
mul.lo.s32 %r62, %r61, %r43;
mul.lo.s32 %r63, %r62, %r43;
mul.lo.s32 %r64, %r63, %r43;
mul.lo.s32 %r65, %r64, %r43;
mul.lo.s32 %r66, %r65, %r43;
mul.lo.s32 %r67, %r66, %r43;
mul.lo.s32 %r68, %r67, %r43;
mul.lo.s32 %r69, %r68, %r43;
mul.lo.s32 %r70, %r69, %r43;
mul.lo.s32 %r79, %r70, -180289536;
add.s32 %r77, %r77, %r74;
add.s32 %r78, %r78, -2;
setp.ne.s32 %p1, %r78, 0;
@%p1 bra BB0_3;
add.s32 %r76, %r76, 1;
add.s32 %r30, %r9, %r7;
setp.ne.s32 %p2, %r76, 10;
mov.u32 %r75, %r30;
@%p2 bra BB0_2;
setp.ne.s32 %p3, %r7, 10;
@%p3 bra BB0_1;
ld.param.u32 %r71, [foobar_param_0];
st.global.u32 [%r71], %r79;
ret;
}
Obviously, it has unrolled some of the loops now that it could not optimize them away any more. I assume that when loop "e" is also activated, the time required for this sort of unrolling (or for optimizing away the unused loops) increases at least quadratically. So if you give it a few hours, it might actually finish the compilation as well...
As Tom Fenech already said in https://stackoverflow.com/a/22011454, this problem can be alleviated by passing -cl-opt-disable to clBuildProgram.
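On the host side, that could look roughly like this (a sketch, assuming program and device have already been created; error handling omitted):
/* Sketch: disable all optimizations for this build */
const char *options = "-cl-opt-disable";
cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);
if (err != CL_SUCCESS) {
    /* inspect the build log via clGetProgramBuildInfo if something goes wrong */
}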
Alternatively, you can selectively switch off the unrolling optimization for each loop: when you insert #pragma unroll 1 directly before a for loop, you effectively disable unrolling for that particular loop.
Important: Don't blindly insert the unroll pragma with arbitrary values. Using 1 is safe, but for other values you have to make sure manually that the unrolling does not affect the correctness of the program. See the CUDA Programming Guide, section "B.21. #pragma unroll".
In this case, it seems to be sufficient to insert #pragma unroll 1 before the two innermost loops (d and e) to suppress enough of the unrolling for the program to build quickly.
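Applied to the kernel from the question, this could look roughly as follows (a sketch; only the two innermost loops are annotated):
__kernel void foobar(__global int * notusedvariable)
{
    int foo=1;
    for (int a=1; a<=10; a++)
        for (int b=1; b<=10; b++)
            for (int c=1; c<=10; c++)
                #pragma unroll 1          // keep the 'd' loop rolled
                for (int d=1; d<=10; d++)
                    #pragma unroll 1      // keep the 'e' loop rolled
                    for (int e=1; e<=10; e++)
                        foo *= bar(a,b,c,d,e);
}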
EDIT: sigh prunge was 4 minutes faster... :-(
Upvotes: 2
Reputation: 23268
The Nvidia OpenCL compiler might be performing loop unrolling. If it does that for each of your nested loops, that will result in a lot of code being generated!
There is the cl_nv_pragma_unroll Nvidia-specific OpenCL extension that can be used to gain greater control over loop unrolling. To quote its documentation:
A user may specify that a loop in the source program be unrolled. This is done via a pragma. The syntax of this pragma is as follows
#pragma unroll [unroll-factor]
The pragma unroll may optionally specify an unroll factor. The pragma must be placed immediately before the loop and only applies to that loop.
If unroll factor is not specified then the compiler will try to do complete or full unrolling of the loop. If a loop unroll factor is specified the compiler will perform partial loop unrolling. The loop factor, if specified, must be a compile time non negative integer constant.
A loop unroll factor of 1 means that the compiler should not unroll the loop.
A complete unroll specification has no effect if the trip count of the loop is not compile-time computable.
By default, it sounds like the compiler will fully unroll loops with a low, compile-time trip count (e.g. 10 in your example), but it will likely still do so even when the loops are nested (the unroll heuristics are probably not sophisticated enough to account for nesting).
You could try #pragma unroll 1 to disable unrolling:
int foo=1;
#pragma unroll 1
for (int a=1; a<=10; a++)
    #pragma unroll 1
    for (int b=1; b<=10; b++)
        #pragma unroll 1
        for (int c=1; c<=10; c++)
            #pragma unroll 1
            for (int d=1; d<=10; d++)
                #pragma unroll 1
                for (int e=1; e<=10; e++)
                    foo *= bar(a,b,c,d,e);
You'll also want to enable the extension by putting
#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable
at the top of your OpenCL source file.
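If you want to be defensive about it, you can also check at runtime whether the device actually reports this extension before relying on the pragma (a sketch using clGetDeviceInfo; the fixed buffer size is an arbitrary assumption):
/* Sketch: query the space-separated extension string of the device */
char extensions[4096];
clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS,
                sizeof(extensions), extensions, NULL);
if (strstr(extensions, "cl_nv_pragma_unroll") != NULL) {
    /* the device understands #pragma unroll in OpenCL C */
}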
Upvotes: 0
Reputation: 74685
The calculation that you are doing is going to produce a very large number indeed!
I get the same problem as you on my hardware (an NVIDIA GTX 480), but I don't think this is hardware dependent. You are simply generating a number that is known at compile time and is too large to pre-compute for the int variable type. I changed int to long and the program now builds.
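For reference, that change is just this (a sketch of the kernel from the question; only foo is widened here):
__kernel void foobar(__global int * notusedvariable)
{
    long foo=1;   /* was: int foo=1; */
    for (int a=1; a<=10; a++)
        for (int b=1; b<=10; b++)
            for (int c=1; c<=10; c++)
                for (int d=1; d<=10; d++)
                    for (int e=1; e<=10; e++)
                        foo *= bar(a,b,c,d,e);
}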
I just tried this using the Intel platform, and it compiles fine. You can also make it work on NVIDIA by passing the switch -cl-opt-disable to clBuildProgram. This disables all optimisations - you may have some luck with some of the other compiler switches; see the clBuildProgram reference for details of those.
Upvotes: 0