Dani Grosu
Dani Grosu

Reputation: 554

CUDA - PTX carry propagation

I want to add two 32-bit unsigned integers in CUDA PTX and I also want to take care of the carry propagation. I am using the code below to do that, but the result is not as expected.
Acording to the documentation, the add.cc.u32 d, a, b performs integer addition and writes the carry-out value into the condition code register, that is CC.CF.
On the other hand, addc.cc.u32 d, a, b performs integer addition with carry-in and writes the carry-out value into the condition code register. The semantics of this instruction would be
d = a + b + CC.CF. I also tryed addc.u32 d, a, b with no difference.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>
#include "device_launch_parameters.h"
#include <cuda.h>

typedef unsigned int u32;
#define TRY_CUDA_CALL(x) \
do \
  { \
    cudaError_t err; \
    err = x; \
    if(err != cudaSuccess) \
  { \
    printf("Error %08X: %s at %s in line %d\n", err, cudaGetErrorString(err), __FILE__, __LINE__); \
    exit(err); \
  } \
} while(0)


__device__ u32
__uaddo(u32 a, u32 b) {
    u32 res;
    asm("add.cc.u32 %0, %1, %2; /* inline */ \n\t" 
        : "=r" (res) : "r" (a) , "r" (b));
    return res;
}

__device__ u32
__uaddc(u32 a, u32 b) {
    u32 res;
    asm("addc.cc.u32 %0, %1, %2; /* inline */ \n\t" 
        : "=r" (res) : "r" (a) , "r" (b));
    return res;
}

__global__ void testing(u32* s)
{
    u32 a, b;

    a = 0xffffffff;
    b = 0x2;
    
    s[0] = __uaddo(a,b);
    s[0] = __uaddc(0,0);

}

int main()
{
    u32 *s_dev;
    u32 *s;
    s = (u32*)malloc(sizeof(u32));
    TRY_CUDA_CALL(cudaMalloc((void**)&s_dev, sizeof(u32)));
    testing<<<1,1>>>(s_dev);
    TRY_CUDA_CALL( cudaMemcpy(s, s_dev, sizeof(u32), cudaMemcpyDeviceToHost) );
    
    printf("s = %d;\n",s[0]);
    
    
    return 1;
}

As far as I know, you get a carry if the result doesn't fit in the variable, which happens here and an overflow if the sign bit is corrupted, but I'm working with unsigned values.
The code above tries to add 0xFFFFFFFF to 0x2 and of course the result won't fit on 32-bit, so why I don't get a 1 after __uaddc(0,0) call?

EDIT

Nvidia Geforce GT 520mx
Windows 7 Ultimate, 64-bit
Visual Studio 2012
CUDA 7.0

Upvotes: 2

Views: 1108

Answers (2)

njuffa
njuffa

Reputation: 26085

The only data dependencies affecting an asm() statement are those that are explicitly expressed by the variable bindings. Note that you can bind register operands, but not the condition codes. Since in this code the result of __uaddo(a, b) is immediately being overwritten, the compiler determines that it does not contribute to the observable results, is therefore "dead code" and can be eliminated. This is easily checked by examining the generated machine code (SASS) for a release build with cuobjdump --dump-sass.

If we had slightly different code that does not allow the compiler to eliminate the code for __uaddo() outright, there would still be the issue that the compiler can schedule any instructions it likes between the code generated for __uaddo() and __uaddc(), and such instructions could destroy any setting of the carry flag due to __uaddo().

As a consequence, if one plans to use the carry flag for multi-word arithmetic, both carry-generating and carry-consuming instructions must occur in the same asm() statement. A worked example can be found in this answer that shows how to add 128-bit operands. Alternatively, if two separate asm() statements must be used, one could export the carry flag setting from the earlier one into a C variable, then import it into the subsequent asm() statement from there. I can't think of many situations where this would be practical, as the performance advantage of using the carry flag is likely lost.

Upvotes: 3

Dani Grosu
Dani Grosu

Reputation: 554

So, as @njuffa already said, other instructions from other source code can modify the CC.CF register between the two calls and there is no guarantee for getting the expected value of the register.
As a possible solution the __add32 function can be used:

__device__ uint2 __add32 (u32 a, u32 b)
{
    uint2 res;
    asm ("add.cc.u32      %0, %2, %3;\n\t"
         "addc.u32        %1, 0, 0;\n\t"
         : "=r"(res.x), "=r"(res.y)
         : "r"(a), "r"(b));
    return res;
}

The res.y will have the possible carry and res.x the result of addition.

Upvotes: 1

Related Questions