user1043761

Reputation: 716

Does CUDA have vector operation intrinsics?

Since CUDA can do 64-bit and 128-bit loads/stores, I thought it might also have intrinsics for adding, subtracting, etc. on vector types like float3, using fewer instructions, the way SSE does.

Does CUDA have any such functions?

Upvotes: 4

Views: 2296

Answers (2)

einpoklum

Reputation: 131646

Actually, these days, CUDA does have a few "vector operation intrinsics". At least, it does for half-precision floating point values.

Here's an example in PTX of the most obvious vector intrinsic: vectorized addition, with 2 half-precision floating-point values:


// put some floats in half-precision registers
cvt.rn.f16.f32 h0, f0;
cvt.rn.f16.f32 h1, f1;
cvt.rn.f16.f32 h2, f2;
cvt.rn.f16.f32 h3, f3;

mov.b32  p1, {h0, h1};   // pack two f16 to 32bit f16x2
mov.b32  p2, {h2, h3};   // pack two f16 to 32bit f16x2
add.f16x2  p3, p1, p2;   // SIMD f16x2 addition

See the relevant section of the PTX ISA guide.
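For comparison, here is a rough CUDA C++ counterpart of the PTX above, using the half2 intrinsics from NVIDIA's cuda_fp16.h header (`__floats2half2_rn`, `__hadd2`, `__half22float2`). The kernel and the host helper `add_half2` are names made up for this sketch, error checking is omitted, and it assumes a device of compute capability 5.3 or later:

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Illustrative kernel: converts two float pairs to packed half2 values,
// adds them with a single half2 intrinsic, and converts the result back.
__global__ void add_half2_kernel(float2 a, float2 b, float2* out)
{
    __half2 p1 = __floats2half2_rn(a.x, a.y);  // pack two floats as f16x2
    __half2 p2 = __floats2half2_rn(b.x, b.y);  // pack two floats as f16x2
    __half2 p3 = __hadd2(p1, p2);              // element-wise half2 addition
    *out = __half22float2(p3);                 // unpack back to two floats
}

// Illustrative host helper: runs the kernel on one thread and returns the sum.
float2 add_half2(float2 a, float2 b)
{
    float2* d_out = nullptr;
    float2 h_out = make_float2(0.0f, 0.0f);
    cudaMalloc(&d_out, sizeof(float2));
    add_half2_kernel<<<1, 1>>>(a, b, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Whether the compiler actually emits a single add.f16x2 for `__hadd2` is best confirmed by inspecting the generated PTX for your target architecture.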

Now, it's true that I've demonstrated this at the PTX level. But if NVIDIA hasn't already provided a C++ intrinsic, getting one is at most a matter of wrapping the PTX instruction in a nearly one-line function using inline assembly. See an example here, for the "SIMD video instructions" mentioned by @kunzmi (it's part of my cuda-kat library).
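To give an idea of what such a wrapper looks like, here is a minimal sketch for the add.f16x2 instruction shown earlier. This is not the cuda-kat code; the names `add_f16x2`, `add_f16x2_kernel` and `add_f16x2_on_device` are hypothetical, the packed f16x2 values are passed as raw 32-bit integers, and compute capability 5.3 or later is assumed:

```cuda
#include <cuda_runtime.h>

// Hypothetical wrapper: exposes the PTX add.f16x2 instruction as a device function.
// Each argument holds two half-precision values packed into 32 bits.
__device__ __forceinline__ unsigned int add_f16x2(unsigned int a, unsigned int b)
{
    unsigned int result;
    asm("add.f16x2 %0, %1, %2;" : "=r"(result) : "r"(a), "r"(b));
    return result;
}

// Illustrative kernel and host helper, only here to exercise the wrapper.
__global__ void add_f16x2_kernel(unsigned int a, unsigned int b, unsigned int* out)
{
    *out = add_f16x2(a, b);
}

unsigned int add_f16x2_on_device(unsigned int a, unsigned int b)
{
    unsigned int* d_out = nullptr;
    unsigned int h_out = 0;
    cudaMalloc(&d_out, sizeof(unsigned int));
    add_f16x2_kernel<<<1, 1>>>(a, b, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

In half precision, 1.0 is 0x3C00, 2.0 is 0x4000, 3.0 is 0x4200 and 4.0 is 0x4400, so adding the packed pair (1.0, 2.0) = 0x40003C00 to (3.0, 4.0) = 0x44004200 should give (4.0, 6.0) = 0x46004400.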

Upvotes: 1

user1043761

Reputation: 716

No, it does not. Each thread (as of Kepler) can only run one single-precision floating-point operation at a time. The exception is FMA (fused multiply-add), which performs one multiplication and one addition in a single instruction (z = a*x + y).
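As a small illustration of the FMA case, the standard `fmaf` function can be called from device code to request z = a*x + y as a fused operation. The kernel and helper names below are made up for this sketch, and error checking is omitted:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: computes z = a*x + y with a fused multiply-add.
__global__ void fma_kernel(float a, float x, float y, float* z)
{
    *z = fmaf(a, x, y);
}

// Illustrative host helper: runs the kernel on one thread and returns z.
float fma_on_device(float a, float x, float y)
{
    float* d_z = nullptr;
    float h_z = 0.0f;
    cudaMalloc(&d_z, sizeof(float));
    fma_kernel<<<1, 1>>>(a, x, y, d_z);
    cudaMemcpy(&h_z, d_z, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_z);
    return h_z;
}
```

For example, `fma_on_device(2.0f, 3.0f, 4.0f)` computes 2*3 + 4.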

Upvotes: 1
