Reputation: 716
I thought that since CUDA can do 64-bit and 128-bit loads/stores, it might have intrinsics for adding/subtracting/etc. vector types like float3 in fewer instructions, the way SSE does.
Does CUDA have any such functions?
Upvotes: 4
Views: 2296
Reputation: 131646
Actually, these days, CUDA does have a few "vector operation intrinsics". At least, it does for half-precision floating point values.
Here's an example, in PTX, of the most obvious vector intrinsic: vectorized addition of two half-precision floating-point values:
// put some floats in half-precision registers
cvt.rn.f16.f32 h0, f0;
cvt.rn.f16.f32 h1, f1;
cvt.rn.f16.f32 h2, f2;
cvt.rn.f16.f32 h3, f3;
mov.b32 p1, {h0, h1}; // pack two f16 to 32bit f16x2
mov.b32 p2, {h2, h3}; // pack two f16 to 32bit f16x2
add.f16x2 p3, p1, p2; // SIMD f16x2 addition
See the relevant section of the PTX ISA guide.
Now, it's true that I've demonstrated this at the PTX level, but getting a proper C++ CUDA intrinsic is at most a matter of wrapping the PTX instruction in an almost-one-liner device function, if NVIDIA hasn't provided one already. See an example here, for the "SIMD video instructions" mentioned by @kunzmi (it's part of my cuda-kat library).
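As a sketch of what such a wrapper looks like in CUDA C++ (NVIDIA does in fact ship this particular operation as the `__hadd2` intrinsic in `cuda_fp16.h`; the hand-rolled `add_f16x2` below is only illustrative, and requires a GPU of compute capability 5.3 or higher for half-precision arithmetic):

```cuda
#include <cuda_fp16.h>

// Illustrative wrapper: lift the PTX add.f16x2 instruction into a
// one-liner device function. In practice, just call __hadd2.
__device__ __half2 add_f16x2(__half2 a, __half2 b) {
    unsigned ra = *reinterpret_cast<unsigned*>(&a);  // f16x2 packed in 32 bits
    unsigned rb = *reinterpret_cast<unsigned*>(&b);
    unsigned rc;
    asm("add.f16x2 %0, %1, %2;" : "=r"(rc) : "r"(ra), "r"(rb));
    return *reinterpret_cast<__half2*>(&rc);
}

// Typical use: one SIMD instruction adds two half-precision pairs.
__global__ void vec_add_half2(const __half2* x, const __half2* y,
                              __half2* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __hadd2(x[i], y[i]);
}
```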
Upvotes: 1
Reputation: 716
No, it does not. Each thread (as of Kepler) can only execute one single-precision floating-point operation at a time, the exception being FMA, which performs one multiplication and one addition in a single instruction (z = a*x + y).
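To illustrate: CUDA's headers don't even define arithmetic operators for float3, so a component-wise helper (the name `add3` is made up here) compiles to three separate scalar adds, while the FMA case really is a single instruction:

```cuda
#include <cuda_runtime.h>

// No SIMD here: three independent scalar FADD instructions per call.
// (add3 is a hypothetical helper name; CUDA defines no operator+ for float3.)
__device__ float3 add3(float3 a, float3 b) {
    return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
}

// The one "two ops, one instruction" case: fused multiply-add.
__device__ float axpy(float a, float x, float y) {
    return fmaf(a, x, y);  // single FFMA instruction computing a*x + y
}
```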
Upvotes: 1