Reputation: 121
I am trying to compile a simple program that uses __m128i with CUDA, but when I compile it with nvcc (nvcc test.cu -o test) on Linux, I get the error:
"__m128i" is a vector, which is not supported in device code
This is the program I am trying to compile
#include <stdio.h>
#include <emmintrin.h>

__global__ void hello(){
    printf("%d\n", threadIdx.x);
    __m128i x;   // this declaration is what triggers the error
}

int main(){
    hello<<<3,3>>>();
    cudaDeviceSynchronize();   // wait for the kernel so the printf output appears
}
When I type nvcc --version, I get: Cuda compilation tools, release 10.2, V10.2.89
I actually ran into this problem on a larger scale while trying to implement some C++ code with CUDA; that C++ code uses __m128i, and what I have shown is a minimal version of the problem I am facing. So I am wondering if there is a way to use __m128i in a CUDA kernel, or some other alternative. Thanks
Upvotes: 0
Views: 592
Reputation: 72372
I am wondering if there is a way to use __m128i in a CUDA kernel ...
There is not. CUDA has built-in 16-byte vector types (int4, longlong2, and so on) with the same size and alignment as __m128i, but host SIMD vector types like __m128i are not supported in device code.
or some other alternative
As noted above, there are 16-byte-aligned types which can be used to load and store data, but there is no native 128-bit SIMD intrinsic support in NVIDIA GPUs. The SIMD instructions that do exist are limited to 32-bit types.
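For illustration, a minimal self-contained sketch (the kernel name and sizes are just placeholders): int4 gets you 16-byte loads and stores much as __m128i would, but the arithmetic on the four lanes is ordinary scalar code.

#include <cstdio>

// int4 is 16-byte aligned, so each thread does one 16-byte load/store,
// but there is no 128-bit SIMD add: the lanes are summed with scalar adds.
__global__ void add_vec16(const int4* a, const int4* b, int4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int4 va = a[i];                                  // one 16-byte load
        int4 vb = b[i];
        out[i] = make_int4(va.x + vb.x, va.y + vb.y,
                           va.z + vb.z, va.w + vb.w);    // one 16-byte store
    }
}

int main()
{
    const int n = 4;                                     // 4 x int4 = 16 ints
    int4 *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(int4));
    cudaMallocManaged(&b, n * sizeof(int4));
    cudaMallocManaged(&c, n * sizeof(int4));
    for (int i = 0; i < n; ++i) {
        a[i] = make_int4(i, i, i, i);
        b[i] = make_int4(1, 2, 3, 4);
    }
    add_vec16<<<1, n>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("%d %d %d %d\n", c[0].x, c[0].y, c[0].z, c[0].w);
    cudaFree(a); cudaFree(b); cudaFree(c);
}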
CPU SIMD is done with short vectors like the 128-bit __m128i. GPU SIMD is done across warps and is not usually software-visible in the way CPU SIMD with __m128i is; you just write it as scalar code.
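As a rough sketch (array names made up), a hand-vectorized SSE loop and its typical CUDA port look like this; the __m128i body simply becomes scalar per-thread code, and the warp supplies the parallelism:

// CPU version (SSE2): four 32-bit adds per __m128i operation.
//   for (int i = 0; i < n; i += 4) {
//       __m128i va = _mm_loadu_si128((const __m128i*)&a[i]);
//       __m128i vb = _mm_loadu_si128((const __m128i*)&b[i]);
//       _mm_storeu_si128((__m128i*)&c[i], _mm_add_epi32(va, vb));
//   }

// GPU version: one element per thread, plain scalar code; the "vector width"
// is the warp of 32 threads executing this together.
__global__ void add_int32(const int* a, const int* b, int* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}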
Code manually vectorized with __m128i can't be compiled for a GPU. If it has a scalar fallback version, use that, e.g. by making sure __SSE2__ isn't defined (#undef __SSE2__) so the scalar path is selected.
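Rather than #undef, another way to keep both paths in one source file (a sketch, with made-up function names) is to guard the __m128i code so it never reaches device compilation and mark the scalar fallback __host__ __device__ so both sides can use it:

#include <emmintrin.h>   // only the host-only path below uses it

// Scalar fallback: plain C, so it can be compiled for both host and device.
__host__ __device__ inline void add4_scalar(const int* a, const int* b, int* out)
{
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}

// Host-only fast path: excluded during device compilation, where __CUDA_ARCH__ is defined.
#if defined(__SSE2__) && !defined(__CUDA_ARCH__)
inline void add4_sse2(const int* a, const int* b, int* out)
{
    __m128i va = _mm_loadu_si128((const __m128i*)a);
    __m128i vb = _mm_loadu_si128((const __m128i*)b);
    _mm_storeu_si128((__m128i*)out, _mm_add_epi32(va, vb));
}
#endif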
(CUDA SIMD within 32-bit chunks lets you get more use out of the 32-bit wide ALUs in each GPU execution unit if you have narrow data, like pairs of 16-bit integers, or 4x 8-bit integers. So if your SSE intrinsics code uses _mm_add_epi8
, you might still benefit from manual vectorization in CUDA with its 4x 8-bit operations instead of 16x 8-bit.)
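For example (a sketch with placeholder names), the __vadd4 device intrinsic is the closest analogue of _mm_add_epi8: it adds four packed 8-bit lanes held in one 32-bit register, with the same wrapping behaviour:

// Each thread processes one 32-bit word holding four 8-bit lanes,
// versus 16 lanes per __m128i on the CPU.
__global__ void add_u8x4(const unsigned int* a, const unsigned int* b,
                         unsigned int* c, int n)   // n = number of 32-bit words
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = __vadd4(a[i], b[i]);   // packed 8-bit add, wraps like _mm_add_epi8
}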
Upvotes: 3