Reputation: 78

SSE etc. vector programming (SIMD)

I'm totally new to SSE programming, but have an Intel Core i7 processor.

Basically, I want to take four 32-bit unsigned integers and cube them all (raise them to the power of 3) at once. It is my understanding that the SIMD functionality of SSE and its successors makes this possible, but how in the world do I go about doing it? Preferably in C, but I could manage assembly if necessary.

Edit to make clear my final goal:

Then, I want to add all the cubes together to come up with a single number.

Background: I'm just trying to use SSE to optimize figuring out whether a number is an Armstrong number (a three-digit number that equals the sum of the cubes of its digits). An example is 153 (1 + 125 + 27 = 153). There seems to be no way to do this other than brute force. These are a subset of the narcissistic numbers, which equal the sum of their digits each raised to the power of the number of digits. I'd like to eventually expand it to be more flexible, but to start I'm just doing the Armstrong numbers. As you might imagine, this came up on another site and a few of us are trying to optimize the hell out of it. By taking your ideas and my own research, I came up with this code:

#include <stdio.h>
#include <smmintrin.h>  // SSE 4.1

__m128i vcube(const __m128i v)
{
    return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}


int main(int argc, const char * argv[]) {
    for (unsigned int i = 1; i <= 500; i++) {
        unsigned int firstDigit = i / 100;
        unsigned int secondDigit = (i - firstDigit * 100) / 10;
        unsigned int thirdDigit = (i - firstDigit * 100 - secondDigit * 10);

        __m128i v = _mm_setr_epi32(0, firstDigit, secondDigit, thirdDigit);
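        // Reinterpret the integer cubes as floats so _mm_hadd_ps can be used.
        // The sums come out right only because such small bit patterns are
        // denormal floats and denormals are not being flushed to zero;
        // _mm_hadd_epi32 (SSSE3) would add the lanes as integers directly.
        __m128 v3 = (__m128) vcube(v);

        v3 = _mm_hadd_ps(v3, v3);
        v3 = _mm_hadd_ps(v3, v3);

        if ((unsigned int) _mm_extract_epi32((__m128i) v3, 0) == i)
            printf("%03u is an Armstrong number\n", i);
    }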
    return 0;
}

Note: I had to do some type coercions to get it to compile in some systems (Solaris, at least some Linux).

So this works, but maybe it could be streamlined. Sorry I didn't post the whole task, but I was trying to break it down into steps and I wanted to make sure each digit was correctly cubed.

(END EDIT)

Thank you!

Edit: I guess I should add I'm running Mac OS X Sierra.

EDIT AGAIN:

So, let's say I make all of these unsigned shorts instead of unsigned ints and add more digits. How do I add them together when a short may not be able to hold the sum of all the digits? Is there a way to add them and store the result in a vector of larger elements, if you know what I mean, or in a plain larger number such as a UInt64?

Sorry for all the questions, but like I said I'm totally new at vector processing even though I had access to it since my first Mac G4.
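One possible way to approach the widening question above, as a minimal sketch rather than a definitive answer: interleave the 16-bit lanes with zeros so that each becomes a 32-bit lane, then add. It uses only SSE2 intrinsics, and the helper name hsum_epu16 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: sum eight unsigned 16-bit lanes into one 32-bit value. */
uint32_t hsum_epu16(__m128i v)
{
    const __m128i zero = _mm_setzero_si128();
    /* Interleaving with zero zero-extends each 16-bit lane to 32 bits. */
    __m128i lo = _mm_unpacklo_epi16(v, zero);  /* lanes 0..3 */
    __m128i hi = _mm_unpackhi_epi16(v, zero);  /* lanes 4..7 */
    __m128i s  = _mm_add_epi32(lo, hi);        /* four 32-bit partial sums */
    /* Fold the four partial sums so that lane 0 ends up holding the total. */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return (uint32_t) _mm_cvtsi128_si32(s);
}
```

Eight lanes of at most 65535 each cannot overflow a 32-bit sum, so no 64-bit accumulator is needed for a single vector; the same unpack-with-zero step on 32-bit lanes (_mm_unpacklo_epi32) would widen further if many vectors are accumulated.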

Upvotes: 2

Answers (3)

Paul R

Reputation: 213130

If your input values are in the range 0..1625 (so that the result fits in 32 bits) then you can use _mm_mullo_epi32:

__m128i vcube(const __m128i v)
{
    return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}

Demo:

#include <stdio.h>
#include <smmintrin.h>  // SSE 4.1

__m128i vcube(const __m128i v)
{
    return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}

int main()
{
    __m128i v = _mm_setr_epi32(0, 1, 1000, 1625);
    __m128i v3 = vcube(v);

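    // %vlu is a non-standard vector format extension; this relies on the
    // platform's printf supporting it (hence the -Wno-format-* flags below).
    printf("%vlu => %vlu\n", v, v3);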

    return 0;
}

Compile and test:

$ gcc -Wall -Wno-format-invalid-specifier -Wno-format-extra-args -msse4 vcube.c && ./a.out
0 1 1000 1625 => 0 1 1000000000 4291015625
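To then add the four cubes into a single number, as the question's edit asks, something like the following might work. It is a sketch under assumptions: GCC or Clang (for the target attribute), and inputs small enough that each cube fits in 32 bits. The names hsum_epu32 and is_armstrong3 are hypothetical, not part of any library.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_mullo_epi32 */

/* Assumption: GCC or Clang. The target attribute lets these functions use
   SSE4.1 without -msse4.1; alternatively drop it and pass the flag. */
#define SSE41 __attribute__((target("sse4.1")))

/* Cube four 32-bit lanes; valid while each cube fits in 32 bits. */
SSE41 __m128i vcube(__m128i v)
{
    return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}

/* Add the four 32-bit lanes using integer adds and shuffles only (SSE2). */
unsigned int hsum_epu32(__m128i v)
{
    /* (a,b,c,d) + (c,d,a,b) -> (a+c, b+d, ...) */
    __m128i s = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
    /* (a+c, b+d, ...) + (b+d, a+c, ...) -> lane 0 holds a+b+c+d */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return (unsigned int) _mm_cvtsi128_si32(s);
}

/* Hypothetical helper: nonzero if n (100..999) equals the sum of its cubed digits. */
SSE41 int is_armstrong3(unsigned int n)
{
    __m128i digits = _mm_setr_epi32(0, (int) (n / 100), (int) (n / 10 % 10), (int) (n % 10));
    return hsum_epu32(vcube(digits)) == n;
}
```

Calling is_armstrong3 for every n from 100 to 999 replaces the loop body in the question. The horizontal sum stays in the integer domain, so no casts to __m128 are needed.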

Upvotes: 5

Z boson

Reputation: 33679

For x <= 2642245 (the largest x whose cube still fits in 64 bits) you can compute x*x*x with the foo_SSE function below, which requires SSE4.1. It takes two 32-bit unsigned integers as input, one in the low 32 bits of each 64-bit half of an SSE register, and outputs two 64-bit integers.

#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>

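__m128i foo_SSE(__m128i x) {
  // keeps only the upper 32 bits of each 64-bit lane
  __m128i mask = _mm_set_epi32(-1, 0, -1, 0);
  // copy each input into the upper half of its lane (elements 1 and 3)
  __m128i x2 =_mm_shuffle_epi32(x, 0x80);
  __m128i t0 = _mm_mul_epu32(x,x);       // x^2 as a 64-bit value
  __m128i t1 = _mm_mul_epu32(t0,x);      // lo32(x^2) * x, 64-bit
  __m128i t2 = _mm_mullo_epi32(t0,x2);   // upper element: hi32(x^2) * x, low 32 bits
  __m128i t3 = _mm_and_si128(t2, mask);  // drop the lower elements
  __m128i t4 = _mm_add_epi32(t3, t1);    // add into the upper 32 bits of t1
  return t4;
}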

int main(void) {
  uint64_t k1 = 100000;
  uint64_t k2 = 2642245;
  __m128i x = _mm_setr_epi32(k1, 0, k2, 0);
  uint64_t t[2];
  // t is not guaranteed to be 16-byte aligned, so use the unaligned store
  _mm_storeu_si128((__m128i*)t, foo_SSE(x));
  printf("%20" PRIu64 " ",  t[0]);
  printf("%20" PRIu64 "\n", t[1]);
  printf("%20" PRIu64 " ",  k1*k1*k1);
  printf("%20" PRIu64 "\n", k2*k2*k2);
}

This can probably be improved a bit. I'm a little out of practice.

Upvotes: 3

hroptatyr

Reputation: 4829

To get a quick overview of the three main stages (loading, operating, storing), see the following snippet. For integers e0 and e1:

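#include <emmintrin.h>
__m128i result __attribute__((aligned(16)));
/* _mm_set_epi32 lists elements from 3 down to 0: e0 lands in element 0, e1 in element 2 */
__m128i x = _mm_set_epi32(0, e1, 0, e0);
/* x*x must fit in 32 bits (x <= 65535), since only its low 32 bits are multiplied again */
__m128i cube = _mm_mul_epu32(x, _mm_mul_epu32(x, x));
_mm_store_si128(&result, cube);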

_mm_mul_epu32 takes the even-indexed 32-bit elements (0 and 2) of two __m128i registers, multiplies them pairwise as unsigned values, and puts the two 64-bit products into the result register.

To get them out of there, either access them through a cast or use your compiler's convenience definition of __m128i, e.g. for icc:

printf("%llu %llu\n", result.m128i_i64[0], result.m128i_i64[1]); /* msc style */
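As a self-contained illustration of the load, operate, store stages described above, a sketch along these lines could be used. It relies only on SSE2, assumes both inputs are at most 65535 so that the square still fits in the 32 bits that _mm_mul_epu32 reads, and the helper name cube2_epu32 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: cube a and b (each <= 65535) and write the two
   64-bit results to out[0] and out[1]. */
void cube2_epu32(uint32_t a, uint32_t b, uint64_t out[2])
{
    /* Load: setr lists elements from 0 upwards, so a is element 0, b element 2. */
    __m128i x  = _mm_setr_epi32((int) a, 0, (int) b, 0);
    /* Operate: only the even elements are multiplied, giving 64-bit products. */
    __m128i x2 = _mm_mul_epu32(x, x);    /* {a*a, b*b} */
    __m128i x3 = _mm_mul_epu32(x2, x);   /* low 32 bits of each square times x */
    /* Store: the unaligned store works for any destination address. */
    _mm_storeu_si128((__m128i *) out, x3);
}
```

Writing to a plain uint64_t array avoids depending on compiler-specific members of __m128i, and the values can then be printed with PRIu64 from inttypes.h.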

Note: I'm using the Intel Intrinsics guide for SSE primitives.

Edited for clarity about what the code actually does.

Upvotes: 2
