user3225118
user3225118

Reputation: 51

Variable length array estension using SIMD operation

I would like to do the following array extension using SIMD intrinsic. I have two arrays:

I would like to create a resultant array containing the values: v_i repeated for l_i times, i.e:

result: 10, 10, 10, 20, 20, 30, 40, 40.

How can I compute this using SIMD intrinsic?

Upvotes: 2

Views: 363

Answers (1)

Evgeny Kluev
Evgeny Kluev

Reputation: 24667

This may be optimized by SIMD if input array size is up to 8, output array size up to 16, and bytes as array values. At least SSSE3 is required. Extending this approach to larger arrays/elements is possible but efficiency will quickly degrade.

  1. Compute prefix sum of array lengths. This may be quickly done if you reinterpret byte array of lengths as a single 64-bit (32-bit) word, multiply it by 0x101010101010100, and store the result in SIMD register.
  2. Fill array of indexes (in single SIMD register) with average index (half-size of the array of prefix sums).
  3. Perform binary search for proper index for each byte of index register (in parallel). This may be done by extracting appropriate byte of prefix sum register with PSHUFB instruction, comparing extracted prefix value with byte number using PCMPGTB (and optionally with PCMPEQB), then adding/subtracting half of index range.
  4. (Optionally) fill all unused bytes of index register with 0xff.
  5. Use PSHUFB to fill some register with values from cluster value array indexed by the index register.

Main loop of the algorithm (binary search) contains PSHUFB, PCMPGTB, and a few arithmetical and logical operations. It is executed log(input_array_size) times, most likely 2 or 3 times.

Here is an example:

cluster value:      10 20 30 40
cluster length:      3  2  1  2
prefix sum:          0  3  5  6  8
indexes:             2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
prefix value:        5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
byte number:         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
mask:               ff ff ff ff ff  0  0  0  0  0  0  0  0  0  0  0
indexes:             1  1  1  1  1  3  3  3  3  3  3  3  3  3  3  3
prefix value:        3  3  3  3  3  6  6  6  6  6  6  6  6  6  6  6
byte number:         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
mask:               ff ff ff  0  0 ff  0  0  0  0  0  0  0  0  0  0
indexes:             0  0  0  1  1  2  3  3  3  3  3  3  3  3  3  3
length constrained:  0  0  0  1  1  2  3  3 ff ff ff ff ff ff ff ff
cluster value:      10 10 10 20 20 30 40 40  0  0  0  0  0  0  0  0

Upvotes: 2

Related Questions