ARM NEON: how to implement logic like `mm_mask_compress`?

Question

A function named filter_range is designed to retain elements nums[i] for which filter[i] != 0 and eliminate all others. Its logic is as follows.

static size_t filter_range(int* nums, const uint8_t* filter, size_t size) {
    size_t result_offset = 0;
    for (auto i = 0; i < size; i++) {
        if (filter[i]) {
            *(nums + result_offset) = *(nums + i);
            result_offset++;
        }
    }
    return result_offset;
}

For AMD's AVX512, the _mm512_mask_compress_epi instruction can be conveniently utilized to achieve this functionality. For ARM NEON, is it possible to employ SIMD instructions to maximize the acceleration of this function?

Inspired by this article, we could get matches, each four bits of which represents a filter value. Then we could use matches &= (matches - 1) and __builtin_ctzll to iterate through bits, which referred to as Kernighan's algorithm.

However, this is still worse than AVX512 version by using _mm512_mask_compress_epi.


static size_t filter_range2(int* data, const uint8_t* filter, size_t size) {
    constexpr size_t data_type_size = sizeof(int);
    size_t start_offset = 0;
    size_t result_offset = 0;

#if defined(__ARM_NEON__) || defined(__aarch64__)
    std::cout << "__ARM_NEON__" << std::endl;
    constexpr size_t kBatchNums = 128 / (8 * sizeof(uint8_t));
    while (start_offset + kBatchNums < size) {
        uint8x16_t vfilter = vld1q_u8(filter);
        const uint16x8_t non_zero_mask = vreinterpretq_u16_u8(vmvnq_u8(vceqzq_u8(vfilter)));
        const uint8x8_t res = vshrn_n_u16(non_zero_mask, 4);
        // Each matched byte denotes 4-bit 1 (`0b1111`)
        uint64_t matches = vget_lane_u64(vreinterpret_u64_u8(res), 0);

        if (matches == 0) {
            // skip
        } else if (matches == 0xffff'ffff'ffff'ffffull) {
            memmove(data + result_offset, data + start_offset, kBatchNums * data_type_size);
            result_offset += kBatchNums;
        } else {
            matches &= 0x8888'8888'8888'8888ull;
            for (; matches > 0; matches &= matches - 1) {
                uint32_t index = __builtin_ctzll(matches) >> 2;
                *(data + result_offset++) = *(data + start_offset + index);
            }
        }

        start_offset += kBatchNums;
        filter += kBatchNums;
    }
#endif

    for (auto i = start_offset; i < size; ++i) {
        if (filter[i]) {
            *(data + result_offset) = *(data + i);
            result_offset++;
        }
    }

    return result_offset;
}

ARM NEON: how to implement logic like `mm_mask_compress`?

Answers (0)

Related Questions