Reputation: 132128
I have a long chunk of memory, say, 256 KiB or longer. I want to count the number of 1 bits in this entire chunk, or in other words: Add up the "population count" values for all bytes.
I know that AVX-512 has a VPOPCNTDQ instruction which counts the number of 1 bits in each consecutive 64 bits within a 512-bit vector, and if I'm not mistaken it should be possible to issue one of these every cycle (if an appropriate SIMD vector register is available) - but I don't have any experience writing SIMD code (I'm more of a GPU guy). Also, I'm not 100% sure about compiler support for AVX-512 targets.
Still, most CPUs don't (fully) support AVX-512, whereas AVX-2 is widely available. I haven't been able to find a narrower-than-512-bit vectorized instruction similar to VPOPCNTDQ, so even in theory I'm not sure how to count bits fast on AVX-2-capable CPUs; maybe something like this exists and I just missed it?
Anyway, I'd appreciate a short C/C++ function - either using some intrinsics-wrapper library or with inline assembly - for each of the two instruction sets. The signature is:
uint64_t count_bits(void* ptr, size_t size);
Notes:
Upvotes: 9
Views: 8066
Reputation: 365507
Wojciech Muła's big-array popcnt functions look optimal except for the scalar cleanup loops. (See @einpoklum's answer for details on the main loops).
A 256-entry LUT you use only a couple of times at the end is likely to cache-miss, and isn't optimal for more than 1 byte at a time even if the cache were hot. I believe all AVX2 CPUs have hardware popcnt, and we can easily isolate the last up-to-8 bytes that haven't been counted yet to set us up for a single popcnt.
As usual with SIMD algorithms, it often works well to do a full-width load that ends at the last byte of the buffer. But unlike with a vector register, variable-count shifts of the full integer register are cheap (especially with BMI2). Popcnt doesn't care where the bits are, so we can just use a shift instead of needing to construct an AND mask or whatever.
// untested
// ptr points at the first byte that hasn't been counted yet
uint64_t final_bytes = reinterpret_cast<const uint64_t*>(end)[-1] >> (8*(8 - (end-ptr)));  // shift out the low bytes that were already counted
total += _mm_popcnt_u64( final_bytes );
// Careful, this could read bytes before the start of a small buffer.
The pointer subtraction needs to be done on char* pointers, since we scale it by 8 bits per byte. (Or on void* for compilers like GCC which allow pointer math on void*.)

If the last 8 bytes of the buffer aren't aligned by 8 and alias-compatible with uint64_t, you actually need to use memcpy(&final_bytes, end-8, sizeof(final_bytes)) or an __attribute__((aligned(1),may_alias)) typedef for uint64_t; otherwise you have alignment and/or strict-aliasing UB.
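For completeness, the may_alias typedef variant would look something like the following untested sketch (GNU C only; it assumes ptr and end are const uint8_t* / char* pointers as discussed above, and that the buffer is at least 8 bytes long):

// untested sketch, GNU C only: an unaligned, alias-safe 64-bit load type
typedef uint64_t unaligned_u64 __attribute__((aligned(1), may_alias));

uint64_t final_bytes = *reinterpret_cast<const unaligned_u64*>(end - 8);
final_bytes >>= 8 * (8 - (end - ptr));   // drop the already-counted low bytes
total += _mm_popcnt_u64(final_bytes);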
Or even better, use more sophisticated logic to avoid page-crossing. This can avoid page-crossing for a 6-byte buffer at the start of a page, for example, or other cases where the last 8 bytes are split across pages but the bytes not counted by your SIMD loop happen to come only from the final page. Then you want to load past the end of the buffer and left-shift or bzhi to isolate the bits you want at the top or bottom of the register.
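A hedged, untested sketch of that bzhi variant (requires BMI2; memcpy comes from <cstring>; it assumes ptr and end are byte pointers and that reading up to 7 bytes past end stays within a mapped page):

// untested sketch: load starting at ptr, past the end, then keep only the valid low bits
uint64_t tail;
memcpy(&tail, ptr, sizeof(tail));          // may read up to 7 bytes past 'end'
tail = _bzhi_u64(tail, 8 * (end - ptr));   // zero everything above the uncounted bytes
total += _mm_popcnt_u64(tail);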
Maxim mentions __builtin_ia32_bextr_u64 (BMI1 bextr) in comments, but Intel implements it as 2 uops (AMD as 1), and setting up the start and len 8-bit bitfields in a register will also take some instructions (e.g. shift/OR, or a write to a high-8 register like AH, which might stall the front-end for a cycle when read).

AMD Piledriver has an immediate form of bextr in the TBM extension, but Intel never picked it up, and AMD dropped it for Zen.

If you need the same bitfield extract repeatedly, especially on an AMD CPU, bextr is useful. But I don't think it's better than a right shift for this use-case; you need the same shift count in a register as you would for a shift, but also need a length.
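To illustrate, a hypothetical side-by-side (untested; raw8 is the unshifted 8-byte load ending at end, and nbytes = end - ptr, between 1 and 8):

// untested: both isolate the uncounted high bytes of raw8, right-justified
uint64_t via_shift = raw8 >> (8 * (8 - nbytes));                      // one shift, count in a register
uint64_t via_bextr = _bextr_u64(raw8, 8 * (8 - nbytes), 8 * nbytes);  // needs a start *and* a length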
Upvotes: 4
Reputation: 132128
@HadiBreis' comment links to an article on fast population-count with SSSE3, by Wojciech Muła; the article links to this GitHub repository; and the repository has the following AVX-2 implementation. It's based on a vectorized lookup instruction, using a 16-entry lookup table for the bit counts of nibbles.
#include <immintrin.h>
#include <x86intrin.h>
#include <cstddef>   // size_t
#include <cstdint>   // std::uint64_t, uint8_t

// lookup8bit is a 256-entry per-byte popcount table, defined elsewhere in the repository
std::uint64_t popcnt_AVX2_lookup(const uint8_t* data, const size_t n) {

    size_t i = 0;

    const __m256i lookup = _mm256_setr_epi8(
        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4,

        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4
    );

    const __m256i low_mask = _mm256_set1_epi8(0x0f);

    __m256i acc = _mm256_setzero_si256();

#define ITER { \
        const __m256i vec = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i)); \
        const __m256i lo  = _mm256_and_si256(vec, low_mask); \
        const __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(vec, 4), low_mask); \
        const __m256i popcnt1 = _mm256_shuffle_epi8(lookup, lo); \
        const __m256i popcnt2 = _mm256_shuffle_epi8(lookup, hi); \
        local = _mm256_add_epi8(local, popcnt1); \
        local = _mm256_add_epi8(local, popcnt2); \
        i += 32; \
    }

    while (i + 8*32 <= n) {
        __m256i local = _mm256_setzero_si256();
        ITER ITER ITER ITER
        ITER ITER ITER ITER
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(local, _mm256_setzero_si256()));
    }

    __m256i local = _mm256_setzero_si256();

    while (i + 32 <= n) {
        ITER
    }

    acc = _mm256_add_epi64(acc, _mm256_sad_epu8(local, _mm256_setzero_si256()));

#undef ITER

    uint64_t result = 0;

    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 0));
    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 1));
    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 2));
    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 3));

    for (/**/; i < n; i++) {
        result += lookup8bit[data[i]];
    }

    return result;
}
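If you want the exact signature from the question, a trivial wrapper around the function above is enough:

// wrapper matching the signature requested in the question
uint64_t count_bits(void* ptr, size_t size) {
    return popcnt_AVX2_lookup(static_cast<const uint8_t*>(ptr), size);
}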
The same repository also has a VPOPCNT-based AVX-512 implementation. Before listing the code for it, here's the simplified and more readable pseudocode:
For every consecutive sequence of 64 bytes:
- Load the sequence into a SIMD register with 64x8 = 512 bits
- Perform 8 parallel population counts of 64 bits each on that register
- Add the 8 population-count results in parallel, into an "accumulator" register holding 8 sums
Sum up the 8 values in the accumulator
If there's a tail of less than 64 bytes, count the bits there in some simpler way
Return the main sum plus the tail sum
And now for the real deal:
#include <immintrin.h>
#include <x86intrin.h>
#include <cstddef>   // size_t
#include <cstdint>   // uint64_t, uint8_t

uint64_t avx512_vpopcnt(const uint8_t* data, const size_t size) {

    const size_t chunks = size / 64;

    const uint8_t* ptr = data;
    const uint8_t* end = ptr + size;

    // count using AVX512 registers
    __m512i accumulator = _mm512_setzero_si512();
    for (size_t i=0; i < chunks; i++, ptr += 64) {

        // Note: a short chain of dependencies, likely unrolling will be needed.
        const __m512i v = _mm512_loadu_si512((const __m512i*)ptr);
        const __m512i p = _mm512_popcnt_epi64(v);

        accumulator = _mm512_add_epi64(accumulator, p);
    }

    // horizontal sum of the accumulator register
    uint64_t tmp[8] __attribute__((aligned(64)));
    _mm512_store_si512((__m512i*)tmp, accumulator);

    uint64_t total = 0;
    for (size_t i=0; i < 8; i++) {
        total += tmp[i];
    }

    // popcount the tail
    while (ptr + 8 < end) {
        total += _mm_popcnt_u64(*reinterpret_cast<const uint64_t*>(ptr));
        ptr += 8;
    }

    while (ptr < end) {
        total += lookup8bit[*ptr++];
    }

    return total;
}
lookup8bit is a popcount lookup table for bytes rather than bits, and is defined here. edit: As commenters note, using an 8-bit lookup table at the end is not a very good idea and can be improved on.
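For example, following the approach in the other answer, the final byte-by-byte LUT loop could be replaced with a single hardware popcnt on the remaining bytes. A hedged, untested sketch (memcpy comes from <cstring>; it assumes the buffer is at least 8 bytes long so end-8 doesn't point before the start):

// untested sketch: replace the final byte-by-byte LUT loop
if (ptr < end) {
    uint64_t last8;
    memcpy(&last8, end - 8, sizeof(last8));   // alias/alignment-safe load of the last 8 bytes
    last8 >>= 8 * (8 - (end - ptr));          // drop the bytes that were already counted
    total += _mm_popcnt_u64(last8);
}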
Upvotes: 7