invertedPanda

Reputation: 328

How to optimize cell-width measuring with SIMD (find the first column to have a non-zero byte in an 8x8 block of bytes)

I have an algorithm that measures the width of each cell (8x8) in a bitmap (128x128) by counting the distance from the start of the cell to the first column within it that contains only zeroes. If there is no such column, the cell is assigned a width of 8.

Here is the source-code for the program:

#include <stdint.h>

uint8_t bitmap[128][128];  // all 8 bits are used, so 0..255
int widths[16][16];        // any integer type is fine, like uint8_t

void measure_cell(int i, int j)
{
    int x = i * 8;
    int x_end = x + 8;

    /* Horizontal sweep */
    while (x < x_end) {
        int y = j * 8;
        int y_end = y + 8;

        /* Vertical sweep */
        while (y < y_end && !bitmap[y][x])
            ++y;

        /* All-zero column? */
        if (y == y_end)
            break;

        ++x;
    }

    widths[j][i] = 8 - (x_end - x);
}

int main()
{
    /* Load bitmap from file */
    // ...

    /* Calculate widths for each cell */
    for (int j = 0; j < 16; ++j)
        for (int i = 0; i < 16; ++i)
            measure_cell(i, j);

    /* Print widths to stdout */
    // ...

    return 0;
}

Is there any way to speed this up using SIMD? Assume I am targeting x86-64.

Upvotes: 3

Views: 138

Answers (2)

invertedPanda

Reputation: 328

Here is my attempt at SIMD for this problem, based on Peter Cordes' instructive answer.

#include <immintrin.h>  // AVX2 + BMI1 intrinsics (compile with -mavx2 -mbmi)
#include <stdalign.h>
#include <stdint.h>

alignas(32) uint8_t bitmap[128][128]; // all 8 bits are used, so 0..255
alignas(32) uint8_t widths[16][16];   // any integer type is fine

/* Computes the widths of the four cells i..i+3 of row j in one pass */
void measure_cell(int i, int j)
{
    int const x = i * 8;
    int const y = j * 8;

    __m256i v = _mm256_load_si256((__m256i *) &bitmap[y][x]);
    for (int k = 1; k < 8; ++k)
        v = _mm256_or_si256(v, _mm256_load_si256((__m256i *) &bitmap[y + k][x]));

    /* One mask bit per column; a set bit marks an all-zero column. */
    union { int i; uint8_t b[4]; } amask;
    amask.i = _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, _mm256_setzero_si256()));

    /* Bit-scan each cell's 8-bit chunk; | 1 << 8 clamps the result to 8. */
    for (int k = 0; k < 4; ++k)
        widths[j][i + k] = _tzcnt_u32(amask.b[k] | 1 << 8);
}

int main()
{
    // ...

    /* Calculate for 4x1 chunks of cells */
    for (int j = 0; j < 16; ++j)
        for (int i = 0; i < 16; i += 4)
            measure_cell(i, j);

    // ...
}

I've verified that this program produces the same results as the original, but I still need to benchmark it against it. As this is my first attempt at SIMD, I'd appreciate any feedback.

Upvotes: 1

Peter Cordes

Reputation: 365257

  • Vertically OR 8 vectors of row data (so a byte of the result is only 0 if all bytes in the column were zero).
    _mm256_loadu_si256 and _mm256_or_si256.
    (or aligned load if you use alignas(32) on your array.)

    If it's useful, you could instead use pmaxub (_mm_max_epu8) to get the largest value in each column of the cell, which is likewise 0 only if every byte in the column is 0. But if you're not going to use the max value for anything, just use bitwise OR.

  • Compare / movemask to get a bitmask of which columns were all-zero.
    _mm256_cmpeq_epi8(v, _mm256_setzero_si256()) / _mm256_movemask_epi8

  • Split up that bitmask into 8-bit chunks (cell boundaries) and bit-scan (tzcnt / C++20 std::countr_zero() / C23 stdc_trailing_zeros) to find the position of the first (lowest) zero.
    Since you want to clamp the result to 8, actually do stdc_trailing_zeros(mask | (1<<8)), or I guess stdc_trailing_zeros_uc since unsigned char is 8 bits wide on x86.

_mm256_cmpeq_epi8 is AVX2 for __m256i vectors, producing a 32-bit bitmask (4 cells wide).
The same thing works with just SSE2 with __m128i vectors (16-bit mask = 2 cells).


If you have AVX-512, you could vectorize the bit-scan part, too. 63 - vplzcntq can be used as a tzcnt if you use a bithack to isolate the lowest set bit. (And special-case to get 64 instead of -1 I guess?). You don't need to booleanize into a vector mask (which would restrict you to 256-bit vectors since AVX-512 can only compare into mask registers). Instead just find the position of the lowest set bit in the 64-bit element, and right-shift by 3 to get index of lowest non-zero byte. That should conveniently clear the low 3 bits that have unwanted garbage for bit-index inside a byte.

Or store masks to temporary storage somewhere (perhaps the width array before replacing it with counts). Then later reload with vectors for a vectorized tzcnt per byte. There isn't a vplzcntb, but a bithack can efficiently set up for vpopcntb (AVX512BITALG in Ice Lake and later). We'd want to turn the trailing 0s into 1s and clear everything else.

(~v) & (v-1) makes a mask up to and not including the lowest 1 bit, or all-ones for an input of zero. (Using _mm512_sub_epi8 element size for the subtraction of course.) This is similar to (x-1) ^ x (scalar blsmsk) which makes a mask up to and including the lowest 1 bit.

See Trying to write a vectorized implementation of Gerd Isenberg's Bit Scan Forward as an exercise for more details.

Also related: https://catonmat.net/low-level-bit-hacks for bithack basics, and https://web.archive.org/web/20231120194321/https://graphics.stanford.edu/%7Eseander/bithacks.html#ZerosOnRightLinear for some bithacks to get ctz directly, rather than feed vpopcntb. (But that's probably worse: it would still I think need 3 steps of shift/mask/something, and x86 doesn't have 8-bit SIMD shifts.)


Even with just AVX2, there's probably something you can do with vpshufb lookup tables (like the popcount strategy: https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx2-lookup.cpp), or perhaps DeBruijn sequences with 16-bit multiplies after isolating the lowest set bit. See my vectorized BSF answer I linked in the previous section.

This might not be worth it vs. just doing four scalar tzcnt operations, since each 32 bits of mask data ends up in a GPR anyway after vpmovmskb. That's certainly easier, and might be fast enough even if it's possible to go a bit faster without AVX-512.

Upvotes: 4
