Mode 12 of _mm_cmpistri

Question

Simd algorithm for sub-string search in a 2016 paper:

bool like(const uint8_t* string, __m128i pat, [...]) {
    size_t i = 0;
    while (i + 16 < str_len) {
        __m128i str = _mm_loadu_si128(&string[i]);
        size_t j = _mm_cmpistri(pat, str, 12);  // mode 12
        if (j >= 16) i += 16;
        else {
            if (j + pat_len <= 16) return true;
            i += j;
        }
    }
    // Process remainder
    if (i + pat_len <= str_len) {
        __m128i str = _mm_loadu_si128(&string[i]);
        size_t j = _mm_cmpestri(pat, pat_len,
                                str, str_len - i, 12);
        if (j < 16 && j + pat_len <= 16) return true;
    }
    return false;
}

What is mode 12 of _mm_cmpistri?

Is this quite slow?

Thanks.

Peter Cordes · Accepted Answer

pcmpistri has one per 2 clock throughput on Ryzen, one per 3 clocks on Skylake. It's one of the faster SSE4.2 string instructions, faster than the explicit-length ones. (https://agner.org/optimize/). It's pretty good for substring searches, but not for simpler strchr / memchr searches: How much faster are SSE4.2 string instructions than SSE2 for memcmp? and SSE42 & STTNI - PcmpEstrM is twice slower than PcmpIstrM, is it true?

Note that your title mentions _mm_cmpestri, the slow version for explicit-length strings. But your code uses _mm_cmpistri, the fast version for implicit-length strings.

(The rest of the code in that search loop should compile pretty efficiently. If the compiler uses a branch instead of cmov for the i+=16 vs. i+=j condition, branch prediction + speculative execution will hide the dependency so multiple iterations can be in flight at once, at the cost of a branch miss for most cases of finding a partial match at the end of an input vector. At least I think that's what the condition is for. Using cmov would create a data dependency between input vectors, and the instruction's latency is ~2 or 3x its throughput.)

I don't know how well it compares with a well-tuned strstr using AVX2 that avoids the SSE4.2 string instructions. I'd guess it might depend on the length of the substring you're searching for, or maybe other properties of the data like how many false-positive candidates for the start or end of the string you find.

The microbenchmarks you already found on https://github.com/WojciechMula/sse4-strstr should be good. Wojciech writes good code, and understands enough about tuning for various x86 uarches to really optimize well. I haven't looked at his string benchmarks, but I have looked at his popcnt code that explores using Harley-Seal with AVX512F vpernternlogd for a big speedup.

Intel's ISA ref manual (vol.2) has a whole section about modes for string instructions (Section 4.1, “Imm8 Control Byte Operation for PCMPESTRI / PCMPESTRM / PCMPISTRI / PCMPISTRM”), separate from the entry on https://www.felixcloutier.com/x86/pcmpistri.

Normally you'd write the mode in hex or binary, not decimal, because it has multiple bit-fields. 12 = 0b00001100.

Intel's intrinsics guide also has pseudocode for the full details on the operation, but it's pretty heavy going if you don't know the high-level purpose. Once you do, it can be helpful. https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=2403,6062,4147,948&techs=SSE4_2,AVX,AVX2&text=pcmpi

See also https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for a more readable guide to the various modes. Quoting part of it here:

Aggregation operations

The heart of a string-processing instruction is the aggregation operation (immediate bits [3:2]).
...
Equal ordered (imm[3:2] = 11). Substring search (strstr). The first operand contains a string to search for, the second is a string to search in. The bit mask includes 1 if the substring is found at the corresponding position:
 operand2 = "WhenWeWillBeWed!", operand1 = "We"
 IntRes1  =  000010000000100
After computing the aggregation function, IntRes1 can be complemented, expanded into byte mask (_mm_cmpistrm) or shrinked into index (_mm_cmpistri). The result is written into xmm0 or ECX registers. Intel manual explains these details well, so there is no need to repeat them here.

The low 2 bits of the byte (00) indicate the character format: in this case 00 unsigned BYTE.

(signed vs. unsigned is probably irrelevant for a mode that compares for equality, not range-based.)

Bits 5:4 are the "polarity", for dealing with the end of the string, I think.

Bit 6 is the bit-scan direction for the "index" versions of the instruction that return an index instead of mask. (like bsr vs. bsf). 0 in this case finds the start of the first match instead of the end of the last match.

Bit 7 (the high bit of the 8-bit immediate) is unused / reserved.

See also https://www.officedaytime.com/simd512e/simdimg/str.php?f=pcmpistri for tables / diagrams of the steps that lead to a result, and how different fields in the immediate modify / select the operations performed at various steps.

Mode 12 of _mm_cmpistri

Answers (1)

Related Questions