user541686
user541686

Reputation: 210633

Extremely fast is_iequal? (case-insensitive equality comparison)

I'm trying to figure out how to write a very fast is_iequal function, optimized for ASCII, to compare if two characters are equal in a case-insensitive manner.

The ultimate goal is for this functor to be used with boost::algorithm::starts_with, etc.

So far my attempt has produced the following:

#include <locale>
unsigned long fast_rand(void);

template<class Ch> struct is_iequal
{
    std::ctype<Ch> const &ctype;
    is_iequal(std::ctype<Ch> const &ctype) : ctype(ctype) { }
    bool operator()(Ch const c1, Ch const c2) const
    {
        return c1 == c2 ||
            ('a' <= c1 && c1 <= 'z' && c1 - 'a' == c2 - 'A') ||
            ('A' <= c1 && c1 <= 'Z' && c1 - 'A' == c2 - 'a') ||
            !(c1 <= '\x7F' && c2 <= '\x7F') &&
            ctype.toupper(c1) == ctype.toupper(c2);
    }
};

int main()
{
    size_t const N = 1 << 26;
    typedef wchar_t TCHAR;
    std::locale loc;
    std::ctype<TCHAR> const &ctype = std::use_facet<std::ctype<TCHAR> >(loc);
    is_iequal<TCHAR> const is_iequal(ctype);  // Functor

    TCHAR *s1 = new TCHAR[N], *s2 = new TCHAR[N];
    for (size_t i = 0; i < N; i++) { s1[i] = fast_rand() & 0x7F; }
    for (size_t i = 0; i < N; i++) { s2[i] = fast_rand() & 0x7F; }

    bool dummy = false;
    clock_t start = clock();
    for (size_t i = 0; i < N; i++) { dummy ^= is_iequal(s1[i], s2[i]); }
    printf("%u ms\n", (clock() - start) * 1000 / CLOCKS_PER_SEC, dummy);
}

unsigned long fast_rand(void)  // Fast RNG for testing (xorshf96)
{
    static unsigned long x = 123456789, y = 362436069, z = 521288629;

    x ^= x << 16;
    x ^= x >> 5;
    x ^= x << 1;

    unsigned long t = x;
    x = y;
    y = z;
    z = t ^ x ^ y;

    return z;
}

which, on my computer, runs in 584 ms (VC++ 2011 x64).

It's still a bit too slow for my application though -- it's still the bottleneck in my actual program, which causes a slight UI delay I'd like to get rid of if possible.

What can I do to optimize is_iequals further, without changing its interface?


Note: Yes, I am aware of the various problems with this code (UTF-16 handling, pedantic C++ issues with implicit casting to/from char, etc...) but they're irrelevant to my goal here so I'm completely ignoring them for the time being.

Upvotes: 4

Views: 1034

Answers (5)

benwills
benwills

Reputation: 1

My response is in C, but maybe it's still useful for you...

In the process of writing a faster (by ~3x) tolower function, I also tested various forms of case-insensitive ascii character equality on three different machines. Here's the fastest I could get, which was, surprisingly, slightly faster than and integer-based bit operation:

uint8_t is_eq_bsc(char c, char i){
    c |= (c | ' ');
    i |= (i | ' ');
    return c == i;
}

Here's the gist with the benchmarks on three different machines and the various code I tested: https://gist.github.com/benwills/5170f2e6adbb67e3ec4c

Upvotes: 0

Jerry Coffin
Jerry Coffin

Reputation: 490358

I'm getting faster results if I replace:

('a' <= a && a <= 'z' || 'A' <= a && a <= 'Z') ||

...from your answer with:

(unsigned char)((a & 0x20) - 'A') < 26 ||

The a & 0x20 will collapse lower to upper case (it will also affect some other characters, but we'll screen them out in a second). Subtracting 'A' then produces a negative number if the collapsed value is less than A. Converting to unsigned reduces that modulo UCHAR_MAX (typically 255) so the negative numbers become large positive numbers. Then with only one test, we find whether that started out as an upper- or lower-case letter.

Upvotes: 0

Yakk - Adam Nevraumont
Yakk - Adam Nevraumont

Reputation: 275740

128x128 lookup table. Always do this lookup (mask the inputs).

That reduces your branches to one (do you call to upper?). Calculate that one without doing && or || - use branchless logic.

Maybe make the table a whole byte squared. Also try using a tighter lookup table where you extract the lookup with bit twiddling, and more twiddling to determine if it is thrown out and == used instead. (a&b)|(c&~b) is b?a:c without a branch.

And branch prediction failures suck.

Pick table size after experimenting and testing.

So my table is bool equal[128][128] = {…} -- not a lookup then an equal, but just a lookup.

Upvotes: 2

user541686
user541686

Reputation: 210633

@Mysticial's comment together with a little bit of tweaking really seemed to help.

First I tried this:

template<class Ch>
struct is_iequal
{
    std::ctype<Ch> const &ctype;
    is_iequal(std::ctype<Ch> const &ctype) : ctype(ctype) { }
    bool operator()(Ch const a, Ch const b) const
    {
        return a == b ||
            ('a' <= a && a <= 'z' || 'A' <= a && a <= 'Z') &&
            (a & ~('a' - 'A')) == (b & ~('a' - 'A')) ||
            a > SCHAR_MAX && b > SCHAR_MAX &&
            ctype.toupper(a) == ctype.toupper(b);
    }
};

which didn't help much, but then I thought, hey, why not swap the two sides of the &&?

template<class Ch>
struct is_iequal
{
    std::ctype<Ch> const &ctype;
    is_iequal(std::ctype<Ch> const &ctype) : ctype(ctype) { }
    bool operator()(Ch const a, Ch const b) const
    {
        return a == b ||
            (a & ~('a' - 'A')) == (b & ~('a' - 'A')) &&
            ('a' <= a && a <= 'z' || 'A' <= a && a <= 'Z') ||
            a > SCHAR_MAX && b > SCHAR_MAX &&
            ctype.toupper(a) == ctype.toupper(b);
    }
};

That brought it down to 138 ms!

Upvotes: 2

Alexei Levenkov
Alexei Levenkov

Reputation: 100545

Consider inlining toLower for c<127 - memory cost will be small enough to be in cache but speed may be better:

char localToLow[128] =....
return c1 < 127 && c2 < 127 ? localToLow[c1]==localToLow[c2] :
    ctype.toupper(c1) == ctype.toupper(c2);

(< 127 can be replaced with ((c1 | c2) & ~127 ) :) )

Upvotes: 2

Related Questions