Reputation: 210633
I'm trying to figure out how to write a very fast is_iequal
function, optimized for ASCII, to compare if two characters are equal in a case-insensitive manner.
The ultimate goal is for this functor to be used with boost::algorithm::starts_with
, etc.
So far my attempt has produced the following:
#include <locale>
unsigned long fast_rand(void);
template<class Ch> struct is_iequal
{
std::ctype<Ch> const &ctype;
is_iequal(std::ctype<Ch> const &ctype) : ctype(ctype) { }
bool operator()(Ch const c1, Ch const c2) const
{
return c1 == c2 ||
('a' <= c1 && c1 <= 'z' && c1 - 'a' == c2 - 'A') ||
('A' <= c1 && c1 <= 'Z' && c1 - 'A' == c2 - 'a') ||
!(c1 <= '\x7F' && c2 <= '\x7F') &&
ctype.toupper(c1) == ctype.toupper(c2);
}
};
int main()
{
size_t const N = 1 << 26;
typedef wchar_t TCHAR;
std::locale loc;
std::ctype<TCHAR> const &ctype = std::use_facet<std::ctype<TCHAR> >(loc);
is_iequal<TCHAR> const is_iequal(ctype); // Functor
TCHAR *s1 = new TCHAR[N], *s2 = new TCHAR[N];
for (size_t i = 0; i < N; i++) { s1[i] = fast_rand() & 0x7F; }
for (size_t i = 0; i < N; i++) { s2[i] = fast_rand() & 0x7F; }
bool dummy = false;
clock_t start = clock();
for (size_t i = 0; i < N; i++) { dummy ^= is_iequal(s1[i], s2[i]); }
printf("%u ms\n", (clock() - start) * 1000 / CLOCKS_PER_SEC, dummy);
}
unsigned long fast_rand(void) // Fast RNG for testing (xorshf96)
{
static unsigned long x = 123456789, y = 362436069, z = 521288629;
x ^= x << 16;
x ^= x >> 5;
x ^= x << 1;
unsigned long t = x;
x = y;
y = z;
z = t ^ x ^ y;
return z;
}
which, on my computer, runs in 584 ms (VC++ 2011 x64).
It's still a bit too slow for my application though -- it's still the bottleneck in my actual program, which causes a slight UI delay I'd like to get rid of if possible.
What can I do to optimize is_iequals
further, without changing its interface?
Note: Yes, I am aware of the various problems with this code (UTF-16 handling, pedantic C++ issues with implicit casting to/from char
, etc...) but they're irrelevant to my goal here so I'm completely ignoring them for the time being.
Upvotes: 4
Views: 1034
Reputation: 1
My response is in C, but maybe it's still useful for you...
In the process of writing a faster (by ~3x) tolower function, I also tested various forms of case-insensitive ascii character equality on three different machines. Here's the fastest I could get, which was, surprisingly, slightly faster than and integer-based bit operation:
uint8_t is_eq_bsc(char c, char i){
c |= (c | ' ');
i |= (i | ' ');
return c == i;
}
Here's the gist with the benchmarks on three different machines and the various code I tested: https://gist.github.com/benwills/5170f2e6adbb67e3ec4c
Upvotes: 0
Reputation: 490358
I'm getting faster results if I replace:
('a' <= a && a <= 'z' || 'A' <= a && a <= 'Z') ||
...from your answer with:
(unsigned char)((a & 0x20) - 'A') < 26 ||
The a & 0x20
will collapse lower to upper case (it will also affect some other characters, but we'll screen them out in a second). Subtracting 'A'
then produces a negative number if the collapsed value is less than A
. Converting to unsigned reduces that modulo UCHAR_MAX (typically 255) so the negative numbers become large positive numbers. Then with only one test, we find whether that started out as an upper- or lower-case letter.
Upvotes: 0
Reputation: 275740
128x128 lookup table. Always do this lookup (mask the inputs).
That reduces your branches to one (do you call to upper?). Calculate that one without doing && or || - use branchless logic.
Maybe make the table a whole byte squared. Also try using a tighter lookup table where you extract the lookup with bit twiddling, and more twiddling to determine if it is thrown out and == used instead. (a&b)|(c&~b)
is b?a:c
without a branch.
And branch prediction failures suck.
Pick table size after experimenting and testing.
So my table is bool equal[128][128] = {…}
-- not a lookup then an equal, but just a lookup.
Upvotes: 2
Reputation: 210633
@Mysticial's comment together with a little bit of tweaking really seemed to help.
First I tried this:
template<class Ch>
struct is_iequal
{
std::ctype<Ch> const &ctype;
is_iequal(std::ctype<Ch> const &ctype) : ctype(ctype) { }
bool operator()(Ch const a, Ch const b) const
{
return a == b ||
('a' <= a && a <= 'z' || 'A' <= a && a <= 'Z') &&
(a & ~('a' - 'A')) == (b & ~('a' - 'A')) ||
a > SCHAR_MAX && b > SCHAR_MAX &&
ctype.toupper(a) == ctype.toupper(b);
}
};
which didn't help much, but then I thought, hey, why not swap the two sides of the &&
?
template<class Ch>
struct is_iequal
{
std::ctype<Ch> const &ctype;
is_iequal(std::ctype<Ch> const &ctype) : ctype(ctype) { }
bool operator()(Ch const a, Ch const b) const
{
return a == b ||
(a & ~('a' - 'A')) == (b & ~('a' - 'A')) &&
('a' <= a && a <= 'z' || 'A' <= a && a <= 'Z') ||
a > SCHAR_MAX && b > SCHAR_MAX &&
ctype.toupper(a) == ctype.toupper(b);
}
};
That brought it down to 138 ms!
Upvotes: 2
Reputation: 100545
Consider inlining toLower for c<127
- memory cost will be small enough to be in cache but speed may be better:
char localToLow[128] =....
return c1 < 127 && c2 < 127 ? localToLow[c1]==localToLow[c2] :
ctype.toupper(c1) == ctype.toupper(c2);
(< 127 can be replaced with ((c1 | c2) & ~127 ) :) )
Upvotes: 2