Reputation: 2851
I noticed a curious thing on my computer.* The handwritten divisibility test is significantly faster than the %
operator. Consider the minimal example:
* AMD Ryzen Threadripper 2990WX, GCC 9.2.0
static int divisible_ui_p(unsigned int m, unsigned int a)
{
if (m <= a) {
if (m == a) {
return 1;
}
return 0;
}
m += a;
m >>= __builtin_ctz(m);
return divisible_ui_p(m, a);
}
The example is limited by odd a
and m > 0
. However, it can be easily generalized to all a
and m
. The code just converts the division to a series of additions.
Now consider the test program compiled with -std=c99 -march=native -O3
:
for (unsigned int a = 1; a < 100000; a += 2) {
for (unsigned int m = 1; m < 100000; m += 1) {
#if 1
volatile int r = divisible_ui_p(m, a);
#else
volatile int r = (m % a == 0);
#endif
}
}
... and the results on my computer:
| implementation | time [secs] |
|--------------------|-------------|
| divisible_ui_p | 8.52user |
| builtin % operator | 17.61user |
Therefore more than 2 times faster.
The question: Can you tell me how the code behaves on your machine? Is it missed optimization opportunity in GCC? Can you do this test even faster?
UPDATE: As requested, here is a minimal reproducible example:
#include <assert.h>
static int divisible_ui_p(unsigned int m, unsigned int a)
{
if (m <= a) {
if (m == a) {
return 1;
}
return 0;
}
m += a;
m >>= __builtin_ctz(m);
return divisible_ui_p(m, a);
}
int main()
{
for (unsigned int a = 1; a < 100000; a += 2) {
for (unsigned int m = 1; m < 100000; m += 1) {
assert(divisible_ui_p(m, a) == (m % a == 0));
#if 1
volatile int r = divisible_ui_p(m, a);
#else
volatile int r = (m % a == 0);
#endif
}
}
return 0;
}
compiled with gcc -std=c99 -march=native -O3 -DNDEBUG
on AMD Ryzen Threadripper 2990WX with
gcc --version
gcc (Gentoo 9.2.0-r2 p3) 9.2.0
UPDATE2: As requested, the version that can handle any a
and m
(if you also want to avoid integer overflow, the test has to be implemented with integer type twice as long as the input integers):
int divisible_ui_p(unsigned int m, unsigned int a)
{
#if 1
/* handles even a */
int alpha = __builtin_ctz(a);
if (alpha) {
if (__builtin_ctz(m) < alpha) {
return 0;
}
a >>= alpha;
}
#endif
while (m > a) {
m += a;
m >>= __builtin_ctz(m);
}
if (m == a) {
return 1;
}
#if 1
/* ensures that 0 is divisible by anything */
if (m == 0) {
return 1;
}
#endif
return 0;
}
Upvotes: 23
Views: 952
Reputation: 2851
I will answer my question myself. It seems that I became a victim of branch prediction. The mutual size of the operands does not seem to matter, only their order.
Consider the following implementation
int divisible_ui_p(unsigned int m, unsigned int a)
{
while (m > a) {
m += a;
m >>= __builtin_ctz(m);
}
if (m == a) {
return 1;
}
return 0;
}
and the arrays
unsigned int A[100000/2];
unsigned int M[100000-1];
for (unsigned int a = 1; a < 100000; a += 2) {
A[a/2] = a;
}
for (unsigned int m = 1; m < 100000; m += 1) {
M[m-1] = m;
}
which are / are not shuffled using the shuffle function.
Without shuffling, the results are still
| implementation | time [secs] |
|--------------------|-------------|
| divisible_ui_p | 8.56user |
| builtin % operator | 17.59user |
However, once I shuffle these arrays, the results are different
| implementation | time [secs] |
|--------------------|-------------|
| divisible_ui_p | 31.34user |
| builtin % operator | 17.53user |
Upvotes: 9
Reputation: 15124
What you’re doing is called strength reduction: replacing an expensive operation with a series of cheap ones.
The mod instruction on many CPUs is slow, because it historically was not tested in several common benchmarks and the designers therefore optimized other instructions instead. This algorithm will perform worse if it has to do many iterations, and %
will perform better on a CPU where it needs only two clock cycles.
Finally, be aware that there are many shortcuts to take the remainder of division by specific constants. (Although compilers will generally take care of this for you.)
Upvotes: 12