njuffa

Reputation: 26185

Efficient modulo-255 computation

I am trying to find the most efficient way to compute modulo 255 of a 32-bit unsigned integer. My primary focus is to find an algorithm that works well across x86 and ARM platforms, with an eye towards applicability beyond that. To first order, I am trying to avoid memory operations (which could be expensive), so I am looking for bit-twiddly approaches while avoiding tables. I am also trying to avoid potentially expensive operations such as branches and multiplies, and to minimize the number of operations and registers used.

The ISO-C99 code below captures the eight variants I have tried so far. It includes a framework for an exhaustive test. I bolted onto this some crude execution time measurement, which seems to work well enough to get a first performance impression. On the few platforms I tried (all with fast integer multiplies) the variants WARREN_MUL_SHR_2, WARREN_MUL_SHR_1, and DIGIT_SUM_CARRY_OUT_1 seem to be the most performant. My experiments show that the x86, ARM, PowerPC and MIPS compilers I tried at Compiler Explorer all make very good use of platform-specific features such as three-input LEA, byte-expansion instructions, multiply-accumulate, and instruction predication.

The variant NAIVE_USING_DIV uses an integer division, then a back-multiply with the divisor followed by a subtraction. This is the baseline case. Modern compilers know how to efficiently implement the unsigned integer division by 255 (via multiplication) and will use a discrete replacement for the back-multiply where appropriate. To compute modulo base-1, one can sum the base digits, then fold the result. For example, 3334 mod 9: sum 3+3+3+4 = 13, fold 1+3 = 4. If the result after folding is base-1, we need to generate 0 instead. DIGIT_SUM_THEN_FOLD uses this method.
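The decimal example above can be sketched in code. This is only an illustration of the sum-then-fold principle in base 10 (it uses division to extract digits, so it is not meant to be efficient); the function names are hypothetical and not part of the test framework below.

```c
#include <stdint.h>

// Sum the decimal digits of x once.
static uint32_t digit_sum10 (uint32_t x)
{
    uint32_t s = 0;
    while (x) {
        s += x % 10;
        x /= 10;
    }
    return s;
}

// x mod 9 by repeated digit summing ("folding"); a final 9 is mapped to 0.
uint32_t mod9_by_digit_sum (uint32_t x)
{
    while (x > 9) {
        x = digit_sum10 (x);
    }
    return (x == 9) ? 0 : x;
}
```

For 3334 this goes 3334 -> 13 -> 4, matching the worked example.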

A. Cockburn, "Efficient implementation of the OSI transport protocol checksum algorithm using 8/16-bit arithmetic", ACM SIGCOMM Computer Communication Review, Vol. 17, No. 3, July/Aug. 1987, pp. 13-20

showed a different way of adding digits modulo base-1 efficiently in the context of a checksum computation modulo 255. Compute a byte-wise sum of the digits, and after each addition, add any carry-out from the addition as well. So this would be an ADD a, b, ADC a, 0 sequence. Writing out the addition chain for this using base 256 digits it becomes clear that the computation is basically a multiply with 0x0101 ... 0101. The result will be in the most significant digit position, except that one needs to capture the carry-out from the addition in that position separately. This method only works when a base digit comprises 2^k bits. Here we have k=3. I tried three different ways of remapping a result of base-1 to 0, resulting in variants DIGIT_SUM_CARRY_OUT_1, DIGIT_SUM_CARRY_OUT_2, DIGIT_SUM_CARRY_OUT_3.
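To make the ADD/ADC idea concrete, here is a minimal byte-at-a-time sketch of the end-around-carry sum. It is written for clarity, not speed, and the helper names are hypothetical; the multiply-based variants in the code below compute the same digit sum in one step.

```c
#include <stdint.h>

// Add two base-256 digits; any carry-out is added back in (ADD a, b; ADC a, 0).
static uint8_t add_end_around_carry (uint8_t a, uint8_t b)
{
    uint32_t s = (uint32_t)a + b;      // 9-bit intermediate sum
    return (uint8_t)(s + (s >> 8));    // fold the carry-out back into the byte
}

// Sum the four bytes of x with end-around carry. The result is congruent to
// x modulo 255 and lies in [0, 255], so 255 still has to be remapped to 0.
uint32_t mod255_end_around_carry (uint32_t x)
{
    uint8_t t = (uint8_t)x;
    t = add_end_around_carry (t, (uint8_t)(x >> 8));
    t = add_end_around_carry (t, (uint8_t)(x >> 16));
    t = add_end_around_carry (t, (uint8_t)(x >> 24));
    return (t == 255) ? 0 : t;
}
```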

An intriguing approach to computing modulo-63 efficiently was demonstrated by Joe Keane in the newsgroup comp.lang.c on 1995/07/09. While thread participant Peter L. Montgomery proved the algorithm correct, unfortunately Mr. Keane did not respond to requests to explain its derivation. This algorithm is also reproduced in H. Warren's Hacker's Delight 2nd ed. I was able to extend it, in purely mechanical fashion, to modulo-127 and modulo-255. This is the (appropriately named) KEANE_MAGIC variant. Update: Since I originally posted this question, I have worked out that Keane's approach is basically a clever fixed-point implementation of the following: return (uint32_t)(fmod (x * 256.0 / 255.0 + 0.5, 256.0) * (255.0 / 256.0));. This makes it a close relative of the next variant.

Henry S. Warren, Hacker's Delight 2nd ed., p. 272 shows a "multiply-shift-right" algorithm, presumably devised by the author themself, that is based on the mathematical property that n mod (2^k - 1) = floor (2^k / (2^k - 1) * n) mod 2^k. Fixed point computation is used to multiply with the factor 2^k / (2^k - 1). I constructed two variants of this that differ in how they handle the mapping of a preliminary result of base-1 to 0. These are variants WARREN_MUL_SHR_1 and WARREN_MUL_SHR_2.
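The property can be checked directly with exact integer arithmetic. The sketch below is not Warren's fixed-point code; it evaluates the floor expression with a 64-bit division purely to illustrate the identity. The function name and the restriction on k are assumptions made for this example.

```c
#include <stdint.h>

// n mod (2^k - 1) computed as floor (2^k / (2^k - 1) * n) mod 2^k, using an
// exact 64-bit division instead of a fixed-point approximation of the factor.
// Assumes 1 <= k <= 31 so that 2^k * n fits into 64 bits.
uint32_t mod_pow2_minus_1 (uint32_t n, unsigned k)
{
    uint64_t pow2 = 1ULL << k;
    return (uint32_t)((pow2 * n / (pow2 - 1)) & (pow2 - 1));
}
```

With k = 8 this yields n mod 255. Because the division is exact here, no remapping of 255 to 0 is needed; that remapping only becomes necessary once the factor is approximated from below.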

Are there algorithms for modulo-255 computation that are even more efficient than the three top contenders I have identified so far, in particular for platforms with slow integer multiplies? An efficient modification of Keane's multiplication-free algorithm for the summing of four base 256 digits would seem to be of particular interest in this context.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NAIVE_USING_DIV       (1)
#define DIGIT_SUM_THEN_FOLD   (2)
#define DIGIT_SUM_CARRY_OUT_1 (3)
#define DIGIT_SUM_CARRY_OUT_2 (4)
#define DIGIT_SUM_CARRY_OUT_3 (5)
#define KEANE_MAGIC           (6)  // Joe Keane, comp.lang.c, 1995/07/09
#define WARREN_MUL_SHR_1      (7)  // Hacker's Delight, 2nd ed., p. 272
#define WARREN_MUL_SHR_2      (8)  // Hacker's Delight, 2nd ed., p. 272

#define VARIANT (WARREN_MUL_SHR_2)

uint32_t mod255 (uint32_t x)
{
#if VARIANT == NAIVE_USING_DIV
    return x - 255 * (x / 255);
#elif VARIANT == DIGIT_SUM_THEN_FOLD
    x = (x & 0xffff) + (x >> 16);
    x = (x & 0xff) + (x >> 8);
    x = (x & 0xff) + (x >> 8) + 1;
    x = (x & 0xff) + (x >> 8) - 1;
    return x;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_1
    uint32_t t;
    t = 0x01010101 * x;
    t = (t >> 24) + (t < x);
    if (t == 255) t = 0;
    return t;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_2
    uint32_t t;
    t = 0x01010101 * x;
    t = (t >> 24) + (t < x) + 1;
    t = (t & 0xff) + (t >> 8) - 1;
    return t;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_3
    uint32_t t;
    t = 0x01010101 * x;
    t = (t >> 24) + (t < x);
    t = t & ((t - 255) >> 8);
    return t;
#elif VARIANT == KEANE_MAGIC
    x = (((x >> 16) + x) >> 14) + (x << 2);
    x = ((x >> 8) + x + 2) & 0x3ff;
    x = (x - (x >> 8)) >> 2;
    return x;
#elif VARIANT == WARREN_MUL_SHR_1
    x = (0x01010101 * x + (x >> 8)) >> 24;
    x = x & ((x - 255) >> 8);
    return x;
#elif VARIANT == WARREN_MUL_SHR_2
    x = (0x01010101 * x + (x >> 8)) >> 24;
    if (x == 255) x = 0;
    return x;
#else
#error unknown VARIANT
#endif
}

uint32_t ref_mod255 (uint32_t x)
{
    volatile uint32_t t = x;
    t = t % 255;
    return t;
}

// timing with microsecond resolution
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

int main (void)
{
    double start, stop;
    uint32_t res, ref, x = 0;

    printf ("Testing VARIANT = %d\n", VARIANT);
    start = second();
    do {
        res = mod255 (x);
        ref = ref_mod255 (x);
        if (res != ref) {
            printf ("error @ %08x: res=%08x ref=%08x\n", x, res, ref);
            return EXIT_FAILURE;
        }        
        x++;
    } while (x);
    stop = second();
    printf ("test passed\n");
    printf ("elapsed = %.6f seconds\n", stop - start);
    return EXIT_SUCCESS;
}

Upvotes: 20

Views: 2286

Answers (5)

Aki Suihkonen

Reputation: 20037

If we had a builtin, intrinsic, or method that is optimised to a single add-with-carry instruction (addc), one could use 32-bit arithmetic in the following way:

uint32_t carry = 0;
// sum up top and bottom 16 bits while generating carry out
x = __builtin_addc(x, x<<16, carry, &carry);
x &= 0xffff0000;
// store the previous carry to bit 0 while adding
// bits 16:23 over bits 24:31, and producing one more carry
x = __builtin_addc(x, x << 8, carry, &carry);  
x = __builtin_addc(x, x >> 24, carry, &carry);  
x &= 0x0000ffff;   // actually 0x1ff is enough
// final correction for 0<=x<=257, i.e. min(x,x-255)
x = x < x-255 ? x : x - 255;  

In Arm64 at least the regular add instruction can take the form of add r0, r1, r2 LSL 16; the masking with immediate or clearing consecutive bits is a single instruction bfi r0, wzr, #start_bit, #length.
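For compilers without such a builtin, the sequence above can be tried out with a portable stand-in. The sketch below assumes a hypothetical helper addc32 that emulates add-with-carry through 64-bit arithmetic; it shows the data flow only and says nothing about the instruction count a real addc would give.

```c
#include <stdint.h>

// Hypothetical portable stand-in for a 32-bit add-with-carry builtin:
// returns a + b + carry_in and stores the carry-out (0 or 1) in *carry_out.
static uint32_t addc32 (uint32_t a, uint32_t b, uint32_t carry_in,
                        uint32_t *carry_out)
{
    uint64_t s = (uint64_t)a + b + carry_in;
    *carry_out = (uint32_t)(s >> 32);
    return (uint32_t)s;
}

uint32_t mod255_addc (uint32_t x)
{
    uint32_t carry = 0;
    x = addc32 (x, x << 16, carry, &carry);  // top half = high 16 + low 16 bits
    x &= 0xffff0000;
    x = addc32 (x, x << 8, carry, &carry);   // bit 0 = first carry, top byte = byte sum
    x = addc32 (x, x >> 24, carry, &carry);  // add top byte and second carry at bit 0
    x &= 0x0000ffff;                         // now 0 <= x <= 257
    return (x < x - 255) ? x : x - 255;      // unsigned min (x, x - 255)
}
```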

For parallel calculation, widening multiplication can't be used as efficiently. Instead, one can divide and conquer while calculating carries: starting with 16 uint32_t elements interpreted as 16+16 uint16_t elements, then moving to uint8_t arithmetic, one can calculate one result in slightly less than one instruction.

a0 = vld2q_u16(ptr);     // split input to top16+bot16 bits
a1 = vld2q_u16(ptr + 8); // load more inputs
auto b0 = vaddq_u16(a0.val[0], a0.val[1]);
auto b1 = vaddq_u16(a1.val[0], a1.val[1]);
auto c0 = vcltq_u16(b0, a0.val[1]); // 8 carries
auto c1 = vcltq_u16(b1, a1.val[1]); // 8 more carries
b0 = vsubq_u16(b0, c0);
b1 = vsubq_u16(b1, c1);
auto d = vuzpq_u8(b0, b1);
auto result = vaddq_u8(d.val[0], d.val[1]);
auto carry = vcltq_u8(result, d.val[1]);
result = vsubq_u8(result, carry);
auto is_255 = vceqq_u8(result, vdupq_n_u8(255));
result = vbicq_u8(result, is_255);

Upvotes: 2

David Eisenstat

Reputation: 65498

This method (improved slightly since the previous edit) mashes up Warren and Keane. On my laptop, it’s faster than Keane but not as fast as a 64-bit multiply and shift. It avoids multiplication but benefits from a single rotate instruction. Unlike the original version, it’s probably OK on RISC-V.

Like Warren, this method approximates ⌊(256/255) x mod 256⌋ in 8.24 fixed point. Mod 256, each byte b contributes a term (256/255) b, which is approximately b.bbb base 256. The original version of this method just sums all four byte rotations. (I’ll get to the revised version in a moment.) This sum always underestimates the real value, but by less than 4 units in the last place. By adding 4 · 2^−24 before truncating, we guarantee the right answer as in Keane.
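A minimal sketch of what the original version described above might look like. It is reconstructed from the prose rather than taken from the answer, and the names rotl32 and mod255_four_rotations are hypothetical.

```c
#include <stdint.h>

// Rotate left by r bits, assuming 0 < r < 32.
static uint32_t rotl32 (uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32 - r));
}

// Sum all four byte rotations of x (every byte lands in every byte position),
// add 4 units in the last place of the 8.24 fixed-point value, and keep the
// integer part, i.e. the top byte.
uint32_t mod255_four_rotations (uint32_t x)
{
    uint32_t sum = x + rotl32 (x, 8) + rotl32 (x, 16) + rotl32 (x, 24);
    return (sum + 4) >> 24;
}
```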

The revised version saves work by relaxing the approximation quality. We write (256/255) x = (257/256) (65536/65535) x, evaluate (65536/65535) x in 16.16 fixed point (i.e., add x to its 16-bit rotation), and then multiply by 257/256 and mod by 256 into 8.24 fixed point. The first multiplication has error less than 2 units in the last place of 16.16, and the second is exact (!). The sum underestimates by less than (2/2^16) (257/256), so a constant term of 514/2^24 suffices to fix the truncation. It’s also possible to use a greater value in case a different immediate operand is more efficient.

uint32_t mod255(uint32_t x) {
  x += (x << 16) | (x >> 16);
  return ((x << 8) + x + 514) >> 24;
}

Upvotes: 7

Adrian Mole

Reputation: 51874

For arbitrary unsigned integers, x and n, evaluating the modulo expression x % n involves (conceptually, at least), three operations: division, multiplication and subtraction:

quotient = x / n;
product = quotient * n;
modulus = x - product;

However, when n is a power of 2 (n = 2^p), the modulo can be determined much more rapidly, simply by masking out all but the lower p bits.

On most CPUs, addition, subtraction and bit-masking are very 'cheap' (rapid) operations, multiplication is more 'expensive' and division is very expensive – but note that most optimizing compilers will convert division by a compile-time constant into a multiplication (by a different constant) and a bit-shift (vide infra).

Thus, if we can convert our modulo 255 into a modulo 256, without too much overhead, we can likely speed up the process. We can do just this by noting that x % n is equivalent to (x + x / n) % (n + 1). Thus, our conceptual operations are now: division, addition and masking.
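The equivalence holds for any positive n, which a short sketch can illustrate. The function name is hypothetical; 64-bit arithmetic is used so that x + x / n cannot wrap when n + 1 is not a power of two (for n = 255, wrap-around modulo 2^32 is harmless because 256 divides 2^32).

```c
#include <stdint.h>

// x % n computed as (x + x / n) % (n + 1), assuming n > 0.
// With x = n*q + r, the sum x + q equals (n + 1)*q + r, and r < n + 1.
uint32_t mod_via_next_modulus (uint32_t x, uint32_t n)
{
    uint64_t t = (uint64_t)x + x / n;
    return (uint32_t)(t % ((uint64_t)n + 1));
}
```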

In the specific case of masking the lower 8 bits, x86/x64-based CPUs (and others?) will likely be able to perform a further optimization, as they can access 8-bit versions of (most) registers.

Here's what the clang-cl compiler generates for a naïve modulo 255 function (argument passed in ecx and returned in eax):

unsigned Naive255(unsigned x)
{
    return x % 255;
}

    mov     edx, ecx
    mov     eax, 2155905153 ;
    imul    rax, rdx        ; Replacing the IDIV with IMUL and SHR
    shr     rax, 39         ;
    mov     edx, eax
    shl     edx, 8
    sub     eax, edx
    add     eax, ecx

And here's the (clearly faster) code generated using the 'trick' described above:

unsigned Trick255(unsigned x)
{
    return (x + x / 255) & 0xFF;
}

    mov     eax, ecx
    mov     edx, 2155905153
    imul    rdx, rax
    shr     rdx, 39
    add     edx, ecx
    movzx   eax, dl         ; Faster than an explicit AND mask?

Testing this code on a Windows-10 (64-bit) platform (Intel® Core™ i7-8550U CPU) shows that it significantly (but not hugely) out-performs the other algorithms presented in the question.


The answer given by David Eisenstat explains how/why this equivalence is valid.

Upvotes: 14

David Eisenstat

Reputation: 65498

Guess you're probably not looking for solutions that require fast 64-bit multiplication, but for the record:

return (x * 0x101010101010102ULL) >> 56;
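Wrapped into a function, the one-liner might look as follows. The function name is hypothetical and the explicit casts are added only for clarity. The constant is floor (2^64 / 255) + 1, so the top byte of the low 64 bits of the product holds floor ((256/255) x) mod 256, rounded slightly upward.

```c
#include <stdint.h>

uint32_t mod255_mul64 (uint32_t x)
{
    // 0x0101010101010102 == floor (2^64 / 255) + 1; the product wraps modulo 2^64
    return (uint32_t)(((uint64_t)x * 0x0101010101010102ULL) >> 56);
}
```

Because the multiplier is rounded up rather than down, multiples of 255 come out as 0 directly in the sampled cases, without a separate 255-to-0 fixup.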

Upvotes: 9

David Eisenstat

Reputation: 65498

Here’s my sense of how the fastest answers work. I don’t know yet whether Keane can be improved or easily generalized.

Given an integer x ≥ 0, let q = ⌊x/255⌋ (in C, q = x / 255;) and r = x − 255 q (in C, r = x % 255;) so that q ≥ 0 and 0 ≤ r < 255 are integers and x = 255 q + r.

Adrian Mole’s method

This method evaluates (x + ⌊x/255⌋) mod 2^8 (in C, (x + x / 255) & 0xff), which equals (255 q + r + q) mod 2^8 = (2^8 q + r) mod 2^8 = r.

Henry S. Warren’s method

Note that x + ⌊x/255⌋ = ⌊x + x/255⌋ = ⌊(2^8/255) x⌋, where the first step follows from x being an integer. This method uses the multiplier (2^0 + 2^−8 + 2^−16 + 2^−24 + 2^−32) instead of 2^8/255, which is the sum of the infinite series 2^0 + 2^−8 + 2^−16 + 2^−24 + 2^−32 + …. Since the approximation is slightly under, this method must detect the residue 2^8 − 1 = 255.
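A small illustration of why the residue 255 has to be detected, using the 32-bit form from the question's WARREN_MUL_SHR variants without the final remapping step. The function name warren_raw is hypothetical. In 8.24 fixed point, the terms 2^0 through 2^−24 correspond to the multiplication by 0x01010101 and the 2^−32 term corresponds to x >> 8.

```c
#include <stdint.h>

// Truncated-multiplier step only: the result is x mod 255, except that it
// can come out as 255 where the true residue is 0.
uint32_t warren_raw (uint32_t x)
{
    return (0x01010101 * x + (x >> 8)) >> 24;
}
```

For example, warren_raw (255) and warren_raw (510) both give 255, while warren_raw (254) gives 254 and warren_raw (256) gives 1.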

Joe Keane’s method

The intuition for this method is to compute y = (2^8/255) x mod 2^8, which equals (2^8/255) (255 q + r) mod 2^8 = (2^8 q + (2^8/255) r) mod 2^8 = (2^8/255) r, and return y − y/2^8, which equals r.

Since these formulas don’t use the fact that ⌊(2^8/255) r⌋ = r, Keane can switch from 2^8 to 2^10 for two guard bits. Ideally, these would always be zero, but due to fixed-point truncation and an approximation for 2^10/255, they’re not. Keane adds 2 to switch from truncation to rounding, which also avoids the special case in Warren.

This method sort of uses the multiplier 2^2 (2^0 + 2^−8 + 2^−16 + 2^−24 + 2^−32 + 2^−40) = 2^2 (2^0 + 2^−16 + 2^−32) (2^0 + 2^−8). The C statement x = (((x >> 16) + x) >> 14) + (x << 2); computes x′ = ⌊2^2 (2^0 + 2^−16 + 2^−32) x⌋ mod 2^32. Then ((x >> 8) + x) & 0x3ff is x′′ = ⌊(2^0 + 2^−8) x′⌋ mod 2^10.

I don’t have time right now to do the error analysis formally. Informally, the error interval of the first computation has width < 1; the second, width < 2 + 2^−8; the third, width < ((2 − 2^−8) + 1)/2^2 < 1, which allows correct rounding.

Regarding improvements, the 2^−40 term of the approximation seems not necessary (?), but we might as well have it unless we can drop the 2^−32 term. Dropping 2^−32 pushes the approximation quality out of spec.

Upvotes: 9
