Huy Le

Reputation: 1738

Fast integer division and modulo with a const runtime divisor

int n_attrs = some_input_from_other_function(); // [2..5000]
vector<int> corr_indexes; // size = n_attrs * n_attrs
vector<char> selected; // size = n_attrs
vector<pair<int,int>> selectedPairs; // size = n_attrs / 2
// vector::reserve everything here
...
// optimize the code below
const int npairs = n_attrs * n_attrs;
selectedPairs.clear();
for (int i = 0; i < npairs; i++) {
    const int x = corr_indexes[i] / n_attrs;
    const int y = corr_indexes[i] % n_attrs;
    if (selected[x] || selected[y]) continue; // fit inside L1 cache
    
    // below lines are called max 2500 times, so they're insignificant
    selected[x] = true;
    selected[y] = true;
    selectedPairs.emplace_back(x, y);
    if (selectedPairs.size() == n_attrs / 2) break;
}

I have a function that looks like this. The bottleneck is in

    const int x = corr_indexes[i] / n_attrs;
    const int y = corr_indexes[i] % n_attrs;

n_attrs is constant during the loop, so I wish to find a way to speed up this loop. Both corr_indexes[i] and n_attrs are > 0 and < max_int32. Edit: please note that n_attrs isn't a compile-time constant.

How can I optimize this loop? No extra library is allowed. Also, is there any way to parallelize this loop (either CPU or GPU is okay; everything is already in GPU memory before this loop)?

Upvotes: 5

Views: 3309

Answers (4)

Huy Le

Reputation: 1738

So the actual best solution in my case.

Instead of representing index = row * n_cols + col, do index = (row << 16) | col for 32 bits, or index = (row << 32) | col for 64 bits. Then row = index >> 16 and col = index & 0xFFFF (or row = index >> 32 and col = index & 0xFFFFFFFF in the 64-bit case). Or even better, just do uint16_t* pairs = reinterpret_cast<uint16_t*>(index_array);, then pairs[i], pairs[i+1] for each even i form a pair.

This is assuming the number of rows/columns is less than 2^16 (or 2^32).
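
For illustration, a minimal sketch of this packed-index idea (the helper names are mine, and it assumes both row and column fit in 16 bits):

#include <cstdint>

// Pack/unpack a (row, col) pair into one 32-bit index: no division needed.
inline uint32_t pack_index(uint32_t row, uint32_t col) { return (row << 16) | col; }
inline uint32_t unpack_row(uint32_t index)             { return index >> 16; }
inline uint32_t unpack_col(uint32_t index)             { return index & 0xFFFFu; }

// Alternatively, reinterpret the packed array as 16-bit halves and read them
// directly (layout depends on endianness; little-endian shown here):
// const uint16_t* halves = reinterpret_cast<const uint16_t*>(corr_indexes.data());
// uint16_t col = halves[2 * i];      // low 16 bits
// uint16_t row = halves[2 * i + 1];  // high 16 bits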

I'm still keeping the top answer because it still answers the case where division has to be used.

Upvotes: 0

njuffa

Reputation: 26085

I am restricting my comments to integer division, because to first order the modulo operation in C++ can be viewed and implemented as an integer division plus a back-multiply and subtraction, although in some cases there are cheaper ways of computing the modulo directly, e.g. when computing modulo 2^n.
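
For illustration, once a fast quotient is available the remainder follows from one multiply and one subtract:

int q = a / d;      // quotient, by whatever fast method is used
int r = a - q * d;  // remainder recovered via back-multiply and subtract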

Integer division is pretty slow on most platforms, based on either software emulation or an iterative hardware implementation. However, it was widely reported last year, based on microbenchmarking, that Apple's M1 has blazingly fast integer division, presumably by using dedicated circuitry.

Ever since a seminal paper by Torbjörn Granlund and Peter Montgomery almost thirty years ago it has been widely known how to replace integer divisions with constant divisors by using an integer multiply plus possibly a shift and / or other correction steps. This algorithm is often referred to as the magic-multiplier technique. It requires precomputation of some relevant parameters from the integer divisor for use in the multiply-based emulation sequence.

Torbjörn Granlund and Peter L. Montgomery, "Division by invariant integers using multiplication," ACM SIGPLAN Notices, Vol. 29, June 1994, pp. 61-72 (online).

At present, all major toolchains incorporate variants of the Granlund-Montgomery algorithm when dealing with integer divisors that are compile-time constants. The pre-computation occurs at compilation time inside the compiler, which then emits code using the computed parameters. Some toolchains may also use this algorithm for divisions by run-time constant divisors that are used repeatedly. For run-time constant divisors in loops, this could involve emitting a pre-computation block prior to a loop to compute the necessary parameters, and then using those for the division emulation code inside the loop.

If one's toolchain does not optimize divisions with run-time constant divisor one can use the same approach manually as demonstrated by the code below. However, this is unlikely to achieve the same efficiency as a compiler-based solution, because not all machine operations used in the desired emulation sequence can be expressed efficiently at C++ level in a portable manner. This applies in particular to arithmetic right shifts and add-with-carry.

The code below demonstrates the principle of parameter precomputation and integer division emulation via multiplication. It is quite likely that by investing more time into the design than I was willing to expend for this answer more efficient implementations of both parameter precomputation and emulation can be identified.

#include <cstdio>
#include <cstdlib>
#include <cstdint>

#define PORTABLE  (1)

uint32_t ilog2 (uint32_t i)
{
    uint32_t t = 0;
    i = i >> 1;
    while (i) {
        i = i >> 1;
        t++;
    }
    return (t);
}

/* Based on: Granlund, T.; Montgomery, P.L.: "Division by Invariant Integers 
   using Multiplication". SIGPLAN Notices, Vol. 29, June 1994, pp. 61-72
*/
void prepare_magic (int32_t divisor, int32_t &multiplier, int32_t &add_mask, int32_t &sign_shift)
{
    uint32_t divisoru, d, n, i, j, two_to_31 = uint32_t (1) << 31;
    uint64_t m_lower, m_upper, k, msb, two_to_32 = uint64_t (1) << 32;

    divisoru = uint32_t (divisor);
    d = (divisor < 0) ? (0 - divisoru) : divisoru;
    i = ilog2 (d);
    j = two_to_31 % d;
    msb = two_to_32 << i;
    k = msb / (two_to_31 - j);
    m_lower = msb / d;
    m_upper = (msb + k) / d;
    n = ilog2 (uint32_t (m_lower ^ m_upper));
    n = (n > i) ? i : n;
    m_upper = m_upper >> n;
    i = i - n;
    multiplier = int32_t (uint32_t (m_upper));
    add_mask = (m_upper >> 31) ? (-1) : 0;
    sign_shift = int32_t ((divisoru & two_to_31) | i);
}

int32_t arithmetic_right_shift (int32_t a, int32_t s)
{
    uint32_t msb = uint32_t (1) << 31;
    uint32_t ua = uint32_t (a);
    ua = ua >> s;
    msb = msb >> s;
    return int32_t ((ua ^ msb) - msb);
}

int32_t magic_division (int32_t dividend, int32_t multiplier, int32_t add_mask, int32_t sign_shift)
{
    int64_t prod = int64_t (dividend) * multiplier;
    int32_t quot = (int32_t)(uint64_t (prod) >> 32);
    quot = int32_t (uint32_t (quot) + (uint32_t (dividend) & uint32_t (add_mask)));
#if PORTABLE
    const int32_t byte_mask = 0xff;
    quot = arithmetic_right_shift (quot, sign_shift & byte_mask);
#else // PORTABLE
    quot = quot >> sign_shift; // must mask shift count & use arithmetic right shift
#endif // PORTABLE
    quot = int32_t (uint32_t (quot) + (uint32_t (dividend) >> 31));
    if (sign_shift < 0) quot = -quot;
    return quot;
}

int main (void)
{
    int32_t multiplier;
    int32_t add_mask;
    int32_t sign_shift;
    int32_t divisor;
    
    for (divisor = -20; divisor <= 20; divisor++) {
        /* avoid division by zero */
        if (divisor == 0) {
            divisor++;
            continue;
        }
        printf ("divisor=%d\n", divisor);
        prepare_magic (divisor, multiplier, add_mask, sign_shift);
        printf ("multiplier=%d add_mask=%d sign_shift=%d\n", 
                multiplier, add_mask, sign_shift);
        printf ("exhaustive test of dividends ... ");
        uint32_t dividendu = 0;
        do {
            int32_t dividend = (int32_t)dividendu;
            /* avoid overflow in signed integer division */
            if ((divisor == (-1)) && (dividend == ((-2147483647)-1))) {
                dividendu++;
                continue;
            }
            int32_t res = magic_division (dividend, multiplier, add_mask, sign_shift);
            int32_t ref = dividend / divisor;
            if (res != ref) {
                printf ("\nERR dividend=%d (%08x) divisor=%d  res=%d  ref=%d\n",
                        dividend, (uint32_t)dividend, divisor, res, ref);
                return EXIT_FAILURE;
            }
            dividendu++;
        } while (dividendu);
        printf ("PASSED\n");
    }
    return EXIT_SUCCESS;
}
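
Applied to the loop from the question, the precomputation would happen once before the loop and the emulation inside it, roughly like this (a sketch, reusing the question's variable names):

int32_t multiplier, add_mask, sign_shift;
prepare_magic (n_attrs, multiplier, add_mask, sign_shift);  // once, outside the loop

for (int i = 0; i < npairs; i++) {
    const int x = magic_division (corr_indexes[i], multiplier, add_mask, sign_shift);
    const int y = corr_indexes[i] - x * n_attrs;            // modulo via back-multiply
    // ... rest of the loop body unchanged ...
}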

Upvotes: 14

Jérôme Richard

Reputation: 50348

How can I optimize this loop?

This is a perfect use-case for libdivide. This library has been designed to speed up division by a constant at run-time by using the strategy compilers use at compile-time. The library is header-only, so it does not create any run-time dependency. It also supports the vectorization of divisions (i.e. using SIMD instructions), which is definitely something to use in this case to drastically speed up the computation, and which compilers cannot do without changing the loop significantly (and in the end it would not be as efficient because of the run-time-defined divisor). Note that the license of libdivide is very permissive (zlib), so you can easily include it in your project without strong constraints (you basically just need to mark it as modified if you change it).
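
For illustration, the usage looks roughly like this (a sketch based on libdivide's C++ interface; check the library's documentation for the exact API):

#include "libdivide.h"

const libdivide::divider<int> fast_n(n_attrs);   // precompute once, before the loop

for (int i = 0; i < npairs; i++) {
    const int x = corr_indexes[i] / fast_n;      // multiply+shift instead of a div
    const int y = corr_indexes[i] - x * n_attrs; // modulo via back-multiply
    // ... rest of the loop body unchanged ...
}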

If header-only libraries are not OK, then you need to reinvent the wheel. The idea is to transform a division by a constant into a sequence of shifts and multiplications. The very good answer of @njuffa specifies how to do that. You can also read the code of libdivide, which is highly optimized.

For small positive divisors and small positive dividends, there is no need for a long sequence of operations. You can cheat with a basic sequence:

uint64_t dividend = corr_indexes[i]; // Must not be too big
uint64_t divider = n_attrs;
uint64_t magic_factor = 4294967296 / n_attrs + 1; // Must be precomputed once
uint32_t result = (dividend * magic_factor) >> 32;

This method should be safe for uint16_t dividends/divisors, but it is not for much bigger values. In practice, it fails for dividend values above roughly 800,000. Bigger dividends require a more complex sequence, which is also generally slower.
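A quick way to check where the cheap sequence stops being exact for a given divisor is to compare it against the true quotient (a small sketch; the function name is mine):

#include <cstdint>

// Returns the first dividend (up to max_dividend) for which the cheap
// sequence disagrees with exact division, or 0 if none was found.
uint64_t first_failure(uint32_t divisor, uint64_t max_dividend)
{
    const uint64_t magic_factor = 4294967296ull / divisor + 1;
    for (uint64_t dividend = 1; dividend <= max_dividend; dividend++) {
        const uint32_t approx = uint32_t((dividend * magic_factor) >> 32);
        if (approx != dividend / divisor) return dividend;
    }
    return 0;
}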

is there any way to parallelize this loop

Only the division/modulus can be safely parallelized. There is a loop-carried dependency in the rest of the loop that prevents any parallelization (unless additional assumptions are made). Thus, the loop can be split in two parts: one that computes the divisions and puts the uint16_t results in a temporary array, and one that then processes that array serially. The array must not be too big, since the computation would otherwise be memory-bound and the resulting parallel code could be slower than the current one. Thus, you need to operate on small chunks that fit in at least the L3 cache. If chunks are too small, then thread synchronization can also be an issue. The best solution is certainly to use a rolling window of chunks. All of this is certainly a bit tedious/tricky to implement.
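
A minimal sketch of that chunked two-pass structure, reusing the names from the question (single-threaded here; the first pass is the part that can be vectorized or handed to threads):

#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int CHUNK = 4096;                    // small enough to stay in cache
std::vector<uint16_t> xs(CHUNK), ys(CHUNK);    // temporary per-chunk results

bool done = false;
for (int base = 0; base < npairs && !done; base += CHUNK) {
    const int end = std::min(base + CHUNK, npairs);

    // Pass 1: only divisions/modulos, no dependency between iterations,
    // so this part can be vectorized (e.g. with libdivide) or parallelized.
    for (int i = base; i < end; i++) {
        xs[i - base] = corr_indexes[i] / n_attrs;
        ys[i - base] = corr_indexes[i] % n_attrs;
    }

    // Pass 2: the selection logic stays serial because of the dependency.
    for (int i = base; i < end; i++) {
        const int x = xs[i - base], y = ys[i - base];
        if (selected[x] || selected[y]) continue;
        selected[x] = true;
        selected[y] = true;
        selectedPairs.emplace_back(x, y);
        if ((int)selectedPairs.size() == n_attrs / 2) { done = true; break; }
    }
}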

Note that SIMD instructions can be used for the division part (easy with libdivide). You also need to split the loop and use chunks, but the chunks do not need to be big since there is no synchronization overhead. Something like 64 integers should be enough.


Note that recent processors can compute divisions like this efficiently, especially for 32-bit integers (64-bit ones tend to be significantly more expensive). This is especially the case for the Alder Lake, Zen3 and M1 processors (P-cores). Note that both the modulus and the division are computed by one instruction on x86/x86-64 processors. Also note that while division has a pretty high latency, many processors can pipeline multiple divisions so as to get a reasonable throughput. For example, a 32-bit div instruction has a latency of 23~28 cycles on Skylake but a reciprocal throughput of 4~6. This is apparently not the case on Zen1/Zen2.

Upvotes: 4

huseyin tugrul buyukisik

Reputation: 11920

I would optimize the part after // optimize the code below by:

  • taking n_attrs
  • generating a function string like this:
void dynamicFunction(MyType & selectedPairs, Foo & selected)
{
    const int npairs = @@ * @@;
    selectedPairs.clear();
    for (int i = 0; i < npairs; i++) {
        const int x = corr_indexes[i] / @@;
        const int y = corr_indexes[i] % @@;
        if (selected[x] || selected[y]) continue; // fit inside L1 cache
    
        // below lines are called max 2500 times, so they're insignificant
        selected[x] = true;
        selected[y] = true;
        selectedPairs.emplace_back(x, y);
        if (selectedPairs.size() == @@ / 2) 
            break;
    }
}

  • replacing all @@ with value of n_attrs
  • compiling it, generating a DLL
  • linking and calling the function

So that n_attrs is a compile-time constant value for the DLL and the compiler can automatically do most of its optimizations on the value, like:

  • doing n&(x-1) instead of n%x when x is power-of-2 value
  • shifting and multiplying instead of dividing
  • maybe other optimizations too, like unrolling the loop with precalculated indices for x and y (since x is known)
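
A rough sketch of the generate-compile-load flow (POSIX-only; it assumes g++ and dlopen are available, and all names here are illustrative, not a finished implementation):

#include <cstdio>
#include <cstdlib>
#include <string>
#include <dlfcn.h>   // POSIX dynamic loading

// Signature of the generated kernel; kept simple for the sketch.
using KernelFn = void (*)(const int* corr_indexes, char* selected,
                          int* pairs_out, int* n_pairs_out);

KernelFn build_kernel_for(int n_attrs)
{
    // 1. Generate a translation unit in which n_attrs is a compile-time constant.
    std::string src =
        "extern \"C\" void kernel(const int* corr_indexes, char* selected,\n"
        "                         int* pairs_out, int* n_pairs_out)\n"
        "{\n"
        "    constexpr int n_attrs = " + std::to_string(n_attrs) + ";\n"
        "    // ... same loop as above; / n_attrs and % n_attrs are now\n"
        "    // divisions by a literal, so the compiler emits multiply/shift ...\n"
        "}\n";
    FILE* f = std::fopen("kernel.cpp", "w");
    std::fputs(src.c_str(), f);
    std::fclose(f);

    // 2. Compile it into a shared library.
    std::system("g++ -O3 -march=native -shared -fPIC kernel.cpp -o kernel.so");

    // 3. Load the library and fetch the function pointer (cache it per n_attrs).
    void* handle = dlopen("./kernel.so", RTLD_NOW);
    return handle ? reinterpret_cast<KernelFn>(dlsym(handle, "kernel")) : nullptr;
}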

Some integer math operations in tight loops are easier for the compiler to SIMDify/vectorize when more of the parts are known at compile time.

If your CPU is AMD, you can even try magic floating-point operations in place of unknown/unknown division to get vectorization.

By caching all (or a big percentage of) the values of n_attrs, you can get rid of the latencies of:

  • string generation
  • compiling
  • file(DLL) reading (assuming some object-oriented wrapping of DLLs)

If the part to be optimized will run on a GPU, there is a high possibility that the CUDA/OpenCL implementation already does the integer division by means of floating-point (to keep the SIMD path occupied instead of being serialized on integer division), or is simply capable of doing it directly as SIMD integer operations, so you may just use it as it is on the GPU, except for the std::vector, which is not supported by all C++ CUDA compilers (and not in OpenCL kernels). These host-environment-related parts could be computed after the kernel (with the parts excluding emplace_back, or exchanged with a struct that works on the GPU) is executed.

Upvotes: 2
