Optimizing Sieve of Eratosthenes, Adding Wheel to Bool Array

I know there's a few posts about how to implement wheels, but I'm really struggling to see how to implement one with my current approach to the sieve.

I created my own bit array in C, with the following implementation:

#define setBit1(Array, i) (Array[i/INT_BITS] |= (1 << i % INT_BITS))
#define getBit(Array, i) ((Array[i/INT_BITS] & (1 << i % INT_BITS)) ? 1 : 0)
#define setBit0(Array, i) (Array[i/INT_BITS] &= ~(1 << i % INT_BITS))

int * createBitArray(unsigned long long size) {

    // find bytes required, round down to nearest whole byte
    unsigned long long bytesRequired = size / BITS_PERBYTE;

    // round up to highest multiple of 4 or 8 bytes (one int) 
    bytesRequired = (sizeof(int) * (bytesRequired / sizeof(int) + 
        ((size % BITS_PERBYTE * sizeof(int) == 0) ? 0 : 1)));

    // allocate array of "bits", round number of ints required up 
    return (int *)malloc((bytesRequired)); 

}

I've done a few tests in C using the clock(), and I've found that for large array sizes of more than 1,000,000, the bit array, even with its bit manipulations, is at least 200% faster than an int array. It also uses 1/32 of the memory.

#define indexToNum(n) (2*n + 1) 
#define numToIndex(n) ((n - 1) / 2)
typedef unsigned long long LONG; 

// populates prime array through Sieve of Eratosthenes, taking custom 
// odd keyed bit array, and the raw array length, as arguments
void findPrimes(int * primes, LONG arrLength) {

    long sqrtArrLength = (long)((sqrt((2 * arrLength) + 1) - 1) / 2); 
    long maxMult = 0; 
    long integerFromIndex = 0; 

    for (int i = 1; i <= sqrtArrLength; i++) {

         if (!getBit(primes, i)) {
             integerFromIndex = indexToNum(i); 
             maxMult = (indexToNum(arrLength)) / integerFromIndex; 

             for (int j = integerFromIndex; j <= maxMult; j+= 2) {
                 setBit1(primes, numToIndex((integerFromIndex*j)));     
            }

        }   
    }
}

I've been populating the bit array with the index, i, representing a number obtained through (2i + 1). This has the benefit of reducing any time spent iterating over even numbers, and once again reducing the necessary memory of the array by half. 2 is manually added to the primes after. This comes with a consequence of the time spent translating between index and number, but with my tests, for more than 1,000 primes, this approach is faster.

I'm stumped on how I could further optimize; I've reduced array size, I'm only testing to sqrt(n), I'm starting the "sieving" of primes upwards from p * p, I've eliminated all evens, and it's still taking me about 60 seconds for the first 100,000,000 primes in C.

As far as I'm aware, the "wheel" method requires that the actual integers of numbers be stored in the index. I'm really stuck on implementing it with my current bit array.

Upvotes: 1

Answers (2)

Matt Timmermans

Reputation: 59233

When I fix your implementation and run it on my Macbook Pro, it takes 17 seconds to mark off all composites <= 2^31, which is pretty fast.

There are some other things you can try, though. Using a wheel may cut your time in half.

Euler's sieve is linear time if it's carefully implemented, but it requires an int array instead of a bit array.

The Atkin sieve takes linear time and is very practical: https://en.wikipedia.org/wiki/Sieve_of_Atkin

And finally my own (that just means I haven't seen it anywhere else, but I didn't really look either) super fun sieve also takes linear time and finds all the primes <= 2^31 in 6.5 seconds. Thanks for giving me an excuse to post it:

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <memory.h>
#include <math.h>

#define SETBIT(mem,num) ( ((uint8_t *)mem)[(num)>>4] |= ((uint8_t)1)<<(((num)>>1)&7) )

int main(int argc, char *argv[])
{
    //we'll find all primes <= this
    uint32_t MAXTEST = (1L<<31)-1;
    //which will be less than this many
    size_t MAXPRIMES = 110000000;

    //We'll find this many primes with the sieve of Eratosthenes.
    //After that, we'll switch to a linear time algorithm
    size_t NUMPREMARK = 48;

    //Allocate a bit array for all odd numbers <= MAXTEST
    size_t NBYTES = (MAXTEST>>4)+1;
    uint8_t *mem = malloc(NBYTES);
    memset(mem, 0, NBYTES);

    //We'll store the primes in here
    unsigned *primes = (unsigned *)malloc(MAXPRIMES*sizeof(unsigned));
    size_t nprimes = 0;


    clock_t start_t = clock();

    //first premark with sieve or Eratosthenes
    primes[nprimes++]=2;
    for (uint32_t test=3; nprimes<NUMPREMARK; test+=2)
    {
        if ( mem[test>>4] & ((uint8_t)1<<((test>>1)&7)) )
        {
            continue;
        }
        primes[nprimes++]=test;
        uint32_t inc=test<<1;
        for(uint32_t prod=test*test; prod<=MAXTEST; prod+=inc)
        {
            SETBIT(mem,prod);
        }
    }


    //Iterate through all products of the remaining primes and mark them off, 
    //in linear time. Products are visited in lexical order of their  
    //prime factorizations, with factors in descending order.

    //stacks containing the current prime indexes and partial products for 
    //prefixes of the current product
    size_t stksize=0;
    size_t indexes[64];
    uint32_t products[64];

    for (uint32_t test=primes[NUMPREMARK-1]+2; test<=MAXTEST; test+=2)
    {
        if ( mem[test>>4] & ((uint8_t)1<<((test>>1)&7)) )
        {
            continue;
        }

        //found a prime!  iterate through all products that start with this one
        //They can only contain the same or smaller primes

        //current product
        uint32_t curprod = (uint32_t)test;
        indexes[0] = nprimes;
        products[0] = curprod;
        stksize = 1;

        //remember the found prime (first time through, nprimes == NUMPREMARK)
        primes[nprimes++] = curprod;

        //when we extend the product, we add the first non-premarked prime
        uint32_t extensionPrime = primes[NUMPREMARK];
        //which we can only do if the current product is at most this big
        uint32_t extensionMax = MAXTEST/primes[NUMPREMARK];

        while (curprod <= extensionMax)
        {
            //extend the product with the smallest non-premarked prime
            indexes[stksize]=NUMPREMARK;
            products[stksize++]=(curprod*=extensionPrime);
            SETBIT(mem, curprod);
        }

        for (;;)
        {
            //Can't extend current product.
            //Pop the stacks until we get to a factor we can increase while keeping
            //the factors in descending order and keeping the product small enough
            if (--stksize <= 0)
            {
                //didn't find one
                break;
            }
            if (indexes[stksize]>=indexes[stksize-1])
            {
                //can't increase this factor without breaking descending order
                continue;
            }

            uint64_t testprod=products[stksize-1];
            testprod*=primes[++(indexes[stksize])];
            if (testprod>MAXTEST)
            {
                //can't increase this factor without exceeding our array size
                continue;
            }
            //yay! - we can increment here to the next composite product
            curprod=(uint32_t)testprod;
            products[stksize++] = curprod;
            SETBIT(mem, curprod);

            while (curprod <= extensionMax)
            {
                //extend the product with the smallest non-premarked prime
                indexes[stksize]=NUMPREMARK;
                products[stksize++]=(curprod*=extensionPrime);
                SETBIT(mem, curprod);
            }
        }
    }
    clock_t end_t = clock();
    printf("Found %ld primes\n", nprimes);
    free(mem);
    free(primes);
    printf("Time: %f\n", (double)(end_t - start_t) / CLOCKS_PER_SEC);
}

Note that my sieve starts with a sieve or Eratosthenes that is a little bit better optimized than yours. The major difference is that we only allocate bits for odd numbers in the bit mask array. The speed difference in that part is not significant.

Upvotes: 1

MrSmith42

Reputation: 10151

It will always be slower, because of the bit-operation overhead.

But you can try to optimize that.

setBit1(Array, i) can be improved by using a constant array of all preshiftet bits ( I call it ONE_BIT )

NEW:
#define setBit1(Array, i) (Array[i/INT_BITS] |= ONE_BIT[ i % INT_BITS])

same for setBit0(Array, i)
NEW:
#define setBit0(Array, i) (Array[i/INT_BITS] &= ALL_BUT_ONE_BIT[ i % INT_BITS])

Also INT_BITS is most likely a power of 2, so you can replace
i % INT_BITS
by
i & (INT_BITS-1) // of cause you should store INT_BITS-1 in a constant and use that

If and how mutch this can speed up your code, has to be checked by profiling each change.

Upvotes: 0

Optimizing Sieve of Eratosthenes, Adding Wheel to Bool Array

Answers (2)

Related Questions