Reputation: 21

Optimize code for large number of iterations

I am currently working on a project which involves a large number of iterations (2^32 to be exact). I have mainly been using mathematica for most of my calculations, but it can't handle that sort of amount of processes. It has been suggested to me that c++ would be able to handle it, so last night I learnt c++ and wrote the following code:

//old code

The code runs fine, (I checked on smaller parameters) but I have started running it for 4294967295 = 2^32-1 steps, and I think it will take hundreds of hours. I would be really grateful if someone could tell me if there are ways to optimize bits of this code so that it will run faster? I have no experience with this kind of language and so how I've constructed the functions probably looks quite messy. I think my Ca2step function runs quite efficiently (I'm probably wrong), and I think its my loops in the main section that are slowing everything down. I think there must be faster methods for what I am trying to accomplish, so any help would be great. Thanks, Richard.

======= UPDATE ========

Thanks so much everybody, I really appreciate it. Ok I this is all very new to me, so I'm finding it quite hard to understand what some things mean. Below is my updated code. However I get the feeling that it's still just as slow. Some people suggested "parallelising" but I don't know what this is and how I would do it? Thanks again, Richard.

#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

//parameters
int a[32] = {0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
             1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1};
int b[32] = {1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 
             1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1};
// Create vector of vectors from arrays to be input into function.
vector<int> va (a, a + sizeof(a) / sizeof(int) );
vector<int> vb (b, b + sizeof(b) / sizeof(int) );

vector< vector<int> > ca2step (long int r, vector< vector<int> > vec)
{
    int rulearray[32] = { 0 };
    for (int pos = 31; pos >= 0; --pos){
        if (r % 2) 
            rulearray[pos] = 1;
        r /= 2;
    }
    int arraya[32] = {0};
    int arrayb[32] = {0};
    for (int i = 0; i < 32; i++) {
        arraya[i] = vec[0][i];
        arrayb[i] = vec[1][i];
    }

    vector< vector<int> > output;
    typedef int t_array[32];
    t_array vll, vl, vr, vrr, vx;

    rotate_copy(arrayb,arrayb+2,arrayb+32,vll);
    rotate_copy(arrayb,arrayb+1,arrayb+32,vl);    
    rotate_copy(arrayb,arrayb+31,arrayb+32,vr);    
    rotate_copy(arrayb,arrayb+30,arrayb+32,vrr);


    for (int i = 0; i < 32; i++) {
        vx[i] = (arraya[i] + rulearray[(31 - (vll[i] + (2 * vl[i]) 
                                           + (4 * arrayb[i]) + (8 * vr[i]) + (16 * vrr[i])))]) % 2;
    }

    output.push_back(vector<int>(arrayb, arrayb+32));
    output.push_back(vector<int>(vx, vx+32));

    return (output);

}

int caevolve ( long int r, vector< vector<int> > vector ){
    int count;
    for(int j=0; j<20; j++){ 
        //run function
        vector = ca2step(r, vector);
    }
    if (vector[0] == va || vector[1] == va) {
        count = 1;
        }
    else{
        count=0;
    }
    return (count);
}

int main ()
{
    vector< vector<int> > vinput;
    vinput.reserve(32);
    vinput.push_back(va);
    vinput.push_back(vb); 
    int counter = 0;

    for(unsigned long long int i=0;i<4294967295;i++){  //4294967295
        counter += caevolve(i, vinput);
        }

    cout<< "Counter : " << counter << endl;

    return 0;

}

Upvotes: 2

Answers (10)

chunyun

Reputation: 51

I passed by this thread and have a look at the problem. But it has been quite a while. Anyway, I have given it a try using some bitwise operators and openmp.

My assumptions: 1. processing of binary numbers 2. all 32 bits

I have replaced all the arrays by a single int, because your 32 wide array containing only '0' and '1' fits just nice into one int(4 bytes). Doing this helps you eliminate a few loops and save memory accesses.

Updated* learnt some new tricks, updated with some minimal assembly code

#include <iostream>
using namespace std;

#define MASK 0x1F  /*last 5 bits*/

unsigned int toInt(int a[32]){
    int result = 0;
    for(int i = 0; i<32;i++)  
    if(a[i]==1) result |= 1 << (31-i);
    return result;
}

inline unsigned int ror(unsigned int v,unsigned int sh){
//rotate v to the right by sh
    asm("ror %1,%0;" :"=r"(v) : "cI"(sh), "0"(v) );
    return v;
}

unsigned int compute(unsigned int rule, unsigned int target){
    unsigned int t = rol(target,3);
    unsigned int d = 0;
    unsigned int k;
    for(int i=0;i<32;i++){
        k  = ( t & MASK );
        d |= ( (rule>>k) & 1 ) << (31-i) ;
        t  =  rol(t,1);      
    }
    return d;
}

int ca2step (unsigned int rule, unsigned int a, unsigned int b ){
    unsigned int xx = a;
    unsigned int yy = b;

    int tmp;
    unsigned int d,tmpyy;

    for (int j=0; j<19;j++){
        d = compute(rule,yy); 
        tmpyy = xx ^ d ;
        xx = yy;
        yy = tmpyy;
    }
    return ( yy == a || yy == b ) ;  
}    

int main (){

    int a[32] = {0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 
                 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1};
    int b[32] = {1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
                 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1};
    int counter = 0;
    unsigned int aa = toInt(a);
    unsigned int bb = toInt(b);

    #pragma omp parallel for reduction(+:counter)
    for(unsigned int i=0;i < 0xffffffff ;i++){
        counter += ca2step(i, aa, bb);
    }

    cout << counter <<"\n";

    return 0;
}

compile with:

g++ filename.cpp -O3 -fopenmp

Upvotes: 0

Richard Wells

Reputation: 21

thanks for all the help, I've finally got this working in a reasonable time (approx 11 hours). Just thought I'd share my code. I will need to run this several times in the next few weeks, so if there are any other tricks i could use to get the time down further, suggestions would be appreciated!

    #include <iostream>
using namespace std;

bool is_equal ( int a[], int b[]){
    for (int i=0; i<32; i++ ){
        if ( a[i] != b[i] )
            return false;
    }
    return true;
}

int ca2step (long long int rule, int arraya[32], int arrayb[32] ){
    int count =0;
    int x[32];
    int y[32];
    for(int i=0;i<32;i++){
        x[i] = arraya[i];
        y[i] = arrayb[i];
    }

    for (int j=0; j<19;j++){
            int arrayc[32];
            for (int i=0; i<32; i++){
            arrayc[i] = (x[i] + ((rule >> ( y[(i+2)%32] + (2 * y[(i+1)%32]) + 
                   (4 * y[i]) + (8 * y[(i+31)%32]) + (16 * y[(i+30)%32])) )& 1))%2;
            }

            for(int k=0;k<32;k++){ 
                x[k] = y[k];
                y[k] = arrayc[k];}
    }

    if(is_equal(y, arraya) || is_equal(y, arrayb)){
        count++;}  
    return(count);     
}

int main (){
    int a[32] = {0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 
                 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1};
    int b[32] = {1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
                 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1};
    int counter = 0;
    for(long long int i=0;i<10000000;i++){  //4294967295
        counter += ca2step(i, a, b);
        }

    cout << counter ;
    return 0;
}

Upvotes: 0

Van Zuzu

Reputation: 41

This should be done by the compiler to some extent. In your case you should try parallelizing your code.

Upvotes: 1

Juliano

Reputation: 811

Aside from C++ performance, you should consider to parallelize the code and take advantage of multicore architectures . It seems to me that your problem is a classical example to do this.

Upvotes: 3

Will Ness

Reputation: 71065

Move all your vectors outside your ca2step function; make them even global variables. Use vector::reserve() to extend their size before you start push_back()ing into them, you know all the sizes. Since ca2step will now work on arrays that are external to it, it doesn't need to return anything, so no need for vectors-of-two-vectors; just use those two vectors directly, and when you're done, just vector::clear() them.

Also you might need to change your loop variable type to unsigned long or unsigned long long.

Upvotes: 1

CapelliC

Reputation: 60014

I think you could get rid of the initial loop to fill rulearray, replacing with bit testing on r: to test the nth bit, you can use

(r & (1 << nth)) ? 1 : 0 ...

then usage of rulearray could be replaced by

arraya[i] + (r & (1 << (31 - (vll[i] + (2 * vl[i]) + (4 * arrayb[i]) + (8 * vr[i]) + (16 * vrr[i])) ?  1 : 0)

rotate_copy can be used with plain old arrays: and you can avoid much dynamic memory allocation using that, because all sizes are fixed. Enforce this using a typedef:

 typedef int t_array[32];
 t_array arraya, arrayb, vll, vl, vr, vrr, vx;

 rotate_copy(arrayb,arrayb+2,arrayb+32,vll);
 rotate_copy(arrayb,arrayb+1,arrayb+32,vl);    
 rotate_copy(arrayb,arrayb+31,arrayb+32,vr);    
 rotate_copy(arrayb,arrayb+30,arrayb+32,vrr);

Then just the final return value need a copy of the stack allocated arrays:

  output.push_back(vector<int>(arrayb,arrayb+32));
  output.push_back(vector<int>(vx,vx+32));

Upvotes: 0

huseyin tugrul buyukisik

Reputation: 11920

there are too many array accessing. You need a prefetch or more locals to represent those refetched array elements. Cache-friendly. Here read this

http://www.research.scea.com/research/pdfs/GDC2003_Memory_Optimization_18Mar03.pdf

Upvotes: 1

Happy Green Kid Naps

Reputation: 1673

Instrument/profile and run your code for say hundred thousand or one million iterations. Identify the parts of your code where substantial execution time is spent. Try and improve the performance of those portions. Repeat. Only when you are satisfied that you can improve no further, should you attempt to run it more than four billion times.

Upvotes: 1

Ben Voigt

Reputation: 283634

Jack has correctly identified that the memory allocation inside the vectors can be a substantial cost. So move the vectors outside the loop, and simply clear() them instead of creating brand new ones.

This will save at least one allocation/deallocation per vector per iteration.

Don't pass vectors by value, instead use const vector<vector<int>>& as the parameter type for ca2step. This will save a whole bunch of vector copies (and memory allocation and deallocation) for each iteration of the inside loop, which is a whole lot.

Inside ca2step, use stack arrays (maybe std::array) instead of vectors. That saves even more dynamic memory allocation. begin(arrayb) will work for both arrays and vectors (instead of arrayb.begin()).

Upvotes: 1

jrad

Reputation: 3190

You could use a LinkedList instead of a vector. LinkedLists have faster insertion (push_back for vectors) since they never need to resize themselves, which, at large numbers, can be a costly operation.

Upvotes: 0

Optimize code for large number of iterations

Answers (10)

Related Questions