Reputation: 460

Calculate difference between multiples of two different numbers

This is an algorithmic problem. To keep it simple, say I have two doubles, A and B. I want to construct a function that will give me the difference until the next multiple of A or the next multiple of B, if that makes sense.

For instance, say A is 3 and B is 5.

Consider the multiples: (3,6,9,12,15) and (5,10,15).

I'd want the function to output: (3, 2, 1, 3, 1, 2, 3), since it takes 3 units to get to 3, then 2 more to get to 5, then 1 to 6, then 3 to 9, etc...

I hope this makes sense. Ideally it's a Python-esque generator (although I'm writing this in Arduino ~ C++). I need it to be fast - really fast.

Any help would really be appreciated. My pseudocode is below, but it's not that great.

a = 3
b = 5

current = 0
distToA = a
distToB = b
for i in xrange(100):
  if distToA > distToB: #B comes first
    print "Adding {0}".format(distToB)
    current += distToB
    distToA -= distToB
    distToB = b
  elif distToB > distToA: #A comes first
    print "Adding {0}".format(distToA)
    current += distToA
    distToB -= distToA
    distToA = a
  else: #Equal
    print "Adding {0}".format(distToA)
    current += distToA #Arbitrarily, could be distToB
    distToA = a
    distToB = b

EDIT: How would this look with multiple values? Not just a and b, but also c, d, e, etc.. I'd imagine I'd just do some more if statements, but the cost is higher (more operations per branch).

Upvotes: 10

Answers (3)

acarlon

Reputation: 17272

Let's start with some general points. It is pretty much always better to start out with intuitive code that will be understood by you and your coworkers. Then measure the performance and find bottlenecks. If you try to hyper-optimize from the outset, you will: -

make code that is complicated, error prone and less understandable.
most likely optimise code that would barely register a blip on the overall performance, while overlooking major bottlenecks. Unless you know the processor, compiler, programming language and environmental nuances back to front there is a good chance you will make the performance worse if you try to guess the optimizations.

It is best to measure, find bottlenecks, then improve the performance for those bottlenecks. If you suspect an algorithm / implementation is slow, then profile it. If you are wondering which algorithm / implementation will perform best, then race them. Test with varying data sets because what performs well for one set of input (3,5) might not for another (3, 500000000).
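As a rough illustration of "racing" two candidates, here is a minimal Python sketch using the standard `timeit` module. The function names (`diffs_subtract`, `diffs_modulo`, `race`) and the iteration counts are assumptions made up for this example, not part of the benchmarks reported below, and absolute timings will differ per machine and interpreter.

```python
import timeit


def diffs_subtract(a, b, n):
    """Subtraction-based variant: track the distance left to each next multiple."""
    out = []
    dist_a, dist_b = a, b
    for _ in range(n):
        step = min(dist_a, dist_b)
        out.append(step)
        dist_a -= step
        dist_b -= step
        if dist_a == 0:  # landed on a multiple of a
            dist_a = a
        if dist_b == 0:  # landed on a multiple of b
            dist_b = b
    return out


def diffs_modulo(a, b, n):
    """Modulo-based variant: recompute both distances from the running total."""
    out = []
    current = 0
    for _ in range(n):
        step = min(a - current % a, b - current % b)
        out.append(step)
        current += step
    return out


def race(a, b, n=10_000, repeat=5):
    """Return the best wall-clock time in seconds for each variant."""
    results = {}
    for fn in (diffs_subtract, diffs_modulo):
        timer = timeit.Timer(lambda: fn(a, b, n))
        results[fn.__name__] = min(timer.repeat(repeat=repeat, number=1))
    return results
```

Calling `race(3, 5)` and then `race(3, 500000000)` would show whether the ranking holds across very different inputs, which is the point of testing with varying data sets.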

Having said that, let's start with what you have and explore some options and then finally provide a starting implementation for the case that you described in your edit, i.e. multiple values. Please note that some of these implementations might not be ideal for your case or apply to your environment, but they touch on a general approach.

Status Quo (your code as-is)

This code does a few conditional and arithmetic operations. These are the kinds of operations that processors eat for breakfast...before they wake up...in the blink of a nanoparticle's eyelid, i.e. very fast. Now, I know that you are using Arduino and so will not have the most powerful processor in the world to play with, but still, these are the operations that processors do very quickly. I wanted to create some benchmarks of my own, so I implemented a very similar function to yours in C++ (you mentioned that C++ is OK in your question). I called the test ConditionalTest because it follows an if...else flow and because I am bad at names.

Note: while I have done some rudimentary tests on the results, the code provided in these answers is by no means production ready. It is missing basic parameter checks (such as null values or uninitialised variables) and has some performance optimizations that I would normally omit in preference for safety. Anyway, the code is:

static void ConditionalTest( int startA, int startB, unsigned long long testIterations )
{       
    gl_result = 0;      
    gl_current=0;
    int distToA = startA;
    int distToB = startB;

    for( unsigned long long i = 0; i < testIterations; i++ )
    {           
        if( distToA > distToB ) //B comes first
        {           
            gl_result = distToB;
            gl_current += distToB;
            distToA -= distToB;
            distToB = startB;               
        }
        else if( distToB > distToA ) //A comes first
        {       
            gl_result = distToA;
            gl_current += distToA;
            distToB -= distToA;
            distToA = startA;                               
        }
        else
        {       
            gl_result = distToA; 
            gl_current += distToA; //Arbitrarily, could be distToB
            distToA = startA;
            distToB = startB;
        }
    }    
}

Note: -

I assign the value to a global gl_result rather than printing it to save on filling up my console with messages and also because the operation of printing to screen takes ages compared to the other operations and so it would bloat the results.
I had to use unsigned long long for testIterations and some other variables because otherwise int would wrap around.
gl_ are global variables in the test.

The benefit of this algorithm is that it uses very basic constructs, so

other programmers with even a very basic understanding of programming or from other programming languages will quickly understand what it is doing.
it is very portable - it is easy to translate to other languages and operating systems.
regarding performance, is wysiwyg - what you see is what you get, so it is unlikely that there are big performance bottlenecks hidden in 3rd party library calls.

Now, I am running a reasonably grunty machine (i7 2600) so it took 1000000000 (1 billion) iterations to start getting results that took more than a second. In this case, it took on average 2400 milliseconds to do 1 billion iterations. I think that is pretty quick, but let's look at how we can improve on things. First let's see what we can tweak.

A tweak to your implementation

The arguments are (3,5), so initially distToA is 3 and distToB is 5. Note that 3 is smaller than 5. The first if will check if distToA > distToB: then elif distToB > distToA:. However, distToB (initially 5) is twice as likely to be greater than distToA (initially 3). For performance, you want the first if condition to be satisfied as often as possible in order to minimize the number of conditions that are checked in each iteration. In saying this I am making some assumptions about the compiler, but more on that later.

So, very simply, we can swap the ifs around. However, it is not that simple. The problem I found with this is that the compiler is doing some nice optimisations on the second if and last else. You see where you had the comment Arbitrarily, could be distToB? Well, the fact that you have gl_current += distToA; in the else if and gl_current += distToA in the else allowed the compiler to optimise this to one statement. So, in my case it is not arbitrary (for you it will depend on your compiler). So, we need to change the else to allow these optimizations to occur. The final code is:

static void ConditionalTestV2( int startA, int startB, unsigned long long testIterations )
{       
    gl_result = 0;      
    gl_current=0;
    int distToA = startA;
    int distToB = startB;

    for( unsigned long long i = 0; i < testIterations; i++ )
    {                       
        if( distToB > distToA ) //A comes first (where a is more likely to be first than b)
        {       
            gl_result = distToA;
            gl_current += distToA;
            distToB -= distToA;
            distToA = startA;                               
        }
        else if( distToA > distToB ) //B comes first
        {           
            gl_result = distToB;
            gl_current += distToB;
            distToA -= distToB;
            distToB = startB;               
        }
        else
        {       
            gl_result = distToB;   //Should be distToB for optimisations
            gl_current += distToB; //Should be distToB for optimisations
            distToA = startA;
            distToB = startB;
        }
    }      
}

Note that if( distToB > distToA ) now comes before else if( distToA > distToB ) and that the else now has gl_result = distToB and gl_current += distToB. With those changes in place, the time that the test took to run was: 2108 milliseconds. It is nice that those simple tweaks gave a 12% reduction in execution time.

The biggest lesson from this is to measure any change that you make for unintended consequences.

Your compiler and execution environment may differ from mine, so your results may vary. If you are going to start tweaking things at this level, I would suggest becoming familiar with assembler and stepping through the assembly at the critical points to determine how the conditions are actually being implemented. I am sure that there are other optimizations such as these that can be made. If you really get into it and are using GNU C++, there is something called __builtin_expect where you can guide the compiler about which branch to favour.

You may not always get the starting values in order, in which case you would need to weigh the cost of a one-off sort against the overall time of executing your algorithm.

Some other things to point out are: -

You maintain a variable current, but you do not use it. If you are not using current, then you can remove it. You might not see a performance gain if the compiler already optimised it out.
You have a range of 100, but the sequence repeats each time current advances by 3 * 5 = 15, which is every 7 outputs. So, you could either stop when current is 15 if that is all you need or you could store the results and then just write them out (see the patterns section).
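To make the "stop when current is 15" idea concrete, here is a small Python sketch that computes a single period of gaps. It assumes positive integer inputs, and the name `one_period` is made up for this illustration. It stops at the least common multiple, which equals a * b only when the two numbers share no common factor (as with 3 and 5).

```python
from math import gcd


def one_period(a, b):
    """Return the gaps between consecutive multiples of a or b over one period.

    Assumes a and b are positive integers. The period ends at lcm(a, b).
    """
    period_end = a * b // gcd(a, b)  # lcm(a, b)
    gaps = []
    current = 0
    dist_a, dist_b = a, b
    while current < period_end:
        step = min(dist_a, dist_b)
        gaps.append(step)
        current += step
        # "x or a" resets the distance to a when it has just reached 0.
        dist_a = dist_a - step or a
        dist_b = dist_b - step or b
    return gaps
```

For the inputs in the question, `one_period(3, 5)` gives the seven values (3, 2, 1, 3, 1, 2, 3), which sum to 15.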

Modulo

Looking at the algorithm, we are always getting the distance to a value, so one approach that springs to mind is modulo (there is already an answer to cover this). I am a bit suspicious of the performance because modulo tends to use division which is slower than your subtraction operations. Anyway, this is what I came up with:

static void ModuloTest( int startA, int startB, unsigned long long testIterations )
{   
    unsigned long long current = 0;
    unsigned long long prev = 0;
    int distToA = startA;
    int distToB = startB;

    for( long long i = 0; i < testIterations; i++ )
    {       
        current += (gl_result = FastMin(distToA - (current%distToA), distToB - (current%distToB)));
    }
}

The result was 23349 milliseconds. Almost 10 times slower than your original.

Now, I normally wouldn't write a line such as the one that has current += (gl..., but I was trying to reduce the number of assignments. This is generally a silly thing to do, because the compiler will optimise better than me and it is also more error prone. Still, this test was quite a bit slower and I wanted to make sure I gave it a good chance. It is a bit unfair to start pointing the finger at modulo straight away as the flow is a bit different, so maybe something else is to blame. So, I made an even simpler modulo test:

static void PureModuloTest( unsigned long long testIterations, unsigned long long mod )
{
    for(long long i = 1; i <= testIterations; i++)
    {
        gl_result = mod % i;
    }
}

where mod was 50000 and even in this case the test took 5 times longer than your test, so I think that modulo is out if we are looking for a pure performance gain. I also found some surprising inefficiencies with the stl min(), but to go into detail would make this long post even longer.

The next thing I did was look at the data. Sometimes if you can find characteristics / patterns in the data you can optimise your implementation accordingly.

Patterns

Looking at your data again, something that jumps out is that the differences repeat once the running total reaches a * b (more precisely, the least common multiple of a and b, which equals a * b when they share no common factor). So, in your test, once you get to 15 the distances will repeat. You are probably already aware of this, but in your code snippet you run the test for 100 cycles (for i in xrange(100)) so I wasn't sure.

One way to use this fact is to store the values until we get to a * b and then just reuse the values until we are done. Note that this is essentially a matter of using your algorithm to begin with and then iterating through a list from then on.

static void PatternTest( int startA, int startB, unsigned long long testIterations )
{
    int stop = startA * startB;
    list<int> resultList;
    int distToA = startA;
    int distToB = startB;   
    int val = 0;
    long long count = 0;
    while( val < stop  )
    {   
        if( distToB > distToA ) //A comes first (where a is more likely to be first than b)
        {       
            gl_result = distToA;                
            distToB -= distToA;
            distToA = startA;                               
        }
        else if( distToA > distToB ) //B comes first
        {           
            gl_result = distToB;                
            distToA -= distToB;
            distToB = startB;               
        }
        else
        {       
            gl_result = distToB;                
            distToA = startA;
            distToB = startB;
        }
        val += gl_result;
        resultList.push_back(gl_result);
        count++;
    }
    std::list<int>::const_iterator iter;
    while( count < testIterations )
    {
        for( iter = resultList.begin(); iter != resultList.end() && count < testIterations; iter++ )
        {
            gl_result = *iter;
            count++;
        }       
    }
}

This test took 1711 milliseconds, around 29% faster than the original and about 18% faster than the current best. I am not sure how applicable this is in your case, but it is an example of how analyzing the expected data can provide some good performance gains.

Thread Bonanza!

Now, this probably doesn't apply in your case since you are working with Arduino. But maybe threads will be supported in future or maybe you can farm the problem out to different processors. Either way, it would be unkind not to include a threading benchmark since this is what they live for. Also, my computer has 8 cores, 7 of which spend their time lazing about, so it is nice to give them a chance to run wild.

If your data or algorithm can be broken into independent discrete parts, then you could design your program so that it runs independent operations on separate threads. Now we know from before that the sequence repeats every a * b. So, we could start different threads at points n where '(n modulo (a * b)) == 0'.

But, we could do better and first get the values for the first a * b and then loop through the values on the separate threads. Which is what I have done here. I chose to run 4 threads.

struct BonanzaThreadInfo
{
    long long iterations;
    list<int> resultList;
    int result;
};

static void BonanzaTestThread( void* param )
{
    BonanzaThreadInfo* info = (BonanzaThreadInfo*)param;    

    std::list<int>::const_iterator iter;
    for( long long count = 0; count < info->iterations; )
    {
        for( iter = info->resultList.begin(); iter != info->resultList.end() && count < info->iterations; iter++ )
        {
            info->result = *iter;           
            count++;
        }   
    }
    delete info; //delete through the typed pointer; deleting a void* is undefined behaviour
}

static void ThreadBonanzaTest( int startA, int startB, unsigned long long testIterations )
{   
    int stop = startA * startB;
    list<int> resultList;
    int distToA = startA;
    int distToB = startB;   
    int val = 0;
    long long count = 0;
    while( val < stop  )
    {       
        if( distToB > distToA ) //A comes first (where a is more likely to be first than b)
        {       
            gl_result = distToA;                
            distToB -= distToA;
            distToA = startA;                               
        }
        else if( distToA > distToB ) //B comes first
        {           
            gl_result = distToB;                
            distToA -= distToB;
            distToB = startB;               
        }
        else
        {       
            gl_result = distToB;                
            distToA = startA;
            distToB = startB;
        }
        val += gl_result;
        resultList.push_back(gl_result);
        count++;
    }
    long long threadIterations = (testIterations - count) / NUMTHREADS;
    long long iterationsLeft = testIterations-count;
    thread* bonanzaThreads = new thread[NUMTHREADS];
    for( int i = 0; i < NUMTHREADS; i++ )
    {
        BonanzaThreadInfo* bonanzaThreadInfo = new BonanzaThreadInfo;
        if( i == (NUMTHREADS - 1) )
        {
            bonanzaThreadInfo->iterations = iterationsLeft;
        }
        else
        {
            iterationsLeft -= threadIterations;
            bonanzaThreadInfo->iterations = (threadIterations);
        }       
        bonanzaThreadInfo->resultList = resultList;
        bonanzaThreads[i] = thread(BonanzaTestThread,bonanzaThreadInfo);//http://stackoverflow.com/a/10662506/746754        
    }
    for( int i = 0; i < NUMTHREADS; i++ )
    {
        bonanzaThreads[i].join();
    }
    delete [] bonanzaThreads;
}

The result is that this took 574 milliseconds. A whopping saving of 76%! Some basic points to note about threads: -

The complexity and room for error increases dramatically.
If there is any shared resource between the threads, then that resource will need to be protected by a mutex. If threads frequently need the same protected resource at the same time, then all threads that need that resource will need to wait until it is available, which can result in very poor performance.

Here is a graph of where we are up to so far:

[Graph: benchmark results for the tests above]

Now, to your edit about multiple values.

Multiple Values

Well, as far as I can see, if you have multiple input values (a,b,c,d...) your if statements are going to become very nested and lengthy very quickly. if a < b && a < c && a < d...

We are generally trying to order the next values, so that is where I would start. My first thought is to store the values in some ordered data structure. I chose to use a set because a set is naturally ordered by a key (actually it is a multiset because we need to allow dupes). Inside the set, I put a struct (called ValuesStruct because I am very bad at names) that contains the value to increment by (a,b,c) as well as the next integer where this value will be the closest. The < operator is so that stl knows where to put this value in the set.

struct ValuesStruct
{
public:
    int Value;
    long long Next;
    ValuesStruct( int start )
    {
        Value = start;
        Next = start;
    }
    bool operator < (const ValuesStruct& rOther) const
    {
        return (Next < rOther.Next);
    }
private:
    ValuesStruct()
    {

    }
};

Then, all I need to do is iterate through the set. On each iteration, the front of the set will have the minimum next value. So I can calculate the current interval by subtracting the previous from this. Then, I just need a do..while() loop to remove this value from the set and add it back in with the updated Next value, so that it will take the appropriate position in the set. I need to do it for all values that had this as Next (e.g. as would be the case at 15 for your simple 3,5 example). I called the test MultiConditionalTest because here we need to check multiple comparison conditions and because I am so bad at names.

static void MultiConditionalTest( multiset<ValuesStruct>& values, unsigned long long testIterations )
{               
    unsigned long long prev = 0;
    for( unsigned long long i = 0; i < testIterations; i++ )
    {
        multiset<ValuesStruct>::iterator iter = values.begin();     
        gl_result = (*(iter)).Next - prev;
        prev = (*(iter)).Next;
        do //handle case where equal
        {
            ValuesStruct valuesStruct = *iter;
            values.erase(iter);
            valuesStruct.Next += valuesStruct.Value;
            values.insert( valuesStruct );
            iter = values.begin();
        }while( (*iter).Next == prev );
    }
}

The function is used as follows:

multiset<ValuesStruct> values;
values.insert(ValuesStruct(3));
values.insert(ValuesStruct(5));
values.insert(ValuesStruct(7));
values.insert(ValuesStruct(11));
MultiConditionalTest( values, testIterations );

As you can see, there is a lot going on here, so I expected a bit of a performance blowout and got it: 105156 milliseconds - about 50 times slower. This is still less than a microsecond per iteration, so again it depends on what you are aiming at. Since I just banged this up this evening without analyzing it much I am pretty sure there are performance optimizations that can be made. First, the set is normally implemented as a binary search tree. I would do some research and determine whether this is the best data structure for this problem. Also, when inserting a new value into the set a hint can be given as to where it will be placed. If we are clever about choosing the position then we might be able to speed this operation up. Also, as before, the sequence will repeat when we get to (a * b * c * d...), so we could store the values and then just write them out from then on. I would also look at the problem space and see if there is a way to optimise the algorithm, possibly ask about the mathematical sequence on math.stackexchange.com - those guys are pretty sharp.
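One candidate for a different data structure is a binary min-heap, since only the smallest Next value is ever needed. The following is a minimal Python sketch of that idea using the standard `heapq` module. It is a swapped-in technique rather than the multiset code above, it has not been benchmarked, and the name `multi_diffs` is made up for this example. It assumes a non-empty collection of positive integers.

```python
import heapq
from itertools import islice


def multi_diffs(values):
    """Yield gaps between consecutive multiples of any number in values.

    Assumes a non-empty iterable of positive integers.
    Each heap entry is a (next_multiple, step) pair.
    """
    heap = [(v, v) for v in values]
    heapq.heapify(heap)
    prev = 0
    while True:
        nxt = heap[0][0]  # smallest upcoming multiple
        yield nxt - prev
        prev = nxt
        # Advance every entry tied at nxt (e.g. both 3 and 5 at 15).
        while heap[0][0] == nxt:
            _, step = heap[0]
            heapq.heapreplace(heap, (nxt + step, step))


# Usage: first seven gaps for 3 and 5.
first_seven = list(islice(multi_diffs([3, 5]), 7))
```

Each step costs O(log k) for k input values, and the same generator handles two values or many without extra branches.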

Anyways, this is just an option, it may or may not work for you depending on your real performance requirements.

Some other thoughts:

How likely are you to get the same set of values (a,b,c,d...)? If this is likely, you may want to cache previous results. Then it would just be a matter of reading them from a cached array which would be very quick.
Another way to improve performance is to turn on compiler optimisations. How you do this and how effective it is depends on your compiler.
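The caching idea in the first point could look roughly like the Python sketch below, which memoises one full period per set of inputs with `functools.lru_cache`. The names `cached_period` and `diffs` and the cache size of 32 are assumptions for illustration only. It assumes positive integers, and the period can become very long for many coprime values, so memory use would need checking on a small device.

```python
from functools import lru_cache
from itertools import cycle, islice
from math import gcd


@lru_cache(maxsize=32)
def cached_period(values):
    """Compute one full period of gaps for a tuple of positive integers.

    values must be a tuple (hashable) so that lru_cache can key on it.
    """
    period_end = 1
    for v in values:
        period_end = period_end * v // gcd(period_end, v)  # running lcm
    gaps = []
    current = 0
    while current < period_end:
        # Smallest multiple of any value that is strictly greater than current.
        nxt = min((current // v + 1) * v for v in values)
        gaps.append(nxt - current)
        current = nxt
    return tuple(gaps)


def diffs(values):
    """Endless iterator that replays the cached period."""
    # Sorting makes (3, 5) and (5, 3) share one cache entry.
    return cycle(cached_period(tuple(sorted(values))))
```

A second call such as `diffs([5, 3])` after `diffs([3, 5])` would then reuse the stored period instead of recomputing it.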

Good luck.

Upvotes: 2

Tim Peters

Reputation: 70705

Unclear why you're unhappy with the code you have. If it's because there are "so many" if tests, it's easy enough to do it with none:

def diffgen(a, b):
    from itertools import cycle
    diffs = []
    current = 0
    ab = a*b
    while current < ab:
        nextone = min((current // a + 1) * a,
                      (current // b + 1) * b)
        diffs.append(nextone - current)
        yield nextone - current
        current = nextone
    for d in cycle(diffs):
        yield d

Note that once you reach a*b, the sequence of diffs repeats, so no more calculations are needed then.

Upvotes: 3

Matt Woelk

Reputation: 2038

Here's a way to do it using the modulo operation:

a = 3
b = 5
current = 0

def nearest_multiple_of_a_or_b_to_current(current, a, b):
    distance_to_a = (a - current%a)
    distance_to_b = (b - current%b)
    return current + min(distance_to_a, distance_to_b)

for i in range(100):
    next = nearest_multiple_of_a_or_b_to_current(current, a, b)
    print(next - current)
    current = next

Output:

Upvotes: 1
