Adam Sikora

Reputation: 117

Optimizing function call from for loop

I have some simple functions

int f_0(int);
int f_1(int);
...
int f_n(int);

and then I have some for loops in which I call f_i(). The condition in these loops doesn't have to be the same:

for (int i = 0; i < n; i++) {
   ...
   if (condition) {
      int myInt = f_i(); // this is not real implementation but shows the result
                         // I want to achieve
      ... //edit
   }
...
}

Here are the ways I tried to implement this:

This is an elegant method, but in my case it is 4.4 times slower than breaking the loop down into separate calls. Constant pointers to functions yield similar results.

Is there a better way to implement this? The ideal solution would keep the code compact while letting the compiler unroll the loop so that the calculations run as fast as possible.

I'm using MSVC 2012 and building in release mode with optimizations set to maximize speed.

Edit:

Here is my testing code:

head.h

namespace c {
const int w = 1024;
const int A = w * w;
}

inline int f_0(int pos)  { return (pos - c::w + c::A) % c::A;           }
inline int f_1(int pos)  { return (pos + 1 - c::w + c::A) % c::A;       }
inline int f_2(int pos)  { return (pos + 1) % c::A;                     }
inline int f_3(int pos)  { return (pos + c::w) % c::A;                  }
inline int f_4(int pos)  { return (pos - 1 + c::w) % c::A;              }
inline int f_5(int pos)  { return (pos - 1 + c::A) % c::A;              }

typedef int (*NEIGH_F) (int);
typedef int (* const CNEIGH_F) (int);

const NEIGH_F  fptr[]  = { f_0, f_1, f_2, f_3, f_4, f_5 };
const CNEIGH_F cfptr[] = { f_0, f_1, f_2, f_3, f_4, f_5 };

inline int fswitch(int i, int pos) {
    switch(i) {
    case 0 : return f_0(pos); break;
    case 1 : return f_1(pos); break;
    case 2 : return f_2(pos); break;
    case 3 : return f_3(pos); break;
    case 4 : return f_4(pos); break;
    case 5 : return f_5(pos); break;
    default : return -1; break;
    }
}

main.cpp

#include "head.h"
#include <iostream>
#include <time.h>

int main()
{
    int maxRepeat = 100;

    clock_t startTime = clock();
    double sum = 0;
    for (int repeat = 0; repeat < maxRepeat; repeat++)
        for (int i = 0; i < c::A; i++) {
            sum += f_0(i);
            sum += f_1(i);
            sum += f_2(i);
            sum += f_3(i);
            sum += f_4(i);
            sum += f_5(i);
        }
    std::cout << "normal time:        " << (clock() - startTime)/(double)CLOCKS_PER_SEC
                 << "  sum is: " << sum << std::endl;

    startTime = clock();
    sum = 0;
    for (int repeat = 0; repeat < maxRepeat; repeat++)
        for (int i = 0; i < c::A; i++) {
            for (int j = 0; j < 6; j++)
                sum += fptr[j](i);
        }
    std::cout << "pointer time:       " << (clock() - startTime)/(double)CLOCKS_PER_SEC
                 << "  sum is: " << sum << std::endl;

    startTime = clock();
    sum = 0;
    for (int repeat = 0; repeat < maxRepeat; repeat++)
        for (int i = 0; i < c::A; i++) {
            for (int j = 0; j < 6; j++)
                sum += cfptr[j](i);
        }
    std::cout << "const pointer time: " << (clock() - startTime)/(double)CLOCKS_PER_SEC
                 << "  sum is: " << sum << std::endl;

    startTime = clock();
    sum = 0;
    for (int repeat = 0; repeat < maxRepeat; repeat++)
        for (int i = 0; i < c::A; i++) {
            for (int j = 0; j < 6; j++)
                sum += fswitch(j, i);
        }
    std::cout << "switch time:        " << (clock() - startTime)/(double)CLOCKS_PER_SEC
                 << "  sum is: " << sum << std::endl;
    std::cin.ignore();

    return 0;
}

The functions f_i are the ones I use in my real implementation, but the loops here are much simpler for testing purposes. In the real implementation there are several different loops of the form shown in the second code snippet in the question.

Edit2:

The form of my loop should stay the same. I just want to find the best way to put f_i into my loops.

Upvotes: 5

Views: 4112

Answers (4)

Iwillnotexist Idonotexist

Reputation: 13457

Are the f_i() functions and the A and w constants truly those given? Because if they are, isn't this problem trivially reducible to a table lookup, an addition and a bitwise AND?

/* Includes */
#include <stdio.h>
#include <time.h>


/* Constants */
const int w = 1024;
const int A = 1024*1024;
const int addconst[6] = {0xFFC00, 0xFFC01, 0x00001, 0x00400, 0x003FF, 0xFFFFF};
                      /*     A-w,   A-w+1,       1,       w,     w-1,     A-1 */

/* THE NOVELTY */
int ftable(int i, int pos){
    return (pos + addconst[i]) & 0xFFFFF;
}

/* Main */
int main(int argc, char* argv[]){
    clock_t timeTaken;
    int     repeat, maxRepeat = 100;
    int     i, j;
    long    sum = 0;

    timeTaken  = -clock();
    for(repeat=0;repeat<maxRepeat;repeat++)
        for(i=0;i<A;i++)
            for(j=0;j<6;j++)
                sum += ftable(j, i);
    timeTaken += clock();

    printf("Stop! Hammertime!        %f  sum is: %f\n",
           timeTaken/(double)CLOCKS_PER_SEC, (double)sum);
    return 0;
}

Please note that when the sum variable is a long, the time taken is:

Stop! Hammertime!        0.348295  sum is: 329853173760000.000000

while when it is a double, it takes more than twice as long:

Stop! Hammertime!        0.861563  sum is: 329853173760000.000000

My compile flags are:

gcc -O3 -funroll-loops -finline-functions tmp.c -o tmp

If you could explain some more how the function index depends on the loop index, I could optimize some more.
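The reduction above relies on A being a power of two (A = 2^20), so that x % A equals x & (A - 1) for any non-negative x. A minimal sketch that checks the table form against the question's f_0..f_5 over the whole range of pos follows; tableMatchesOriginals is a hypothetical helper name, not part of either listing:

```cpp
namespace c {
const int w = 1024;
const int A = w * w;  // 2^20, a power of two
}

// The six neighbour functions, copied from the question.
inline int f_0(int pos) { return (pos - c::w + c::A) % c::A; }
inline int f_1(int pos) { return (pos + 1 - c::w + c::A) % c::A; }
inline int f_2(int pos) { return (pos + 1) % c::A; }
inline int f_3(int pos) { return (pos + c::w) % c::A; }
inline int f_4(int pos) { return (pos - 1 + c::w) % c::A; }
inline int f_5(int pos) { return (pos - 1 + c::A) % c::A; }

typedef int (*NEIGH_F)(int);
const NEIGH_F fptr[] = { f_0, f_1, f_2, f_3, f_4, f_5 };

// Same offsets as the answer's addconst table, written symbolically.
const int addconst[6] = { c::A - c::w, c::A - c::w + 1, 1,
                          c::w,        c::w - 1,        c::A - 1 };

// Table lookup, addition and bitwise AND, as in the answer.
inline int ftable(int i, int pos) { return (pos + addconst[i]) & (c::A - 1); }

// Hypothetical helper: true when both forms agree for every pos in [0, A).
bool tableMatchesOriginals() {
    for (int pos = 0; pos < c::A; ++pos)
        for (int i = 0; i < 6; ++i)
            if (ftable(i, pos) != fptr[i](pos))
                return false;
    return true;
}
```

The equivalence is only checked for pos in [0, A). For negative pos, or if A were not a power of two, the % form and the & form would no longer agree.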

Upvotes: 1

Senti Bachcha

Reputation: 78

The following two tweaks radically change the results of your program (thanks for the clean, compiling code!). They demonstrate that performance optimization trades off what is known at build time against what is only known at run time: you can write more optimal code if you know which function you will be calling, or which target machine you will be running on.

Calling a function through a pointer gives you the flexibility to choose the function at run time, at the cost of the calls not being inlined. Changing the calls as shown below makes pointer time equal to normal time.

normal time:        1.36  sum is: 3.29853e+14
pointer time:       1.36  sum is: 3.29853e+14
const pointer time: 1.35  sum is: 3.29853e+14
switch time:        1.14  sum is: 3.29853e+14

The change was to unroll the function calls in the loop, thus:

   sum += fptr[0](i);
   sum += fptr[1](i);
   sum += fptr[2](i);
   sum += fptr[3](i);
   sum += fptr[4](i);
   sum += fptr[5](i);

fswitch() is faster than normal for the case you showed, perhaps because inlining inside fswitch() produces a set of instructions that stays cached. Maybe someone with the requisite expertise could demonstrate this with a disassembly of the generated executable. For my test, I enlarged the switch function a bit (doubling the number of switch branches by duplicating them, as shown below) and found that it runs roughly 4 times slower than normal:

normal time:        2.35  sum is: 6.59706e+14
pointer time:       2.35  sum is: 6.59706e+14
const pointer time: 2.34  sum is: 6.59706e+14
switch time:        9.61  sum is: 6.59706e+14

The changes were:

case 6 : return f_0(pos); break;
case 7 : return f_1(pos); break;
case 8 : return f_2(pos); break;
case 9 : return f_3(pos); break;
case 10 : return f_4(pos); break;
case 11 : return f_5(pos); break;

...

for (int j = 0; j < 12; j++)
    sum += fswitch(j, i);

...

const NEIGH_F  fptr[]  = { f_0, f_1, f_2, f_3, f_4, f_5, f_0, f_1, f_2, f_3, f_4, f_5 };
const CNEIGH_F cfptr[] = { f_0, f_1, f_2, f_3, f_4, f_5, f_0, f_1, f_2, f_3, f_4, f_5 };

...

for (int j = 0; j < 12; j++)
    sum += fptr[j](i);

...

etc.

Upvotes: 2

NicholasM

Reputation: 4673

I think Bryan Chen's template-based solution makes a lot of sense. It would be easier to maintain and understand. I upvoted that solution.

That said, if you wanted a more general solution without a switch statement, and you wanted to test all conditions in an "unrolled" way, you could use compile-time recursion with templates.

I did it with 3 functions, based on Condition functors that take a single integer argument. Obviously, you could make the conditions simpler, or more complicated, according to your needs.

The core of this involves a recursive template definition, plus a template specialization to stop the recursion:

template <int N>
struct Condition;  // provides bool operator()(int arg)

template <int N>
void f();

template <int N>
void applyFunctions(int arg);

// Specialization placed first for clarity
template <>
void applyFunctions<0>(int arg)
{
  if (Condition<0>()(arg))
  {
    f<0>();
  }
  // End recursion
}

template <int N>
void applyFunctions(int arg)
{
  if (Condition<N>()(arg))
  {
    f<N>();
  }

  applyFunctions<N - 1>(arg);
}

Here is some output. The phrases are printed in the condition checks, while the [f<i>] are printed within the function calls. I aligned the printed output for clarity.

Loop
j = 0:                       Is even. [f<1>]       Always true. [f<0>]
j = 1:                                             Always true. [f<0>]
j = 2:  Is prime. [f<2>]     Is even. [f<1>]       Always true. [f<0>]
j = 3:  Is prime. [f<2>]                           Always true. [f<0>]
j = 4:                       Is even. [f<1>]       Always true. [f<0>]
j = 5:  Is prime. [f<2>]                           Always true. [f<0>]
j = 6:                       Is even. [f<1>]       Always true. [f<0>]
j = 7:  Is prime. [f<2>]                           Always true. [f<0>]
j = 8:                       Is even. [f<1>]       Always true. [f<0>]
j = 9:                                             Always true. [f<0>]
j = 10:                      Is even. [f<1>]       Always true. [f<0>]

The full program is below. If you really wanted to do something cool, you could make the Condition struct have a member variable that is calculated in a constexpr way, so that the inclusion of the resulting code is determined at compile time. If that doesn't mean anything to you, you would probably want to read up on templates, template instantiation, and metaprogramming.

#include <iostream>
#include <iomanip>

static int fw = 20;

template <int N>
struct Condition;

template <int N>
void f();


// Specialization 0
template <>
struct Condition<0>
{
  bool operator() (int arg)
  {
    std::cout << std::setw(fw) << " Always true. ";
    return true;
  }
};

template <>
void f<0>()
{
  std::cout << "[f<0>]";
}

// Specialization 1
template <>
struct Condition<1>
{
  bool operator() (int arg)
  {
    bool isEven = (arg % 2 == 0);
    if (isEven)
      std::cout << std::setw(fw) << " Is even. ";
    else 
      std::cout << std::setw(fw) << " ";
    return isEven;
  }
};

template <>
void f<1>()
{
  std::cout << "[f<1>]";
}


// Specialization 2
template <>
struct Condition<2>
{
  bool operator() (int arg)
  {
    bool isPrime = (arg == 2 || arg == 3 || arg == 5 || arg == 7);
    if (isPrime)
      std::cout << std::setw(fw) << " Is prime. ";
    else 
      std::cout << std::setw(fw) << " ";
    return isPrime;
  }
};

template <>
void f<2>()
{
  std::cout<< "[f<2>]";
}


template <int N>
void applyFunctions(int arg);

template <>
void applyFunctions<0>(int arg)
{
  if (Condition<0>()(arg))
  {
    f<0>();
  }
  // End recursion
}

template <int N>
void applyFunctions(int arg)
{
  if (Condition<N>()(arg))
  {
    f<N>();
  }

  applyFunctions<N - 1>(arg);
}


int main()
{
  applyFunctions<2>(4);

  std::cout << std::endl << "Loop" << std::endl;
  for (int j = 0; j < 11; ++j)
  {
    std::cout << "j = " << j << ": ";
    applyFunctions<2>(j);
    std::cout << std::endl;
  }
}
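The constexpr variation mentioned before the full program could be sketched roughly as follows. This is only an illustration under assumptions: the rule "even N is enabled", the names Step, Apply and applyEnabled, and the arithmetic body of f<N> are all hypothetical, and f<N> returns a value instead of printing so the result can be checked. It needs a compiler with C++11 constexpr support.

```cpp
// Hypothetical compile-time condition: only even-numbered functions are enabled.
template <int N>
struct Condition {
  static constexpr bool value = (N % 2 == 0);
};

// Hypothetical stand-in for f<N>: contributes (N + 1) * arg.
template <int N>
int f(int arg) { return (N + 1) * arg; }

// Primary template: chosen when Condition<N>::value is true, so f<N> is called.
template <int N, bool Enabled = Condition<N>::value>
struct Step {
  static int run(int arg) { return f<N>(arg); }
};

// Specialization for disabled steps: no call to f<N> is generated at all.
template <int N>
struct Step<N, false> {
  static int run(int) { return 0; }
};

// Compile-time recursion over N, N - 1, ..., 0.
template <int N>
struct Apply {
  static int run(int arg) { return Step<N>::run(arg) + Apply<N - 1>::run(arg); }
};

// End of recursion.
template <>
struct Apply<0> {
  static int run(int arg) { return Step<0>::run(arg); }
};

// Runs the steps for N = 3..0; only N = 0 and N = 2 contribute.
int applyEnabled(int arg) { return Apply<3>::run(arg); }
```

Because the condition is a template argument rather than a run-time test, the disabled branches never instantiate f<N>, so there is nothing left for the optimizer to remove.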

Upvotes: 2

Bryan Chen

Reputation: 46578

You can use a function template instead of f_0, f_1, ...; it is nicer to maintain.

#include <cstdio>  // for printf

template <int N>
void f();

template <>
void f<0>()
{
    printf("f<0>");
}

template <>
void f<1>()
{
    printf("f<1>");
}

int main() {
    f<0>();
    f<1>();
    //f<2>(); // fails to build: f<2> is declared but never defined (link error)
    return 0;
}

However, the template argument must be a compile-time constant, so you can't write something like int i = 0; f<i>().
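A small sketch of what does and does not count as a compile-time constant here. The function callWithConstants is a hypothetical name, and f<N> returns N instead of printing so the calls can be checked:

```cpp
template <int N>
int f();

template <>
int f<0>() { return 0; }

template <>
int f<1>() { return 1; }

int callWithConstants() {
    const int i = 1;      // const with a constant initializer: usable as a template argument
    constexpr int j = 0;  // C++11 constexpr: also usable
    // int k = 1;
    // f<k>();            // would not compile: k is not a constant expression
    return f<i>() * 10 + f<j>();
}
```

The constexpr line needs C++11 support; the const int form also works with older compilers.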

To work around this, you can use a switch statement to call the function. It is not very pretty, but it works:

void call_f(int i)
{
    switch(i)
    {
        case 0:
            f<0>();
            break;
        case 1:
            f<1>();
            break;
        default:
            // invalid i, report error
            break;
    }
}

However, there is no compile-time check on i.

Putting it all together:
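A minimal sketch of how the pieces above might be combined for the question's neighbour functions. This is not the answer's original listing: SumUpTo and sumNeighbours are hypothetical names, and the bodies of f<0>..f<5> are copied from the question's f_0..f_5.

```cpp
namespace c {
const int w = 1024;
const int A = w * w;
}

// Function template replacing f_0..f_5; only the specializations are defined.
template <int N>
int f(int pos);

template <> inline int f<0>(int pos) { return (pos - c::w + c::A) % c::A; }
template <> inline int f<1>(int pos) { return (pos + 1 - c::w + c::A) % c::A; }
template <> inline int f<2>(int pos) { return (pos + 1) % c::A; }
template <> inline int f<3>(int pos) { return (pos + c::w) % c::A; }
template <> inline int f<4>(int pos) { return (pos - 1 + c::w) % c::A; }
template <> inline int f<5>(int pos) { return (pos - 1 + c::A) % c::A; }

// Run-time dispatch when the index is only known at run time.
inline int call_f(int i, int pos) {
    switch (i) {
        case 0: return f<0>(pos);
        case 1: return f<1>(pos);
        case 2: return f<2>(pos);
        case 3: return f<3>(pos);
        case 4: return f<4>(pos);
        case 5: return f<5>(pos);
        default: return -1;  // invalid i
    }
}

// Compile-time "loop": sums f<N>(pos) + f<N-1>(pos) + ... + f<0>(pos).
template <int N>
struct SumUpTo {
    static long long run(int pos) { return f<N>(pos) + SumUpTo<N - 1>::run(pos); }
};

template <>
struct SumUpTo<0> {
    static long long run(int pos) { return f<0>(pos); }
};

// Equivalent to the question's hand-unrolled "normal" loop body for one pos.
long long sumNeighbours(int pos) { return SumUpTo<5>::run(pos); }
```

With every index fixed at compile time, each f<N> call is a direct call that the compiler is free to inline, which is the property the hand-unrolled version in the question has. Whether a given compiler actually inlines it would need to be measured.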

Upvotes: 4
