Progger

Reputation: 27

C++ running slower than VB6

This is my first post on Stack Overflow.

I usually develop in VB6 but have recently started doing more coding in C++, using the Dev-C++ IDE with the g++ compiler.

I'm having a problem with general program execution speeds.

This old VB6 code runs in 20 seconds.

DefLng A-Z

Private Sub Form_Load()

    Dim n(10000, 10) As Long
    Dim c(10000, 10) As Long

    For d = 1 To 1000000
        For dd = 1 To 10000
            n(dd, 1) = c(dd, 2) + c(dd, 3)
        Next
    Next

    MsgBox "Done"

End Sub

This C++ code takes 57 seconds...

#include <cstdlib>   // for system() and EXIT_SUCCESS

int main(int argc, char *argv[]) {

    long n[10000][10];
    long c[10000][10];

    for (long d=1;d<1000000;d++){
        for (long dd=1;dd<10000;dd++){
            n[dd][1]=c[dd][2]+c[dd][3];
        }
    }

    system("PAUSE");
    return EXIT_SUCCESS;
}

Most of the coding I do is AI related and very heavy on array usage. I've tried using int rather than long, and I've tried different machines; the C++ always runs at least three times slower.

Am I being dumb? Can anyone explain what I'm doing wrong?

Cheers.

Upvotes: 0

Views: 300

Answers (2)

tunglt

Reputation: 1072

UPDATE: Recent C++ compilers give much better results compared to the VB6 compiler.

The VB6 code shown above does not reflect a realistic case, where the access index can vary and calculations can be broken out into separate functions. Further testing shows that VB6 pays a huge optimization penalty when the arrays are passed to a function as input (by reference).

I tried several C++ compilers and rewrote the benchmark code, adding some runtime-dependent behavior to defeat the optimizer. The code could certainly be improved; all suggestions are welcome. I used the -O3 option to maximize speed; multithreading was not used.

RESULT:

g++ 8.1 gives the best overall result: 5 seconds with constant index access (the column index is known to the compiler, as in the OP's code) and 6 seconds with dynamic index access. VC 2017 takes only 2 seconds for constant index access but 35 seconds for the dynamic case.

VB6 S1: original version, n and c are local variables. VB6 S2: the inner loop was moved into a function, with n and c passed in by reference. Test conditions: Intel Xeon W3690 / Windows 10.

Here are the test results:

[Image: table of execution times, in seconds]

The rewritten code:

VB6:

DefLng A-Z
Private Declare Function GetTickCount Lib "kernel32" () As Long

Public Sub cal_var(ByRef n() As Long, ByRef c() As Long, id As Long)

        For dd = 0 To 10000
            n(dd, id) = c(dd, id + 1) + c(dd, id + 2)
        Next

End Sub

Public Sub cal_const(ByRef n() As Long, ByRef c() As Long)
        For dd = 0 To 10000
            n(dd, 1) = c(dd, 2) + c(dd, 3)
        Next
End Sub


Private Sub Form_Load()

    Dim n(10001, 10) As Long
    Dim c(10001, 10) As Long

    Dim t0 As Long
    Dim t1 As Long
    Dim t2 As Long

    Dim id As Long
    Dim ret As Long

    t0 = GetTickCount

    For d = 1 To 1000000
        id = d And 7
        Call cal_var(n, c, id) 'For VB S2

        'For dd = 0 To 10000
        '    n(dd, id + 0) = c(dd, id + 1) + c(dd, id + 2)
        'Next

    Next

    t1 = GetTickCount

    For d = 1 To 1000000
        Call cal_const(n, c)  'For VB S2

        'For dd = 0 To 10000
        '    n(dd, 1) = c(dd, 2) + c(dd, 3)
        'Next
    Next

    t2 = GetTickCount

    For d = 0 To 10000
        Sum = Sum + n(d, t0 And 7)
    Next

    MsgBox "Done in " & (t1 - t0) & " and " & (t2 - t1) & " miliseconds"

End Sub

C++ Code:

#include <iostream>
#include <time.h>
#include <string.h>

using namespace std;

#define NUM_ITERATION 1000000
#define ARR_SIZE 10000

typedef long long_arr[ARR_SIZE][10];

void calc_var(long_arr &n, long_arr &c, long id) {
    for (long dd = 0; dd < ARR_SIZE; dd++) {
        n[dd][id] = c[dd][id + 1] + c[dd][id + 2];
    }
}

void calc_const(long_arr &n, long_arr &c) {
    for (long dd = 0; dd < ARR_SIZE; dd++) {
        n[dd][0] = c[dd][1] + c[dd][2];
    }
}

int main()
{
    cout << "start ..." << endl;
    time_t t0 = time(NULL);

    long_arr n;
    long_arr c;

    memset(n, 0, sizeof(n));
    // Fill c with a value known only at run time (t0 & 1), so the compiler
    // cannot fold the loops below away.
    memset(c, t0 & 1, sizeof(c));

    for (long d = 1; d < NUM_ITERATION; d++) {
        calc_var(n, c, (long)t0 & 7);   // column index not known at compile time
    }
    }

    time_t t1 = time(NULL);

    for (long d = 1; d < NUM_ITERATION; d++) {
        calc_const(n, c);
    }

    time_t t2 = time(NULL);

    // Consume the results so the optimizer cannot discard the work above.
    long sum = 0;
    for (int i = 0; i < ARR_SIZE; i++) {
        sum += n[i][t0 & 7];
    }

    cout << "by dynamic index: finished in " << (t1 - t0) << " seconds" << endl;
    cout << "by const index: finished in " << (t2 - t1) << " seconds" << endl;
    cout << "with final result : " << sum << endl;

    return 0;
}

The original answer below was based only on tests done with VC2008, in the VB6 S1 case.

ORIGINAL ANSWER:

I get the same result as Progger. I use the following code to keep the compiler from optimizing the loops away (leaving an empty body):

#include <string.h>   // for memset

int main(int argc, char *argv[]) {

    long n[10000][10];
    long c[10000][10];
    long sum = 0;

    memset(n, 0, sizeof(n) );
    memset(c, 0, sizeof(c) );

    for (long d=1;d<1000000;d++){
        for (long dd=1;dd<10000;dd++){
            n[dd][1]=c[dd][2]+c[dd][3];   
        }
    }

    for (long dd=1;dd<10000;dd++){
        sum += n[dd][1];
    }

    return sum; 
}

I used Visual Studio C++ 2008 to compile the code. It turns out that the compiler recomputes the address of local array elements over and over, mixing in imul (multiplication) instructions, which can be very costly. For example:

.text:00401065                 mov     edx, [ebp+var_C3510]
.text:0040106B                 imul    edx, 28h
.text:0040106E                 mov     eax, [ebp+var_C3510]
.text:00401074                 imul    eax, 28h
.text:00401077                 mov     ecx, [ebp+edx+var_C3500]
.text:0040107E                 add     ecx, [ebp+eax+var_C34FC]
.text:00401085                 mov     edx, [ebp+var_C3510]
.text:0040108B                 imul    edx, 28h
.text:0040108E                 mov     [ebp+edx+var_61A7C], ecx

Each indexed access costs one multiplication and one addition.

EDIT 1: The problem is related to multidimensional array access performance in C++. More information can be found here:

Array Lookups

C++ doesn't provide multidimensional arrays so scientific and engineering applications either write their own or use one of the available array libraries such as Eigen, Armadillo, uBLAS, or Boost.MultiArray. High quality C++ array libraries can provide generally excellent performance but yet simple element lookup speed can still lag that of Fortran. One reason is that Fortran arrays are built in to the language so its compilers can figure out array index strides in loops and avoid computing the memory offset from the indexes for each element. We can't change the C++ compiler so the next best solution is to offer a linear, or flat, 1D indexing operator for multidimensional arrays that can be used to improve performance of hot spot loops.

The VB version uses the traditional approach (an optimized flat 1D memory layout); the instructions used are only register adds (eax, edx, esi, ...), which are much faster.

.text:00401AD4                 mov     eax, [ebp-54h]
.text:00401AD7                 mov     ecx, [eax+esi*4+13888h]
.text:00401ADE                 mov     edx, [eax+esi*4+1D4CCh]
.text:00401AE5                 add     ecx, edx
.text:00401AE7                 mov     edx, [ebp-2Ch]
.text:00401AEA                 mov     eax, edi

That answers the question of speed. Advice: use existing libraries (for example, uBLAS) where possible. More discussion can be found here.
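
As a minimal sketch of the flat-indexing idea from the quoted article (the Flat2D type and its member names are illustrative, not from any particular library), a 2D array can be stored in one contiguous buffer and the row offset hoisted out of the hot loop:

#include <vector>
#include <cstddef>

// Sketch of a flat, row-major 2D array of long.
struct Flat2D {
    std::size_t rows, cols;
    std::vector<long> data;
    Flat2D(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0) {}
    long& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
};

void add_columns(Flat2D& n, const Flat2D& c) {
    for (std::size_t row = 0; row < n.rows; ++row) {
        // One offset computation per row instead of one multiply per element.
        const std::size_t base = row * c.cols;
        n.data[row * n.cols + 1] = c.data[base + 2] + c.data[base + 3];
    }
}

int main() {
    Flat2D n(10000, 10), c(10000, 10);
    for (int d = 0; d < 1000000; ++d) add_columns(n, c);
    return static_cast<int>(n.at(1, 1));   // consume the result
}

With optimization enabled, the compiler can usually strength-reduce the remaining multiply into simple pointer increments.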

Here is the machine code of the C++ version:

.text:00401000 ; int __cdecl main(int argc, const char **argv, const char **envp)
.text:00401000 _main           proc near               ; CODE XREF: ___tmainCRTStartup+F6p
.text:00401000
.text:00401000 var_C3514       = dword ptr -0C3514h
.text:00401000 var_C3510       = dword ptr -0C3510h
.text:00401000 var_C350C       = dword ptr -0C350Ch
.text:00401000 var_C3500       = dword ptr -0C3500h
.text:00401000 var_C34FC       = dword ptr -0C34FCh
.text:00401000 var_61A84       = dword ptr -61A84h
.text:00401000 var_61A7C       = dword ptr -61A7Ch
.text:00401000 argc            = dword ptr  8
.text:00401000 argv            = dword ptr  0Ch
.text:00401000 envp            = dword ptr  10h
.text:00401000
.text:00401000                 push    ebp
.text:00401001                 mov     ebp, esp
.text:00401003                 mov     eax, 0C3514h
.text:00401008                 call    __alloca_probe
.text:0040100D                 mov     [ebp+var_61A84], 0
.text:00401017                 mov     [ebp+var_C350C], 1
.text:00401021                 jmp     short loc_401032
.text:00401023 ; ---------------------------------------------------------------------------
.text:00401023
.text:00401023 loc_401023:                             ; CODE XREF: _main:loc_401097j
.text:00401023                 mov     eax, [ebp+var_C350C]
.text:00401029                 add     eax, 1
.text:0040102C                 mov     [ebp+var_C350C], eax
.text:00401032
.text:00401032 loc_401032:                             ; CODE XREF: _main+21j
.text:00401032                 cmp     [ebp+var_C350C], 0F4240h
.text:0040103C                 jge     short loc_401099
.text:0040103E                 mov     [ebp+var_C3510], 1
.text:00401048                 jmp     short loc_401059
.text:0040104A ; ---------------------------------------------------------------------------
.text:0040104A
.text:0040104A loc_40104A:                             ; CODE XREF: _main+95j
.text:0040104A                 mov     ecx, [ebp+var_C3510]
.text:00401050                 add     ecx, 1
.text:00401053                 mov     [ebp+var_C3510], ecx
.text:00401059
.text:00401059 loc_401059:                             ; CODE XREF: _main+48j
.text:00401059                 cmp     [ebp+var_C3510], 2710h
.text:00401063                 jge     short loc_401097
.text:00401065                 mov     edx, [ebp+var_C3510]
.text:0040106B                 imul    edx, 28h
.text:0040106E                 mov     eax, [ebp+var_C3510]
.text:00401074                 imul    eax, 28h
.text:00401077                 mov     ecx, [ebp+edx+var_C3500]
.text:0040107E                 add     ecx, [ebp+eax+var_C34FC]
.text:00401085                 mov     edx, [ebp+var_C3510]
.text:0040108B                 imul    edx, 28h
.text:0040108E                 mov     [ebp+edx+var_61A7C], ecx
.text:00401095                 jmp     short loc_40104A
.text:00401097 ; ---------------------------------------------------------------------------
.text:00401097
.text:00401097 loc_401097:                             ; CODE XREF: _main+63j
.text:00401097                 jmp     short loc_401023
.text:00401099 ; ---------------------------------------------------------------------------
.text:00401099
.text:00401099 loc_401099:                             ; CODE XREF: _main+3Cj
.text:00401099                 mov     [ebp+var_C3514], 1
.text:004010A3                 jmp     short loc_4010B4
.text:004010A5 ; ---------------------------------------------------------------------------
.text:004010A5
.text:004010A5 loc_4010A5:                             ; CODE XREF: _main+DCj
.text:004010A5                 mov     eax, [ebp+var_C3514]
.text:004010AB                 add     eax, 1
.text:004010AE                 mov     [ebp+var_C3514], eax
.text:004010B4
.text:004010B4 loc_4010B4:                             ; CODE XREF: _main+A3j
.text:004010B4                 cmp     [ebp+var_C3514], 2710h
.text:004010BE                 jge     short loc_4010DE
.text:004010C0                 mov     ecx, [ebp+var_C3514]
.text:004010C6                 imul    ecx, 28h
.text:004010C9                 mov     edx, [ebp+var_61A84]
.text:004010CF                 add     edx, [ebp+ecx+var_61A7C]
.text:004010D6                 mov     [ebp+var_61A84], edx
.text:004010DC                 jmp     short loc_4010A5
.text:004010DE ; ---------------------------------------------------------------------------
.text:004010DE
.text:004010DE loc_4010DE:                             ; CODE XREF: _main+BEj
.text:004010DE                 mov     eax, [ebp+var_61A84]
.text:004010E4                 mov     esp, ebp
.text:004010E6                 pop     ebp
.text:004010E7                 retn
.text:004010E7 _main           endp

The VB code is longer, so I post only the main loop here:

.text:00401A81 loc_401A81:                             ; CODE XREF: .text:00401B18j
.text:00401A81                 mov     ecx, [ebp-68h]
.text:00401A84                 mov     eax, 0F4240h
.text:00401A89                 cmp     ecx, eax
.text:00401A8B                 jg      loc_401B1D
.text:00401A91                 mov     edx, [ebp-18h]
.text:00401A94                 mov     edi, 1
.text:00401A99                 add     edx, 1
.text:00401A9C                 mov     esi, edi
.text:00401A9E                 jo      loc_401CEC
.text:00401AA4                 mov     [ebp-18h], edx
.text:00401AA7
.text:00401AA7 loc_401AA7:                             ; CODE XREF: .text:00401B03j
.text:00401AA7                 mov     eax, 2710h
.text:00401AAC                 cmp     esi, eax
.text:00401AAE                 jg      short loc_401B05
.text:00401AB0                 mov     ebx, ds:__vbaGenerateBoundsError
.text:00401AB6                 cmp     esi, 2711h
.text:00401ABC                 jb      short loc_401AC0
.text:00401ABE                 call    ebx ; __vbaGenerateBoundsError
.text:00401AC0 ; ---------------------------------------------------------------------------
.text:00401AC0
.text:00401AC0 loc_401AC0:                             ; CODE XREF: .text:00401ABCj
.text:00401AC0                 cmp     esi, 2711h
.text:00401AC6                 jb      short loc_401AD4
.text:00401AC8                 call    ebx ; __vbaGenerateBoundsError
.text:00401ACA ; ---------------------------------------------------------------------------
.text:00401ACA                 cmp     esi, 2711h
.text:00401AD0                 jb      short loc_401AD4
.text:00401AD2                 call    ebx
.text:00401AD4
.text:00401AD4 loc_401AD4:                             ; CODE XREF: .text:00401AC6j
.text:00401AD4                                         ; .text:00401AD0j
.text:00401AD4                 mov     eax, [ebp-54h]
.text:00401AD7                 mov     ecx, [eax+esi*4+13888h]
.text:00401ADE                 mov     edx, [eax+esi*4+1D4CCh]
.text:00401AE5                 add     ecx, edx
.text:00401AE7                 mov     edx, [ebp-2Ch]
.text:00401AEA                 mov     eax, edi
.text:00401AEC                 jo      loc_401CEC
.text:00401AF2                 add     eax, esi
.text:00401AF4                 mov     [edx+esi*4+9C44h], ecx
.text:00401AFB                 jo      loc_401CEC
.text:00401B01                 mov     esi, eax
.text:00401B03                 jmp     short loc_401AA7
.text:00401B05 ; ---------------------------------------------------------------------------
.text:00401B05
.text:00401B05 loc_401B05:                             ; CODE XREF: .text:00401AAEj
.text:00401B05                 mov     ecx, [ebp-68h]
.text:00401B08                 mov     eax, 1
.text:00401B0D                 add     eax, ecx
.text:00401B0F                 jo      loc_401CEC
.text:00401B15                 mov     [ebp-68h], eax
.text:00401B18                 jmp     loc_401A81
.text:00401B1D ; ---------------------------------------------------------------------------
.text:00401B1D
.text:00401B1D loc_401B1D:                             ; CODE XREF: .text:00401A8Bj
.text:00401B1D                 mov     edi, 1
.text:00401B22                 mov     ebx, 2710h
.text:00401B27                 mov     esi, edi
.text:00401B29
.text:00401B29 loc_401B29:                             ; CODE XREF: .text:00401B5Fj
.text:00401B29                 cmp     esi, ebx
.text:00401B2B                 jg      short loc_401B61
.text:00401B2D                 cmp     esi, 2711h
.text:00401B33                 jb      short loc_401B3B
.text:00401B35                 call    ds:__vbaGenerateBoundsError
.text:00401B3B ; ---------------------------------------------------------------------------
.text:00401B3B
.text:00401B3B loc_401B3B:                             ; CODE XREF: .text:00401B33j
.text:00401B3B                 mov     ecx, [ebp-2Ch]
.text:00401B3E                 mov     eax, [ebp-40h]
.text:00401B41                 mov     edx, [ecx+esi*4+9C44h]
.text:00401B48                 add     edx, eax
.text:00401B4A                 mov     eax, edi
.text:00401B4C                 jo      loc_401CEC
.text:00401B52                 add     eax, esi
.text:00401B54                 mov     [ebp-40h], edx
.text:00401B57                 jo      loc_401CEC
.text:00401B5D                 mov     esi, eax
.text:00401B5F                 jmp     short loc_401B29
.text:00401B61 ; ---------------------------------------------------------------------------
.text:00401B61
.text:00401B61 loc_401B61:                             ; CODE XREF: .text:00401B2Bj
.text:00401B61                 mov     ebx, ds:__vbaStrI4
.text:00401B67                 mov     ecx, 80020004h
.text:00401B6C                 mov     [ebp-0B4h], ecx
.text:00401B72                 mov     [ebp-0A4h], ecx
.text:00401B78                 mov     [ebp-94h], ecx
.text:00401B7E                 mov     ecx, [ebp-40h]
.text:00401B81                 mov     eax, 0Ah
.text:00401B86                 push    offset aDone    ; "Done : "

Upvotes: 0

Andrei Diea

Reputation: 56

Short answer

You need to look into your compiler's optimization settings. This resource might help.
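
For example, with g++ on the command line (a.cpp is just a placeholder file name; Dev-C++ can pass the same flag through its compiler options dialog), an optimized build looks something like:

g++ -O2 a.cpp -o a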

Takeaway: C++ allows you to use many tricks, some generic and some dependent on your architecture; when used properly, it will be superior to VB in terms of performance.

Long answer

Keep in mind that this is highly dependent on your architecture, compiler, and compiler settings. You should configure your compiler to do more aggressive optimization. You should also write optimized code that takes memory access into account, uses the CPU cache wisely, and so on, as the sketch below illustrates.
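
As one illustration of the memory-access point (a sketch based on the OP's loop, not code from this answer): with the n[10000][10] layout, the inner loop touches only every 10th long, so much of each cache line goes unused. Swapping the dimensions makes the hot loop walk contiguous memory:

#include <cstring>

// Dimensions swapped relative to the OP's arrays so that consecutive dd
// values touch adjacent memory (unit stride) in the inner loop.
static long n[10][10000];
static long c[10][10000];

int main() {
    std::memset(c, 1, sizeof(c));            // give c some defined contents
    for (long d = 1; d < 1000000; d++) {
        for (long dd = 1; dd < 10000; dd++) {
            n[1][dd] = c[2][dd] + c[3][dd];  // contiguous, cache-friendly access
        }
    }
    return static_cast<int>(n[1][1]);        // consume the result
}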

I have done a test for you on an Ubuntu 16.04 virtual machine using one core of an Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz. Using the code below, here are my times depending on the optimization level; the compiler was g++ 5.4.0.

I'm using optimization levels 0, 1, 2, 3, and s, and obtain 36 s (completely unoptimized), about 2.3 s, and then... zero.

osboxes@osboxes:~/test$ g++ a.cpp -O0 -o a0
osboxes@osboxes:~/test$ ./a0
start
..finished in 36174855 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O1 -o a1
osboxes@osboxes:~/test$ ./a1
start
..finished in 2352767 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O2 -o a2
osboxes@osboxes:~/test$ ./a2
start
..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3
start
..finished in 0 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as
start
..finished in 0 micro seconds

Note that at the more aggressive optimization levels the compiler eliminates the code completely, because the values in n[] are never used by the program. To force the compiler to generate the code, use the volatile keyword when declaring n.

With volatile added, you now get ~12 s with the most aggressive optimization (on my machine):

osboxes@osboxes:~/test$ g++ a.cpp -O3 -o a3
osboxes@osboxes:~/test$ ./a3
start
..finished in 12139348 micro seconds
osboxes@osboxes:~/test$ g++ a.cpp -Os -o as
osboxes@osboxes:~/test$ ./as
start
..finished in 12493927 micro seconds

The code I used for the test (based on your example):

#include <iostream>
#include <sys/time.h>
using namespace std;

typedef unsigned long long u64;

u64 timestamp()
{
  struct timeval now;
  gettimeofday(&now, NULL);
  return now.tv_usec + (u64)now.tv_sec*1000000;
}

int main()
{
  cout<<"start"<<endl;
  u64 t0 = timestamp();

  volatile long n[10000][10];
  long c[10000][10];

  for(long d=1;d<1000000;d++)
  {
    for(long dd=1;dd<10000;dd++)
    {
      n[dd][1]=c[dd][2]+c[dd][3];
    }
  }

  u64 t1 = timestamp();

  cout<<"..finished in "<< (t1-t0) << " micro seconds\n";

  return 0;
}

Multithreading

I have converted your code to use multithreading; using 2 threads I am able to cut the time in half.

I am exploiting the fact that, as written, the results are not used, so the inner loop does not depend on the outer one. In a real program you should find another way to split the work so that the results do not overwrite one another.

#include <iostream>
#include <sys/time.h>
#include <omp.h>
using namespace std;

typedef unsigned long long u64;

u64 timestamp()
{
  struct timeval now;
  gettimeofday(&now, NULL);
  return now.tv_usec + (u64)now.tv_sec*1000000;
}

int main()
{
  omp_set_num_threads(2);
#pragma omp parallel
  {
    // empty parallel region: presumably spins up the thread team before timing starts
  }

  cout<<"start"<<endl;
  u64 t0 = timestamp();

  volatile long n[10000][10];
  long c[10000][10];

  for(long d=1;d<1000000;d++)
  {
#pragma omp parallel for
    for(long dd=1;dd<10000;dd++)
    {
      n[dd][1]=c[dd][2]+c[dd][3];
    }
  }

  u64 t1 = timestamp();
  cout<<"..finished in "<< (t1-t0) << " micro seconds\n";
  return 0;
}

osboxes@osboxes:~/test$ g++ a.cpp -O3 -fopenmp -o a3
osboxes@osboxes:~/test$ ./a3
start
..finished in 6673741 micro seconds

Upvotes: 3
