Reputation: 2086
I was dealing with some performance issues which I discussed in this question: Super Slow C++ For Loop
I have a simple program I wrote to parse binary data. I tested it locally on 2 computers.
1. Dual 6 core 2.4GHz Xeon V3, 64GB RAM, NVMe SSD
2. Dual 4 core 3.5GHz Xeon V3, 64GB RAM, NVMe SSD
Here is some of the code (the rest is on Wandbox: https://wandbox.org/permlink/VIvardJNAMKzSbMf):
string HexRow = "";
for (int i = b; i < HexLineLength + b; i++) {
    HexRow += incomingData[i];
}
std::vector<unsigned char> BufferedLine = HexToBytes(HexRow);
stopwatch<> sw;
for (int i = 0; 80 >= i; ++i)
{
    Byte ColumnBytes;
    for (auto it = columns["data"][i].begin(); it != columns["data"][i].end(); ++it)
    {
        try {
            if (it.key() == "Column") { ColumnBytes.Column = it.value().get<std::string>(); }
            else if (it.key() == "DataType") { ColumnBytes.DataType = it.value().get<std::string>(); }
            else if (it.key() == "StartingPosition") { ColumnBytes.StartingPosition = it.value().get<int>(); }
            else if (it.key() == "ColumnWidth") { ColumnBytes.ColumnWidth = it.value().get<int>(); }
        }
        catch (...) {}
    }
    char* locale = setlocale(LC_ALL, "UTF-8");
    std::vector<unsigned char> CurrentColumnBytes(ColumnBytes.ColumnWidth);
    int arraySize = CurrentColumnBytes.size();
    for (int C = ColumnBytes.StartingPosition; C < ColumnBytes.ColumnWidth + ColumnBytes.StartingPosition; ++C)
    {
        int Index = C - ColumnBytes.StartingPosition;
        CurrentColumnBytes[Index] = BufferedLine[C - 1];
    }
}
std::cout << "Elapsed: " << duration_cast<double>(sw.elapsed()) << '\n';
Compiling on PC 1 with Visual Studio using the following flags:
/O2 /JMC /permissive- /MP /GS /analyze- /W3 /Zc:wchar_t /ZI /Gm- /sdl /Zc:inline /fp:precise /D "_CRT_SECURE_NO_WARNINGS" /D "_MBCS" /errorReport:prompt /WX- /Zc:forScope /Gd /Oy- /MDd /std:c++17 /FC /Fa"Debug\" /EHsc /nologo /Fo"Debug\" /Fp"Debug\Project1.pch" /diagnostics:column
Output:
Elapsed: 0.0913771
Elapsed: 0.0419886
Elapsed: 0.042406
Using Clang with the following: clang main.cpp -O3
outputs:
Elapsed: 0.036262
Elapsed: 0.0174264
Elapsed: 0.0170038
Compiling with GCC 8.1.0 from MinGW (i686-posix-dwarf-rev0, built by the MinGW-W64 project), using gcc main.cpp -lstdc++ -O3, gives the following times:
Elapsed: 0.019841
Elapsed: 0.0099643
Elapsed: 0.0094552
On PC 2 with Visual Studio, still with /O2, I get:
Elapsed: 0.054841
Elapsed: 0.03543
Elapsed: 0.034552
I didn't do Clang and GCC on PC 2, but the improvement wasn't significant enough to resolve my concerns.
The issue is that the exact same code on Wandbox (https://wandbox.org/permlink/VIvardJNAMKzSbMf) executes 10-80 times faster:
Elapsed: 0.00115457
Elapsed: 0.000815412
Elapsed: 0.000814636
Wandbox is using GCC 10.0.0 and C++14. I realize it is likely running on Linux, and I couldn't find any way to get GCC 10 on Windows, so I can't test compiling with that version locally.
This is a rewrite of a C# application I wrote, which operates so much faster:
Elapsed: 0.017424
Elapsed: 0.0006065
Elapsed: 0.000733
Elapsed: 0.0006166
Elapsed: 0.0004699
Finished Parsing: 100 Records. Elapsed :0.0082796 at a rate of : 12076/s
The C# method looks like this:
Stopwatch sw = new Stopwatch();
sw.Start();
foreach (dynamic item in TableData.data) // TableData is a JSON file with the structure definition
{
    string DataType = item.DataType;
    int startingPosition = item.StartingPosition;
    int width = Convert.ToInt32(item.ColumnWidth);
    if (width + startingPosition >= FullLineLength)
    {
        continue;
    }
    byte[] currentColumnBytes = currentLineBytes.Skip(startingPosition).Take(width).ToArray();
    // ..... 200 extra lines of processing into ints, dates, strings ......
    // ..... Even with the extra work, it operates at 1200+ records per second ......
}
sw.Stop();
var seconds = sw.Elapsed.TotalSeconds;
sw.Reset();
Console.WriteLine("Elapsed: " + seconds);
TempTable.Rows.Add(dataRow);
When I started this, I expected huge performance gains from moving the code from C# to unmanaged C++. This is my first C++ project, and frankly I am a bit discouraged about where I am. What can be done to speed up this C++? Do I need different datatypes, malloc, more or fewer structs?
It needs to run on Windows; is there a way to get GCC 10 to work on Windows?
What suggestions do you have for an aspiring C++ Developer?
Upvotes: 0
Views: 333
Reputation: 2086
Ok, so I was able to get C++ processing the file at around 50,000 rows per second with 80 columns per row. I reworked the entire workflow so it never has to backtrack. I first read the entire file into ByteArray, then went over it line by line, moving data from one array to another rather than assigning each byte individually in a for loop. I then used a map to store the results.
stopwatch<> sw;
while (CurrentLine < TotalLines)
{
    int BufferOffset = CurrentLine * LineLength;
    std::move(ByteArray + BufferOffset, ByteArray + BufferOffset + LineLength, LineByteArray);
    for (int i = 0; TotalColumns > i + 1; ++i)
    {
        int ThisStartingPosition = StartingPosition[i];
        int ThisWidth = ColumnWidths[i];
        // A vector frees its buffer each iteration; the earlier
        // `new uint8_t[ThisWidth]` was never deleted and leaked one
        // allocation per column.
        std::vector<std::uint8_t> CurrentColumnBytes(ThisWidth);
        std::move(LineByteArray + ThisStartingPosition,
                  LineByteArray + ThisStartingPosition + ThisWidth,
                  CurrentColumnBytes.data());
        ResultMap[CurrentLine][i] = Format(CurrentColumnBytes.data(), ThisWidth, DataType[i]);
    }
    CurrentLine++;
}
std::cout << "Processed " << CurrentLine << " lines in: " << duration_cast<double>(sw.elapsed()) << '\n';
I am still a little disappointed: the Boost Gregorian calendar conversion is unavailable when compiling with Clang, and using the standard MS compiler makes it nearly 20x slower. With Clang -O3 it was processing 10,700 records in 0.25 seconds, including all the int and string conversions. I will just have to write my own date conversion.
Upvotes: 0
Reputation: 785
It really depends on the commands being executed in assembler/machine code. VS has never been great at C++, and for many years Borland kicked their arses for both efficiency and reliability; then Borland sold their IDE & C++ branch off as a separate company.
It also depends on how you have programmed the process to occur in C++, can you please edit to show that code?
The advantage of C# is that it is managed and may apply a higher-level interpretation of your code. In the background the JIT may convert the whole line to the parsed format and then let the for loop break the chunks off (one step, looped), whereas C++ follows your commands more literally even when they are less efficient, i.e. it breaks off the chunk you are looking at and then converts that chunk to the parsed format (two steps, looped).
So, using the above example: even if we assume C#'s single combined step is somewhat slower than either individual C++ command, the C++ code runs two commands on every loop iteration while the C# code runs only one, so any per-command inefficiency compounds across the loop.
ALSO +1 to doug in the comments above: reference vs value can make a pretty big difference, especially when you are dealing with large datasets. I think his answer is the most likely explanation for large differences.
Simplification is the answer I believe:
std::string byteString = hex.substr(i, 2);
unsigned char byte = (unsigned char) strtol(byteString.c_str(), NULL, 16);
Could become
unsigned char byte = (unsigned char) strtol(hex.substr(i, 2).c_str(), NULL, 16);
and remove a minor memory allocation. But again, if you can convert the entire source to a byte stream first and then run the for loop on that, you remove the conversion step from the loop.
Upvotes: 1