theotherphil

Reputation: 667

Quickly loading numerical data in C++

I'm working on a feature detection program that uses statistical models of what various sorts of landmarks in images look like. The model uses around 100 different landmarks, and the relevant data for each landmark consists of 16 matrices of doubles, each of size about 160x160.

I'm currently using one text file for each landmark, and storing the values from each matrix as a space-separated line. To read the data, I read a line at a time from each file and pass it to a function that generates a stringstream from the line and then reads the values of the matrix from this stream one at a time.

On my computer this takes about 90 seconds to load the ~40 million doubles the model uses. There must surely be a far quicker way of doing this, but I've not found anything useful from googling, and I've got no experience with this sort of thing.

I'd be very grateful for any suggestions.

Edit: Loki asked me to post code, so I've shown it below. loadFromFile is called once for each landmark. The first line of each landmark file states how many levels the model uses for this landmark (each level uses four matrices; there are four levels by default). It's a horrible mess, but I'm not sure why this is so spectacularly slow.

void loadFromFile(string filename)
{
    ifstream modelData(filename.c_str(), ifstream::in);
    string line;    
    getline(modelData,line);
    int numberOfLevels = atoi(line.c_str());

    for(size_t ii = 0; ii < numberOfLevels; ++ii)
        readProfileStats(modelData);        

    modelData.close();              
}

void readProfileStats(ifstream& fileStream)
{
    string line;
    getline(fileStream, line);
    Vector meanProfile = readMatrixFromString(line);

    getline(fileStream, line);
    Matrix principalComponents = readMatrixFromString(line);

    getline(fileStream, line);  
    Matrix covarianceMatrixInverse = readMatrixFromString(line);

    m_statsLevels.push_back(ProfileStats(meanProfile, principalComponents, covarianceMatrixInverse));
}

Matrix readMatrixFromString(const string& line)
{
    stringstream stream(line);

    size_t numRows;
    size_t numCols; 

    stream >> numRows;  
    stream >> numCols;      

    Matrix matrix(numRows,numCols);

    for(int ii = 0; ii < numRows; ++ii)
    {                                       
        for(int jj = 0; jj < numCols; ++jj)             
            stream >> matrix(ii,jj);                                    
    }                                                       

    return matrix;                      
}

Upvotes: 3

Views: 683

Answers (2)

Loki Astari

Reputation: 264331

Try using the scanf family of functions:

r1.cpp

> cat r1.cpp 

#include <iostream>
int main()
{
    double x;
    long   count = 0;
    while(std::cin >> x)
    {
        ++count;
    }
    std::cout << count << ": " << x << "\n";
}

r2.cpp

> cat r2.cpp 

#include <iostream>
#include <stdio.h>

int main()
{
    double x;
    long   count = 0;
    while(fscanf(stdin, "%lf", &x) == 1)
    {
        ++count;
    }
    std::cout << count << ": " << x << "\n";
}

Results Serial

> g++ -O3 r1.cpp -o r1
> time (cat t | ./r1)
40000000: 9.36e+08

real    0m57.669s
user    0m56.992s
sys     0m1.688s
> g++ -O3 r2.cpp -o r2
> time (cat t | ./r2)
40000000: 9.36e+08

real    0m14.419s
user    0m13.897s
sys     0m1.352s

So it took longer than I expected: about 60 seconds to read 40,000,000 numbers using iostream, but only about 15 seconds using scanf. So scanf is about 4 times faster.
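For a line-oriented format like the one in the question, a similar C-library approach can be sketched with strtod instead of a stringstream per line. This is only a sketch: parseDoubles is a hypothetical helper name, not something from the question's code, and I have not timed it against the programs above. In the question's format the first two values returned would be the row and column counts.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: parses every whitespace-separated double in `line`.
// strtod skips leading whitespace and reports where it stopped via `end`.
std::vector<double> parseDoubles(const std::string& line)
{
    std::vector<double> values;
    const char* cursor = line.c_str();
    char* end = nullptr;
    for (;;)
    {
        double value = std::strtod(cursor, &end);
        if (end == cursor)   // nothing consumed: end of line or non-numeric text
            break;
        values.push_back(value);
        cursor = end;        // continue right after the number just parsed
    }
    return values;
}
```

Parsing stops silently at the first token that is not a number, so a real loader would probably want to check that the count of values matches numRows * numCols.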

I did the same thing, but writing the binary value of the doubles to the file.
Note that you have to write them as binary, and of course you lose all type safety and portability.

double x;
std::cout.write((char*)&x, sizeof(x));

r1b.cpp

> cat r1b.cpp 

#include <iostream>
int main()
{
    double x;
    long   count = 0;
    while(std::cin.read((char*)&x, sizeof(double)))
    {
        ++count;
    }
    std::cout << count << ": " << x << "\n";
}

r2b.cpp

> cat r2b.cpp 

#include <iostream>
#include <stdio.h>

int main()
{
    double x;
    long   count = 0;
    while(fread(&x, sizeof(double), 1, stdin) == 1)
    {
        ++count;
    }
    std::cout << count << ": " << x << "\n";
}

Result binary

> time (cat t2 | ./r1b )
40000000: 9.36e+08

real    0m3.930s
user    0m3.592s
sys     0m0.984s
> time (cat t2 | ./r2b )
40000000: 9.36e+08

real    0m2.110s
user    0m1.840s
sys     0m0.804s
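Both binary programs above still make one library call per double. If the number of values is known up front (for example from the matrix dimensions), a possible next step is to read a whole block with a single fread. A minimal sketch, where readDoubles is a hypothetical helper and I have not measured it:

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Hypothetical helper: reads up to `count` raw doubles from `file` in one call.
// The returned vector is shorter than `count` if the file ran out early.
std::vector<double> readDoubles(std::FILE* file, std::size_t count)
{
    assert(file != nullptr);
    std::vector<double> values(count);
    std::size_t got = 0;
    if (count > 0)
        got = std::fread(values.data(), sizeof(double), count, file);
    values.resize(got);   // keep only what was actually read
    return values;
}
```

The file has to be opened in binary mode ("rb"), and the same portability caveat applies: the bytes are only meaningful on a machine with the same double representation and byte order as the one that wrote them.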

Upvotes: 1

frankc

Reputation: 11473

As has been suggested in a comment, the issue here is that the data must be converted from text into numeric values. That can be entirely eliminated by storing the data in a binary format. There are libraries that can handle this, like HDF5. There are a lot of advantages to using a popular library like this, as you get an entire pre-built toolchain as well as support for many other languages besides C++. However, the downside is that there is going to be a good bit of work up front to learn how to use these systems.

If this is a one-off research project, I recommend you strongly consider a different, simpler approach: once your structure is created the first time, simply fwrite or mmap your data structure into a disk file. Then, create a function that will fread or mmap that binary file directly into your data structure. Give your program the option of calling that binary-loading function instead of the parsing function. You will see a significant speedup from doing things this way.
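A minimal sketch of the fwrite/fread variant, under some assumptions: DenseMatrix is a hypothetical stand-in for the question's Matrix class (whose layout isn't shown), and the file layout (two 64-bit dimensions followed by the raw values) is just one choice I made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the question's Matrix type: row-major doubles.
struct DenseMatrix
{
    std::uint64_t rows = 0;
    std::uint64_t cols = 0;
    std::vector<double> data;   // rows * cols values
};

// Writes the dimensions followed by the raw values. Returns false on I/O error.
bool saveMatrix(const DenseMatrix& m, const char* path)
{
    assert(m.data.size() == m.rows * m.cols);
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    bool ok = std::fwrite(&m.rows, sizeof m.rows, 1, f) == 1
           && std::fwrite(&m.cols, sizeof m.cols, 1, f) == 1
           && std::fwrite(m.data.data(), sizeof(double), m.data.size(), f) == m.data.size();
    return std::fclose(f) == 0 && ok;   // fclose runs even if a write failed
}

// Reads a file written by saveMatrix back into `m`. Returns false on I/O error.
bool loadMatrix(DenseMatrix& m, const char* path)
{
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    bool ok = std::fread(&m.rows, sizeof m.rows, 1, f) == 1
           && std::fread(&m.cols, sizeof m.cols, 1, f) == 1;
    if (ok)
    {
        m.data.resize(m.rows * m.cols);   // trusts the header; fine for your own files
        ok = std::fread(m.data.data(), sizeof(double), m.data.size(), f) == m.data.size();
    }
    std::fclose(f);
    return ok;
}
```

The idea would be to run the text parser once, call saveMatrix for each matrix, and call loadMatrix on later runs. The file is only valid on machines with the same byte order and double format as the one that wrote it, which is the trade-off compared with something like HDF5.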

Upvotes: 1
