mchen

Reputation: 10156

Performance bottleneck with CSV parser

My current parser is given below. Reading a ~10 MB CSV file into an STL vector takes ~30 seconds, which is too slow given that I have over 100 MB to read every time the program runs. Can anyone advise on how to improve performance? Would it be faster in plain C?

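#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Declared here because main() uses it before the definition below.
std::istream& operator >> (std::istream& ins, std::vector<double>& data);

int main() {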
    std::vector<double> data;
    std::ifstream infile( "data.csv" );
    infile >> data;
    std::cin.get();
    return 0;
}

std::istream& operator >> (std::istream& ins, std::vector<double>& data)
{
    data.clear();

    // Reserve data vector
    std::string line, field;
    std::getline(ins, line);
    std::stringstream ssl(line), ssf;

    std::size_t rows = 1, cols = 0;
    while (std::getline(ssl, field, ',')) cols++;
    while (std::getline(ins, line)) rows++;

    std::cout << rows << " x " << cols << "\n";

    ins.clear(); // clear bad state after eof
    ins.seekg(0);

    data.reserve(rows*cols);

    // Populate data
    double f = 0.0;
    while (std::getline(ins, line)) {
        ssl.str(line);
        ssl.clear();
        while (std::getline(ssl, field, ',')) {
            ssf.str(field);
            ssf.clear();
            ssf >> f;
            data.push_back(f);
        }
    }
    return ins;
}

NB: I also have OpenMP at my disposal, and the contents will eventually be used for GPGPU computation with CUDA.

Upvotes: 0

Views: 915

Answers (4)

xwlan

Reputation: 564

Apparently, stream-based file I/O is a bad idea here. Instead, map the whole file into memory and access the CSV as one contiguous block of virtual memory; this incurs only a few system calls.
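A rough sketch of what the mapping approach could look like on a POSIX system, in case it helps. The function name `read_csv_mmap` and the 63-character field limit are my own assumptions, not something from the question, and it assumes plain numeric fields with no quoting. Whether this actually beats buffered reads would need measuring, since the text-to-double conversion may well cost more than the I/O.

```cpp
#include <cstdlib>
#include <cstring>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: map `path` read-only and parse comma/newline separated
// numbers. Returns an empty vector if the file cannot be opened or mapped.
std::vector<double> read_csv_mmap(const char* path)
{
    std::vector<double> data;
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return data;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return data; }
    const std::size_t size = static_cast<std::size_t>(st.st_size);

    void* mapped = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (mapped == MAP_FAILED) return data;

    const char* p = static_cast<const char*>(mapped);
    const char* const end = p + size;
    char field[64];  // assumption: no field is longer than 63 characters
    while (p < end) {
        // Find the end of the current field.
        const char* q = p;
        while (q < end && *q != ',' && *q != '\n') ++q;
        const std::size_t len = static_cast<std::size_t>(q - p);
        if (len > 0 && len < sizeof field) {
            // The mapping is not null-terminated, so copy before strtod.
            std::memcpy(field, p, len);
            field[len] = '\0';
            char* stop = nullptr;
            const double value = std::strtod(field, &stop);
            if (stop != field) data.push_back(value);  // skip non-numeric fields
        }
        p = (q < end) ? q + 1 : end;  // step over the separator
    }
    munmap(mapped, size);
    return data;
}
```

Calling `read_csv_mmap("data.csv")` would return the values in row-major order. Empty or over-long fields are silently skipped in this sketch, which may or may not be what you want.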

Upvotes: -1

brian beuning

Reputation: 2862

On my machine, your reserve code takes about 1.1 seconds and your populate code takes 8.5 seconds.

Adding std::ios::sync_with_stdio(false); made no difference with my compiler.

The C code below takes 2.3 seconds.

int i = 0;
int j = 0;
while( true ) {
    float x;
    j = fscanf( file, "%f", & x );
    if( j != 1 ) break; // EOF, or a field that is not a number
    data[i++] = x;
    // skip ',' or '\n'
    int ch = getc(file);
}
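For anyone who wants to reproduce per-phase timings like the ones quoted above, here is a minimal sketch using `std::chrono`. The helper name `seconds_taken` is made up for illustration; it simply runs a callable once and reports the elapsed wall-clock time.

```cpp
#include <chrono>

// Hypothetical helper: run `work` once and return elapsed wall-clock seconds.
template <typename Work>
double seconds_taken(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

With the question's code, something like `seconds_taken([&] { infile >> data; })` would time the whole read. To time the reserve and populate phases separately, each phase would have to be wrapped on its own.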

Upvotes: 3

Olaf Dietsche

Reputation: 74078

You could halve the time by reading the file once instead of twice.

While presizing the vector is beneficial, it will never dominate the runtime, because I/O will always be slower by some orders of magnitude.

Another possible optimization is to read without a string stream. Something like this (untested):

int c = 0;
while (ins >> f) {
    data.push_back(f);
    if (++c < cols) {
        char comma;
        ins >> comma; // skip comma
    } else {
        c = 0; // end of line, start next line
    }
}

If you can omit the commas and separate the values by whitespace only, it could be even simpler:

while (ins >> f)
    data.push_back(f);

or

std::copy(std::istream_iterator<double>(ins), std::istream_iterator<double>(),
          std::back_inserter(data));

Upvotes: 5

Aasmund Eldhuset

Reputation: 37970

Try calling

std::ios::sync_with_stdio(false);

at the start of your program. This disables the (allegedly quite slow) synchronization between cin/cout and scanf/printf. I have never tried this myself, but I have often seen it recommended. Note that if you do this, you can no longer safely mix C++-style and C-style I/O in your program.

(In addition, Olaf Dietsche is completely right about only reading the file once.)

Upvotes: 2
