Reputation: 627

Big csv file c++ parsing performance

I have a big csv file (25 mb) that represents a symmetric graph (about 18kX18k). While parsing it into an array of vectors, i have analyzed the code (with VS2012 ANALYZER) and it shows that the problem with the parsing efficiency (about 19 seconds total) occurs while reading each character (getline::basic_string::operator+=) as shown in the picture below: enter image description here

This leaves me frustrated, as with Java simple buffered line file reading and tokenizer i achieve it with less than half a second.

My code uses only STL library:

int allColumns = initFirstRow(file,secondRow);
// secondRow has initialized with one value
int column = 1; // dont forget, first column is 0
VertexSet* rows = new VertexSet[allColumns];
rows[1] = secondRow;
string vertexString;
long double vertexDouble;
for (int row = 1; row < allColumns; row ++){
    // dont do the last row
    for (; column < allColumns; column++){
        //dont do the last column
        getline(file,vertexString,','); 
        vertexDouble = stold(vertexString);
        if (vertexDouble > _TH){
            rows[row].add(column);
        }
    }
    // do the last in the column
    getline(file,vertexString);
    vertexDouble = stold(vertexString);
    if (vertexDouble > _TH){
        rows[row].add(++column);
    }
    column = 0;
}
initLastRow(file,rows[allColumns-1],allColumns);

init first and last row basically does the same thing as the loop above, but initFirstRow also counts the number of columns.

VertexSet is basically a vector of indexes (int). Each vertex read (separated by ',') goes no more than 7 characters length long (values are between -1 and 1).

Upvotes: 14

Answers (5)

dmr195

Reputation: 131

The inefficiency here is in Microsoft's implementation of std::getline, which is being used in two places in the code. The key problems with it are:

It reads from the stream one character at a time
It appends to the string one character at a time

The profile in the original post shows that the second of these problems is the biggest issue in this case.

I wrote more about the inefficiency of std::getline here.

GNU's implementation of std::getline, i.e. the version in libstdc++, is much better.

Sadly, if you want your program to be fast and you build it with Visual C++ you'll have to use lower level functions than std::getline.

Upvotes: 2

user4513421

Reputation:

Hmm good answer here. Took me a while but I had the same problem. After this fix my write and process time went from 38 sec to 6 sec. Here's what I did.

First get data using boost mmap. Then you can use boost thread to make processing faster on the const char* that boost mmap returns. Something like this: (the multithreading is different depending on your implementation so I excluded that part)

#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/thread/thread.hpp>
#include <boost/lockfree/queue.hpp>

foo(string path)
{
    boost::iostreams::mapped_file mmap(path,boost::iostreams::mapped_file::readonly);
    auto chars = mmap.const_data(); // set data to char array
    auto eofile = chars + mmap.size(); // used to detect end of file
    string next = ""; // used to read in chars
    vector<double> data; // store the data
    for (; chars && chars != eofile; chars++) {
        if (chars[0] == ',' || chars[0] == '\n') { // end of value
            data.push_back(atof(next.c_str())); // add value
            next = ""; // clear
        }
        else
            next += chars[0]; // add to read string
    }
}

Upvotes: 0

Jerry Coffin

Reputation: 490408

At 25 megabytes, I'm going to guess that your file is machine generated. As such, you (probably) don't need to worry about things like verifying the format (e.g., that every comma is in place).

Given the shape of the file (i.e., each line is quite long) you probably won't impose a lot of overhead by putting each line into a stringstream to parse out the numbers.

Based on those two facts, I'd at least consider writing a ctype facet that treats commas as whitespace, then imbuing the stringstream with a locale using that facet to make it easy to parse out the numbers. Overall code length would be a little greater, but each part of the code would end up pretty simple:

#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <time.h>
#include <stdlib.h>
#include <locale>
#include <sstream>
#include <algorithm>
#include <iterator>

class my_ctype : public std::ctype<char> {
    std::vector<mask> my_table;
public:
    my_ctype(size_t refs=0):
        my_table(table_size),
        std::ctype<char>(my_table.data(), false, refs) 
    {
        std::copy_n(classic_table(), table_size, my_table.data());
        my_table[',']=(mask)space;
    }
};

template <class T>
class converter {
    std::stringstream buffer;
    my_ctype *m;
    std::locale l;
public:
    converter() : m(new my_ctype), l(std::locale::classic(), m) { buffer.imbue(l); }

    std::vector<T> operator()(std::string const &in) {
        buffer.clear();
        buffer<<in;
        return std::vector<T> {std::istream_iterator<T>(buffer),
            std::istream_iterator<T>()};        
    }
};

int main() {
    std::ifstream in("somefile.csv");
    std::vector<std::vector<double>> numbers;

    std::string line;
    converter<double> cvt;

    clock_t start=clock();
    while (std::getline(in, line))
        numbers.push_back(cvt(line));
    clock_t stop=clock();
    std::cout<<double(stop-start)/CLOCKS_PER_SEC << " seconds\n";
}

To test this, I generated an 1.8K x 1.8K CSV file of pseudo-random doubles like this:

#include <iostream>
#include <stdlib.h>

int main() {
    for (int i=0; i<1800; i++) {
        for (int j=0; j<1800; j++)
            std::cout<<rand()/double(RAND_MAX)<<",";
        std::cout << "\n";
    }
}

This produced a file around 27 megabytes. After compiling the reading/parsing code with gcc (g++ -O2 trash9.cpp), a quick test on my laptop showed it running in about 0.18 to 0.19 seconds. It never seems to use (even close to) all of one CPU core, indicating that it's I/O bound, so on a desktop/server machine (with a faster hard drive) I'd expect it to run faster still.

Upvotes: 5

Jerem

Reputation: 1862

The debug Runtime Library in VS is very slow because it does a lot of debug checks (for out of bound accesses and things like that) and calls lots of very small functions that are not inlined when you compile in Debug.

Running your program in release should remove all these overheads.

My bet on the next bottleneck is string allocation.

Upvotes: 1

antoshka

Reputation: 1

I would try read bigger chunks of memory at once and then parse it all. Like.. read full line. and then parse this line using pointers and specialized functions.

Upvotes: 0

Big csv file c++ parsing performance

Answers (5)

Related Questions