Reputation: 627
I have a big csv file (25 mb) that represents a symmetric graph (about 18kX18k). While parsing it into an array of vectors, i have analyzed the code (with VS2012 ANALYZER) and it shows that the problem with the parsing efficiency (about 19 seconds total) occurs while reading each character (getline::basic_string::operator+=) as shown in the picture below:
This leaves me frustrated, as with Java simple buffered line file reading and tokenizer i achieve it with less than half a second.
My code uses only STL library:
int allColumns = initFirstRow(file,secondRow);
// secondRow has initialized with one value
int column = 1; // dont forget, first column is 0
VertexSet* rows = new VertexSet[allColumns];
rows[1] = secondRow;
string vertexString;
long double vertexDouble;
for (int row = 1; row < allColumns; row ++){
// dont do the last row
for (; column < allColumns; column++){
//dont do the last column
getline(file,vertexString,',');
vertexDouble = stold(vertexString);
if (vertexDouble > _TH){
rows[row].add(column);
}
}
// do the last in the column
getline(file,vertexString);
vertexDouble = stold(vertexString);
if (vertexDouble > _TH){
rows[row].add(++column);
}
column = 0;
}
initLastRow(file,rows[allColumns-1],allColumns);
init first and last row basically does the same thing as the loop above, but initFirstRow also counts the number of columns.
VertexSet
is basically a vector of indexes (int). Each vertex read (separated by ',') goes no more than 7 characters length long (values are between -1 and 1).
Upvotes: 14
Views: 5465
Reputation: 131
The inefficiency here is in Microsoft's implementation of std::getline
, which is being used in two places in the code. The key problems with it are:
The profile in the original post shows that the second of these problems is the biggest issue in this case.
I wrote more about the inefficiency of std::getline
here.
GNU's implementation of std::getline
, i.e. the version in libstdc++, is much better.
Sadly, if you want your program to be fast and you build it with Visual C++ you'll have to use lower level functions than std::getline
.
Upvotes: 2
Reputation:
Hmm good answer here. Took me a while but I had the same problem. After this fix my write and process time went from 38 sec to 6 sec. Here's what I did.
First get data using boost mmap. Then you can use boost thread to make processing faster on the const char* that boost mmap returns. Something like this: (the multithreading is different depending on your implementation so I excluded that part)
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/thread/thread.hpp>
#include <boost/lockfree/queue.hpp>
foo(string path)
{
boost::iostreams::mapped_file mmap(path,boost::iostreams::mapped_file::readonly);
auto chars = mmap.const_data(); // set data to char array
auto eofile = chars + mmap.size(); // used to detect end of file
string next = ""; // used to read in chars
vector<double> data; // store the data
for (; chars && chars != eofile; chars++) {
if (chars[0] == ',' || chars[0] == '\n') { // end of value
data.push_back(atof(next.c_str())); // add value
next = ""; // clear
}
else
next += chars[0]; // add to read string
}
}
Upvotes: 0
Reputation: 490408
At 25 megabytes, I'm going to guess that your file is machine generated. As such, you (probably) don't need to worry about things like verifying the format (e.g., that every comma is in place).
Given the shape of the file (i.e., each line is quite long) you probably won't impose a lot of overhead by putting each line into a stringstream
to parse out the numbers.
Based on those two facts, I'd at least consider writing a ctype facet that treats commas as whitespace, then imbuing the stringstream with a locale using that facet to make it easy to parse out the numbers. Overall code length would be a little greater, but each part of the code would end up pretty simple:
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <time.h>
#include <stdlib.h>
#include <locale>
#include <sstream>
#include <algorithm>
#include <iterator>
class my_ctype : public std::ctype<char> {
std::vector<mask> my_table;
public:
my_ctype(size_t refs=0):
my_table(table_size),
std::ctype<char>(my_table.data(), false, refs)
{
std::copy_n(classic_table(), table_size, my_table.data());
my_table[',']=(mask)space;
}
};
template <class T>
class converter {
std::stringstream buffer;
my_ctype *m;
std::locale l;
public:
converter() : m(new my_ctype), l(std::locale::classic(), m) { buffer.imbue(l); }
std::vector<T> operator()(std::string const &in) {
buffer.clear();
buffer<<in;
return std::vector<T> {std::istream_iterator<T>(buffer),
std::istream_iterator<T>()};
}
};
int main() {
std::ifstream in("somefile.csv");
std::vector<std::vector<double>> numbers;
std::string line;
converter<double> cvt;
clock_t start=clock();
while (std::getline(in, line))
numbers.push_back(cvt(line));
clock_t stop=clock();
std::cout<<double(stop-start)/CLOCKS_PER_SEC << " seconds\n";
}
To test this, I generated an 1.8K x 1.8K CSV file of pseudo-random doubles like this:
#include <iostream>
#include <stdlib.h>
int main() {
for (int i=0; i<1800; i++) {
for (int j=0; j<1800; j++)
std::cout<<rand()/double(RAND_MAX)<<",";
std::cout << "\n";
}
}
This produced a file around 27 megabytes. After compiling the reading/parsing code with gcc (g++ -O2 trash9.cpp
), a quick test on my laptop showed it running in about 0.18 to 0.19 seconds. It never seems to use (even close to) all of one CPU core, indicating that it's I/O bound, so on a desktop/server machine (with a faster hard drive) I'd expect it to run faster still.
Upvotes: 5
Reputation: 1862
The debug Runtime Library in VS is very slow because it does a lot of debug checks (for out of bound accesses and things like that) and calls lots of very small functions that are not inlined when you compile in Debug.
Running your program in release should remove all these overheads.
My bet on the next bottleneck is string allocation.
Upvotes: 1
Reputation: 1
I would try read bigger chunks of memory at once and then parse it all. Like.. read full line. and then parse this line using pointers and specialized functions.
Upvotes: 0