Reputation: 3336
I am looking for a quick way to read the lines of a file into a vector of strings such that I can reserve the number of lines ahead of time. What is the best way to do this? Should I count the newline characters first, or just get the total size of the file and reserve, say, size / 80 as a rough estimate? Ideally I don't want the vector to have to reallocate each time, which would slow things down tremendously for a large file. Ideally I would count the number of items ahead of time, but should I do this by opening the file in binary mode, counting the newlines, and then reopening it? That seems wasteful, so I am curious to hear some thoughts on this. Also, is there a way to use emplace_back to get rid of the temporary somestring in the getline code below? I did see the following two implementations for counting the number of lines ahead of time: Fastest way to find the number of lines in a text (C++)
std::vector<std::string> vs;
std::string somestring;
std::ifstream somefile("somefilename");
while (std::getline(somefile, somestring))
    vs.push_back(somestring);
Also, I could do something like the following to get the total size ahead of time. Can I just transform the char* in this case into the vector directly? This goes back to my reserve hint of using size / 80, or some other constant, to give reserve an estimated size up front.
#include <iostream>
#include <fstream>
int main () {
    std::ifstream istr ("test.txt");
    if (istr)
    {
        std::streambuf * pbuf = istr.rdbuf();
        //which I can use as a reserve hint say size / 80
        std::streamsize size = pbuf->pubseekoff(0,istr.end);
        //maybe I can construct the vector from the char buf directly?
        pbuf->pubseekoff(0,istr.beg);
        char* contents = new char [size];
        pbuf->sgetn (contents,size);
        //release the buffer once done with it to avoid a leak
        delete[] contents;
    }
    return 0;
}
Upvotes: 0
Views: 514
Reputation: 42132
The strategy for reserving space in a std::vector is designed to "grow on demand". That is, you will not allocate one string at a time; you will first allocate one, then, say, ten, then one hundred, and so on (not exactly those numbers, but that's the idea). Because the capacity grows geometrically, push_back runs in amortized constant time. In other words, the implementation of std::vector::push_back already manages this for you.
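To make the "grow on demand" behaviour concrete, here is a minimal sketch that counts how often the capacity of a vector changes while strings are pushed without any reserve() call. The helper name count_reallocations is made up for illustration, and the exact count depends on the standard library implementation, so treat the number as indicative only.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: appends n short strings without calling reserve()
// and returns how many times the vector's capacity changed along the way.
std::size_t count_reallocations(std::size_t n) {
    std::vector<std::string> lines;
    std::size_t reallocations = 0;
    std::size_t last_capacity = lines.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        lines.push_back("line");
        // A change in capacity means the vector grew its storage.
        if (lines.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = lines.capacity();
        }
    }
    return reallocations;
}
```

Calling count_reallocations(65007) on a typical implementation should report a few dozen capacity changes at most, rather than one per line, which is why the cost of growing is spread thinly over all the push_back calls.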
Consider the following example: I am reading the entire text of War and Peace (65007 lines) using two versions, one that does not pre-allocate and one that does (i.e., one reserves zero space, and the other reserves the full 65007 lines; text from: http://www.gutenberg.org/cache/epub/2600/pg2600.txt).
#include<iostream>
#include<fstream>
#include<vector>
#include<string>
#include<boost/timer/timer.hpp>
void reader(size_t N=0) {
    std::string line;
    std::vector<std::string> lines;
    lines.reserve(N);
    std::ifstream fp("wp.txt");
    while(std::getline(fp, line)) {
        lines.push_back(line);
    }
    std::cout<<"read "<<lines.size()<<" lines"<<std::endl;
}
int main() {
    {
        std::cout<<"no pre-allocation"<<std::endl;
        boost::timer::auto_cpu_timer t;
        reader();
    }
    {
        std::cout<<"full pre-allocation"<<std::endl;
        boost::timer::auto_cpu_timer t;
        reader(65007);
    }
    return 0;
}
Results:
no pre-allocation
read 65007 lines
0.027796s wall, 0.020000s user + 0.000000s system = 0.020000s CPU (72.0%)
full pre-allocation
read 65007 lines
0.023914s wall, 0.020000s user + 0.010000s system = 0.030000s CPU (125.4%)
As you can see, for a non-trivial amount of text the difference is only about 4 milliseconds of wall time.
Do you really need to know the lines beforehand? Is it really a bottleneck? Are you saving, say, one second of Wall time but complicating your code ten-fold by preallocating the lines?
Upvotes: 1
Reputation: 597051
Rather than waste time counting the lines ahead of time, I would just reserve() an initial value, then start pushing the actual lines, and if you happen to push the reserved number of items then just reserve() some more space before continuing with more pushing, repeating as needed.
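A rough sketch of that approach might look like the following. The function name read_lines and the initial_guess parameter are hypothetical, and doubling the reserved space each time it runs out is my own assumption (a fixed increment would also match the description, but doubling keeps the total copying work linear). The sketch also moves each line into the vector instead of copying it, which addresses the temporary mentioned in the question without needing emplace_back.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>   // only needed for the in-memory usage mentioned below
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: reads every line from `in`.
// `initial_guess` is an assumed estimate of the line count,
// for example the file size divided by 80.
std::vector<std::string> read_lines(std::istream& in, std::size_t initial_guess) {
    std::vector<std::string> lines;
    lines.reserve(initial_guess);
    std::string line;
    while (std::getline(in, line)) {
        // Reserved space used up: reserve more before pushing again.
        // Doubling is an assumption made here, not part of the answer.
        if (lines.size() == lines.capacity()) {
            lines.reserve(lines.capacity() == 0 ? 16 : lines.capacity() * 2);
        }
        // Move rather than copy: the vector takes over the line's buffer.
        // std::getline erases `line` before the next read, so reusing the
        // moved-from string is safe.
        lines.push_back(std::move(line));
    }
    return lines;
}
```

The function accepts any std::istream, so it can be called with a std::ifstream for a real file or with a std::istringstream for a quick check. For instance, passing a stream that holds "a\nb\nc\n" with an initial_guess of 2 returns three strings.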
Upvotes: 1