Reputation: 713
I have a big file, nearly 800 MB, and I want to read it line by line.
At first I wrote my program in Python, using linecache.getlines:
lines = linecache.getlines(fname)
It takes about 1.2 s.
Now I want to port my program to C++.
I wrote this code:
std::ifstream DATA(fname);
std::string line;
std::vector<std::string> lines;
while (std::getline(DATA, line)) {
    lines.push_back(line);
}
But it's slow (it takes minutes). How can I improve it?
mmap(), and on Windows CreateFileMapping(), will work.

My code runs under VS2013: in "DEBUG" mode it takes 162 seconds; in "RELEASE" mode, only 7 seconds!

(Great thanks to @DietmarKühl and @Andrew.)
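For reference, a rough POSIX sketch of the mmap() approach (illustrative only; on Windows the analogue is CreateFileMapping()/MapViewOfFile(), and the line-counting loop just stands in for real processing):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    const char* fname = "input.txt";  // illustrative file name
    int fd = open(fname, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat sb;
    if (fstat(fd, &sb) < 0) { perror("fstat"); return 1; }

    // Map the whole file read-only; the kernel pages it in on demand,
    // so there are no per-line reads or allocations.
    char* data = static_cast<char*>(
        mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // Scan the mapping directly, e.g. to count lines.
    size_t nlines = 0;
    for (off_t i = 0; i < sb.st_size; ++i)
        if (data[i] == '\n') ++nlines;
    printf("%zu lines\n", nlines);

    munmap(data, sb.st_size);
    close(fd);
    return 0;
}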
Upvotes: 0
Views: 2124
Reputation: 54869
First of all, you should probably make sure you are compiling with optimizations enabled. This might not matter for such a simple algorithm, but that really depends on your vector/string library implementations.
As suggested by @angew, std::ios_base::sync_with_stdio(false) makes a big difference on routines like the one you have written.
Another, lesser, optimization would be to use lines.reserve() to preallocate your vector so that push_back() doesn't result in huge copy operations. However, this is most useful if you happen to know in advance approximately how many lines you are likely to receive.
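Concretely, the loop from the question with both suggestions applied might look like this (a sketch; the reserve() figure is only an illustrative guess at the line count):

#include <fstream>
#include <string>
#include <vector>

int main()
{
    // Decouple C++ streams from C stdio, as suggested above.
    std::ios_base::sync_with_stdio(false);

    std::ifstream DATA("input.txt");
    std::string line;
    std::vector<std::string> lines;

    // Illustrative guess: an 800 MB file with ~80-byte lines holds
    // roughly 10 million lines.
    lines.reserve(10000000);

    while (std::getline(DATA, line))
        lines.push_back(line);
}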
Using the optimizations suggested above, I get the following results for reading an 800MB text stream:
20 seconds ## if average line length = 10 characters
3 seconds ## if average line length = 100 characters
1 second ## if average line length = 1000 characters
As you can see, the speed is dominated by per-line overhead. This overhead is primarily occurring inside the std::string class.

It is likely that any approach based on storing a large quantity of std::string will be suboptimal in terms of memory allocation overhead. On a 64-bit system, std::string will require a minimum of 16 bytes of overhead per string. In fact, it is very possible that the overhead will be significantly greater than that -- and you could find that memory allocation (inside of std::string) becomes a significant bottleneck.
For optimal memory use and performance, consider writing your own routine that reads the file in large blocks rather than using getline(). Then you could apply something similar to the flyweight pattern to manage the indexing of the individual lines using a custom string class.
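A minimal sketch of that idea, assuming C++17 for std::string_view (the LineIndex type and indexFile() helper are made up for illustration): read the whole file into one buffer, then store cheap non-owning views instead of one std::string per line.

#include <fstream>
#include <string>
#include <string_view>
#include <vector>

// One big buffer plus non-owning views of each line.
struct LineIndex {
    std::string data;                     // the whole file, read once
    std::vector<std::string_view> lines;  // flyweights pointing into data
};

LineIndex indexFile(const char* fname)
{
    LineIndex idx;

    // Slurp the file in one block instead of per-line getline() calls.
    std::ifstream in(fname, std::ios::binary | std::ios::ate);
    idx.data.resize(static_cast<size_t>(in.tellg()));
    in.seekg(0);
    in.read(&idx.data[0], static_cast<std::streamsize>(idx.data.size()));

    // Record one view per line; no per-line heap allocation.
    size_t start = 0;
    while (start < idx.data.size()) {
        size_t end = idx.data.find('\n', start);
        if (end == std::string::npos) end = idx.data.size();
        idx.lines.emplace_back(idx.data.data() + start, end - start);
        start = end + 1;
    }
    return idx;
}

Each view is just a pointer and a length, so the per-line cost drops to 16 bytes with zero allocations beyond the single buffer.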
P.S. Another relevant factor will be the physical disk I/O, which might or might not be bypassed by caching.
Upvotes: 1
Reputation: 11
For C++ you could try something like this:
#include <fstream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>

using namespace std;

void processData(const string& str)
{
    vector<string> arr;
    boost::split(arr, str, boost::is_any_of(" \n"));
    do_some_operation(arr);  // placeholder for your own processing
}

int main()
{
    const size_t read_bytes = 45 * 1024 * 1024;
    const char* fname = "input.txt";
    ifstream fin(fname, ios::in | ios::binary);
    vector<char> memblock(read_bytes);  // one reusable buffer, no new/delete per block

    // Loop on read()/gcount() rather than eof(): eof() only becomes true
    // after a read has already failed, and the final block is usually
    // shorter than read_bytes. Constructing the string from gcount()
    // bytes also avoids relying on a null terminator that read() never
    // writes.
    while (fin.read(memblock.data(), read_bytes) || fin.gcount() > 0)
    {
        string str(memblock.data(), static_cast<size_t>(fin.gcount()));
        processData(str);
    }

    // Note: a line can still straddle two blocks; stitching the halves
    // back together is left to the caller.
    return 0;
}
Upvotes: 1