lily
lily

Reputation: 395

Optimize file reading C++

string str1, str2;
vector<string> vec;

ifstream infile;

infile.open("myfile.txt");
while (! infile.eof() )
{
    getline(infile,str1);
    istringstream is;
        is >> str1;
    while (is >> str2)
    { 
        vec.push_back(str2);
    }
} 

What the code does is read the string from the file and stores it into a vector.

Performance needs to be given priority. How can i optimize this code, to make the reading performance faster?

Upvotes: 4

Views: 2804

Answers (4)

Ali
Ali

Reputation: 58431

As others have already pointed out (see for example herohuyongtao's answer), the loop condition and how you put str1 into the istringstream must be fixed.

However, there is an important issue here that everybody has missed so far: you don't need the istringstream at all!

  vec.reserve(the_number_of_words_you_exptect_at_least);

  while (infile >> str1) {

    vec.push_back(str1);
  }

It gets rid of the inner loop which you didn't need in the first place, and doesn't create an istringstream in each iteration.

If you need to parse further each line and you do need an istringstream, create it outside of the loop and set its string buffer via istringstream::str(const string& s).

I can easily imagine your loops being very slow: Heap allocation on Windows is outrageously slow (compared to Linux); I got bitten once.

Andrei Alexandrescu presents (in some sense) a similar example in his talk Writing Quick Code in C++, Quickly. The surprising thing is that doing unnecessary heap allocations in a tight loop like the one above can be slower than the actual file IO. I was surprised to see that.


You didn't tag your question as C++11 but here is what I would do in C++11.

  while (infile >> str1) {

    vec.emplace_back(std::move(str1));
  }

This move constructs the string at the back of the vector, without copying in. We can do it because we don't need the contents of str1 after we have put it into the vector. In other words, there is no need to copy it into a brand new string at the back of the vector, it is sufficient to just move its contents there. The first loop with the vec.push_back(str1); may potentially copy the contents of str1 which is really unnecessary.

The string implementation in gcc 4.7.2 is currently copy on write, so the two loops have identical performance; it doesn't matter which one you use. For now.

Unfortunately, copy on write strings are now forbidden by the standard. I don't know when the gcc developers are going to change the implementation. If the implementation changes, it may make a difference in performance whether you move (emplace_back(std::move(s))) or you copy (push_back(s)).

If C++98 compatibility is important for you, then go with push_back(). Even if the worst thing happens in the future and your string is copied (it isn't copied now), that copy can be turned into a memmove() / memcpy() which is blazing fast, most likely faster than reading the contents of the file from the hard disk so file IO will most likely remain the bottleneck.

Upvotes: 4

herohuyongtao
herohuyongtao

Reputation: 50657

Before any optimization, you need to change

while (! infile.eof() )      // problem 1
{
    getline(infile,str1);
    istringstream is;
        is >> str1;          // problem 2
    while (is >> str2){ 
        vec.push_back(str2);
        }
 }

to

while ( getline(infile,str1) ) // 1. don't use eof() in a while-condition
{
    istringstream is(str1);    // 2. put str1 to istringstream
    while (is >> str2){ 
        vec.push_back(str2);
        }
 }

to make it work as you expected.


P.S. For the optimization part, you don't need to think too much on it unless it becomes a bottleneck. Premature optimization is the root of all evil. However, if you do want to speed it up, check out @Ali's answer for further info.

Upvotes: 2

CouchDeveloper
CouchDeveloper

Reputation: 19098

The fastest, though not fully portable, approach is to load the file into a memory mapped region (see wiki mmap.

Given that you know the size of the file, you now can define forward iterators (possibly pointer to const char) on that memory region which you can use to find the tokens which separate your file into "strings".

Essentially, you repeatedly get a pair of pointers pointing to the first character respectively the end of each "string". From that pair of iterators create your std::string.

This approach has subtle issues though:

  • You need to take care of the character encoding of the file, possibly convert from this character encoding to your desired encoding which is used by your std::string (presumable UTF-8).

  • The "token" to separate strings (usually \n, may be platform dependent, or may depend on which program created the file.

Upvotes: 1

bobah
bobah

Reputation: 18864

Loop condition is wrong. Not a performance issue. Assuming this IO loop is indeed your application's bottleneck. But even if not, it can be a good educational exercise or just a weekend fun.

You have quite a few temporaries and cases of dynamic memory allocation in the loop.

Calling std::vector::reserve() in front of the loop will improve it a bit. Reallocating it manually to emulate x1.2 grow factor opposed to 2x after some size will help as well. std::list may though be more appropriate if the file size is unpredictable.

Using std::istringstream as a tokenizer is very inoptimal. Switching to iterator-based "view" tokenizer (Boost has one) should improve the speed a lot.

If you need it to be very fast and have enough RAM you can memory map the file before reading it. Boost::iostreams can let you get there quick. Generally, though, without Boost you can be twice as fast (Boost is not bad but it has to be generic and work on a dozen of compilers this is why).

If you are a blessed person using Unix/Linux as your development environment run your program under valgrind --tool=cachegrind and you will see all problematic places and how bad they are relative to one other. Also, valgrind --tool=massif will let you identify nubmerous small heap allocated objects which is generally not tolerable in the high performance code.

Upvotes: 1

Related Questions