Reputation: 34195
I need to count the occurrences of the string "<page>"
in a 104 GB file to get the number of articles in a given Wikipedia dump. First, I tried this:
grep -F '<page>' enwiki-20141208-pages-meta-current.xml | uniq -c
However, grep crashes after a while, so I wrote the following program instead. It only processes about 20 MB/s of the input file on my machine, which is roughly 5% of what my HDD can deliver. How can I speed up this code?
#include <iostream>
#include <fstream>
#include <string>

int main()
{
    // Open up file
    std::ifstream in("enwiki-20141208-pages-meta-current.xml");
    if (!in.is_open()) {
        std::cout << "Could not open file." << std::endl;
        return 0;
    }
    // Statistics counters
    size_t chars = 0, pages = 0;
    // Token to look for
    const std::string token = "<page>";
    size_t token_length = token.length();
    // Current position within the token
    size_t matching = 0;
    while (in.good()) {
        // Read one char at a time
        char current;
        in.read(&current, 1);
        if (in.eof())
            break;
        chars++;
        // Continue matching the token
        if (current == token[matching]) {
            matching++;
            // Reached full token
            if (matching == token_length) {
                pages++;
                matching = 0;
                // Print progress
                if (pages % 1000 == 0) {
                    std::cout << pages << " pages, ";
                    std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
                }
            }
        }
        // Start over again
        else {
            matching = 0;
        }
    }
    // Print result
    std::cout << "Overall pages: " << pages << std::endl;
    // Cleanup
    in.close();
    return 0;
}
Upvotes: 0
Views: 161
Reputation: 4925
I'm using this file to test with: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-current1.xml-p000000010p000010000.bz2
It takes roughly 2.4 seconds, versus 11.5 seconds with your code. The total character count is slightly different because newlines are not counted, but I assume that's acceptable since it's only used to display progress.
void parseByLine()
{
    // Open up file
    std::ifstream in("enwiki-latest-pages-meta-current1.xml-p000000010p000010000");
    if(!in)
    {
        std::cout << "Could not open file." << std::endl;
        return;
    }
    size_t chars = 0;
    size_t pages = 0;
    const std::string token = "<page>";
    std::string line;
    while(std::getline(in, line))
    {
        chars += line.size();
        size_t pos = 0;
        for(;;)
        {
            pos = line.find(token, pos);
            if(pos == std::string::npos)
            {
                break;
            }
            pos += token.size();
            if(++pages % 1000 == 0)
            {
                std::cout << pages << " pages, ";
                std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
            }
        }
    }
    // Print result
    std::cout << "Overall pages: " << pages << std::endl;
}
Here's an example that adds each line to a buffer and then processes the buffer once it reaches a threshold. It takes 2 seconds versus ~2.4 for the first version. I played with several different thresholds for the buffer size, and also with processing after a fixed number of lines (16, 32, 64, 4096); it all seems about the same as long as some batching is going on. Thanks to Dietmar for the idea.
int processBuffer(const std::string& buffer)
{
    static const std::string token = "<page>";
    int pages = 0;
    size_t pos = 0;
    for(;;)
    {
        pos = buffer.find(token, pos);
        if(pos == std::string::npos)
        {
            break;
        }
        pos += token.size();
        ++pages;
    }
    return pages;
}
void parseByMB()
{
    // Open up file
    std::ifstream in("enwiki-latest-pages-meta-current1.xml-p000000010p000010000");
    if(!in)
    {
        std::cout << "Could not open file." << std::endl;
        return;
    }
    const size_t BUFFER_THRESHOLD = 16 * 1024 * 1024;
    std::string buffer;
    buffer.reserve(BUFFER_THRESHOLD);
    size_t pages = 0;
    size_t chars = 0;
    size_t progressCount = 0;
    std::string line;
    while(std::getline(in, line))
    {
        buffer += line;
        if(buffer.size() > BUFFER_THRESHOLD)
        {
            pages += processBuffer(buffer);
            chars += buffer.size();
            buffer.clear();
        }
        if((pages / 1000) > progressCount)
        {
            ++progressCount;
            std::cout << pages << " pages, ";
            std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
        }
    }
    if(!buffer.empty())
    {
        pages += processBuffer(buffer);
        chars += buffer.size();
        std::cout << pages << " pages, ";
        std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
    }
}
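For reference, a simple harness along these lines can be used to time the two versions (just a sketch: it assumes the two functions above are in scope and uses std::chrono, which is not necessarily how the numbers quoted above were measured):

#include <chrono>
#include <iostream>

// Assumes parseByLine() and parseByMB() from above are declared earlier in this file.
template <typename Func>
void timeIt(const char* name, Func f)
{
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    // Convert the elapsed time to seconds as a double for printing
    std::chrono::duration<double> seconds = stop - start;
    std::cout << name << " took " << seconds.count() << " s" << std::endl;
}

int main()
{
    timeIt("parseByLine", parseByLine);
    timeIt("parseByMB", parseByMB);
    return 0;
}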
Upvotes: 1
Reputation: 153915
Assuming there are no insanely large lines in the file, using something like
for (std::string line; std::getline(in, line); ) {
    // find the number of "<page>" strings in line
}
is bound to be a lot faster! Reading one character at a time is about the worst thing you can possibly do. It is really hard to get any slower. For each character, the stream will do something like this:
1. check whether there is a tie()ed stream which needs flushing (there isn't, i.e., that check is pointless);
2. call xsgetn() on the stream's stream buffer.
There is a lot of waste in there!
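Filling in the comment in that loop could look roughly like this (a sketch only; std::string::find is one way to do the counting, and the file name is taken from the question):

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream in("enwiki-20141208-pages-meta-current.xml");
    const std::string token = "<page>";
    size_t pages = 0;

    for (std::string line; std::getline(in, line); ) {
        // Count the non-overlapping occurrences of the token in this line
        for (size_t pos = line.find(token); pos != std::string::npos;
             pos = line.find(token, pos + token.size())) {
            ++pages;
        }
    }
    std::cout << "Overall pages: " << pages << std::endl;
}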
I can't really imagine why grep would fail, except that some line blows massively over the expected maximum line length. Although the use of std::getline() and std::string() is likely to have a much bigger upper bound, it is still not efficient to process huge lines. If the file may contain massive lines, it may be more reasonable to use something along the lines of this:
for (std::istreambuf_iterator<char> it(in), end;
     (it = std::find(it, end, '<')) != end; ) {
    // match "<page>" at the start of the sequence [it, end)
}
For a bad implementation of streams that's still doing too much. Good implementations will do the calls to std::find(...) very efficiently and will probably check multiple characters at once, adding a check and a loop only for something like every 16th loop iteration. I'd expect the above code to turn your CPU-bound implementation into an I/O-bound implementation. A bad implementation may still be CPU-bound, but it should still be a lot better.
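A filled-in version of that loop might look like the following sketch (the matching step is restructured slightly so the body does the advancing; the file name is taken from the question):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
    std::ifstream in("enwiki-20141208-pages-meta-current.xml");
    const std::string token = "<page>";
    size_t pages = 0;

    std::istreambuf_iterator<char> it(in), end;
    while ((it = std::find(it, end, '<')) != end) {
        // '<' found: consume it and try to match the rest of the token
        ++it;
        size_t i = 1;
        while (i < token.size() && it != end && *it == token[i]) {
            ++it;
            ++i;
        }
        if (i == token.size()) {
            ++pages;
        }
        // On a mismatch the iterator is left at the offending character
        // (dereferencing does not advance it), so a '<' that starts a new
        // candidate is not skipped by the next std::find.
    }
    std::cout << "Overall pages: " << pages << std::endl;
}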
In any case, remember to enable optimizations!
Upvotes: 1