Reputation: 34195
I need to count the occurrences of the string "<page>"
in a 104 GB file to get the number of articles in a given Wikipedia dump. First, I tried this:
grep -F '<page>' enwiki-20141208-pages-meta-current.xml | uniq -c
However, grep crashes after a while, so I wrote the following program instead. It only processes about 20 MB/s of the input file on my machine, which is roughly 5% of what my HDD can deliver. How can I speed up this code?
#include <iostream>
#include <fstream>
#include <string>

int main()
{
    // Open up file
    std::ifstream in("enwiki-20141208-pages-meta-current.xml");
    if (!in.is_open()) {
        std::cout << "Could not open file." << std::endl;
        return 0;
    }
    // Statistics counters
    size_t chars = 0, pages = 0;
    // Token to look for
    const std::string token = "<page>";
    size_t token_length = token.length();
    // Current position within the token
    size_t matching = 0;
    while (in.good()) {
        // Read one char at a time
        char current;
        in.read(&current, 1);
        if (in.eof())
            break;
        chars++;
        // Continue matching the token
        if (current == token[matching]) {
            matching++;
            // Reached full token
            if (matching == token_length) {
                pages++;
                matching = 0;
                // Print progress
                if (pages % 1000 == 0) {
                    std::cout << pages << " pages, ";
                    std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
                }
            }
        }
        // Start over again
        else {
            matching = 0;
        }
    }
    // Print result
    std::cout << "Overall pages: " << pages << std::endl;
    // Cleanup
    in.close();
    return 0;
}
Upvotes: 0
Views: 161
Reputation: 4925
I'm using this file to test with: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-current1.xml-p000000010p000010000.bz2
It takes roughly 2.4 seconds, versus 11.5 seconds with your code. The total character count is slightly different because newlines are not counted, but I assume that's acceptable since it's only used to display progress.
void parseByLine()
{
    // Open up file
    std::ifstream in("enwiki-latest-pages-meta-current1.xml-p000000010p000010000");
    if(!in)
    {
        std::cout << "Could not open file." << std::endl;
        return;
    }
    size_t chars = 0;
    size_t pages = 0;
    const std::string token = "<page>";
    std::string line;
    while(std::getline(in, line))
    {
        chars += line.size();
        size_t pos = 0;
        for(;;)
        {
            pos = line.find(token, pos);
            if(pos == std::string::npos)
            {
                break;
            }
            pos += token.size();
            if(++pages % 1000 == 0)
            {
                std::cout << pages << " pages, ";
                std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
            }
        }
    }
    // Print result
    std::cout << "Overall pages: " << pages << std::endl;
}
Here's an example that adds each line to a buffer and then processes the buffer once it reaches a threshold. It takes 2 seconds versus ~2.4 for the first version. I played with several different thresholds for the buffer size, and also with processing after a fixed number of lines (16, 32, 64, 4096); it all seems about the same as long as some batching is going on. Thanks to Dietmar for the idea.
int processBuffer(const std::string& buffer)
{
    static const std::string token = "<page>";
    int pages = 0;
    size_t pos = 0;
    for(;;)
    {
        pos = buffer.find(token, pos);
        if(pos == std::string::npos)
        {
            break;
        }
        pos += token.size();
        ++pages;
    }
    return pages;
}
void parseByMB()
{
    // Open up file
    std::ifstream in("enwiki-latest-pages-meta-current1.xml-p000000010p000010000");
    if(!in)
    {
        std::cout << "Could not open file." << std::endl;
        return;
    }
    const size_t BUFFER_THRESHOLD = 16 * 1024 * 1024;
    std::string buffer;
    buffer.reserve(BUFFER_THRESHOLD);
    size_t pages = 0;
    size_t chars = 0;
    size_t progressCount = 0;
    std::string line;
    while(std::getline(in, line))
    {
        buffer += line;
        if(buffer.size() > BUFFER_THRESHOLD)
        {
            pages += processBuffer(buffer);
            chars += buffer.size();
            buffer.clear();
        }
        if((pages / 1000) > progressCount)
        {
            ++progressCount;
            std::cout << pages << " pages, ";
            std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
        }
    }
    if(!buffer.empty())
    {
        pages += processBuffer(buffer);
        chars += buffer.size();
        std::cout << pages << " pages, ";
        std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
    }
}
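For reference, a simple harness along these lines can be used to time the two versions (just a sketch: it assumes the two functions above are in scope and uses std::chrono, which is not necessarily how the numbers quoted above were measured):

#include <chrono>
#include <iostream>

// Assumes parseByLine() and parseByMB() from above are declared earlier in this file.
template <typename Func>
void timeIt(const char* name, Func f)
{
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    // Convert the elapsed time to seconds as a double for printing
    std::chrono::duration<double> seconds = stop - start;
    std::cout << name << " took " << seconds.count() << " s" << std::endl;
}

int main()
{
    timeIt("parseByLine", parseByLine);
    timeIt("parseByMB", parseByMB);
    return 0;
}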
Upvotes: 1
Reputation: 153915
Assuming there are no insanely large lines in the file, using something like
for (std::string line; std::getline(in, line); ) {
    // find the number of "<page>" strings in line
}
is bound to be a lot faster! Reading one character at a time is about the worst thing you can possibly do. It is really hard to get any slower. For each character, the stream will do something like this:
1. check whether there is a tie()ed stream which needs flushing (there isn't, i.e., that check is pointless);
2. call xsgetn() on the stream's stream buffer.
There is a lot of waste in there!
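Filling in the comment in that loop could look roughly like this (a sketch only; std::string::find is one way to do the counting, and the file name is taken from the question):

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream in("enwiki-20141208-pages-meta-current.xml");
    const std::string token = "<page>";
    size_t pages = 0;

    for (std::string line; std::getline(in, line); ) {
        // Count the non-overlapping occurrences of the token in this line
        for (size_t pos = line.find(token); pos != std::string::npos;
             pos = line.find(token, pos + token.size())) {
            ++pages;
        }
    }
    std::cout << "Overall pages: " << pages << std::endl;
}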
I can't really imagine why grep would fail, except that some line blows massively over the expected maximum line length. Although the use of std::getline() and std::string() is likely to have a much bigger upper bound, it is still not efficient to process huge lines. If the file may contain massive lines, it may be more reasonable to use something along the lines of this:
for (std::istreambuf_iterator<char> it(in), end;
     (it = std::find(it, end, '<')) != end; ) {
    // match "<page>" at the start of the sequence [it, end)
}
For a bad implementation of streams that's still doing too much. Good implementations will do the calls to std::find(...) very efficiently and will probably check multiple characters at once, adding a check and a loop only for something like every 16th loop iteration. I'd expect the above code to turn your CPU-bound implementation into an I/O-bound implementation. A bad implementation may still be CPU-bound, but it should still be a lot better.
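A filled-in version of that loop might look like the following sketch (the matching step is restructured slightly so the body does the advancing; the file name is taken from the question):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
    std::ifstream in("enwiki-20141208-pages-meta-current.xml");
    const std::string token = "<page>";
    size_t pages = 0;

    std::istreambuf_iterator<char> it(in), end;
    while ((it = std::find(it, end, '<')) != end) {
        // '<' found: consume it and try to match the rest of the token
        ++it;
        size_t i = 1;
        while (i < token.size() && it != end && *it == token[i]) {
            ++it;
            ++i;
        }
        if (i == token.size()) {
            ++pages;
        }
        // On a mismatch the iterator is left at the offending character
        // (dereferencing does not advance it), so a '<' that starts a new
        // candidate is not skipped by the next std::find.
    }
    std::cout << "Overall pages: " << pages << std::endl;
}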
In any case, remember to enable optimizations!
Upvotes: 1