Reputation: 679
I wrote a CGI script for my website which reads through blocks of text and matches all occurrences of English words. I've been making some fundamental changes to the site's code recently which have necessitated rewriting most of it in C++. As I'd hoped, almost everything has become much faster in C++ than perl, with the exception of this function.
I know that regexes are a relatively recent addition to C++ and not necessarily its strongest suit. It may simply be the case that it is slower than perl in this instance. But I wanted to share my code in the hopes that someone might be able to find a way of speeding up what I am doing in C++.
Here is the perl code:
open(WORD, "</file/path/wordthree.txt") || die "opening";
while(<WORD>) {
chomp;
push @wordlist,$_;
}
close (WORD) || die "closing";
foreach (@wordlist) {
while ($bloc =~ m/$_/g) {
$location = pos($bloc) - length($_);
$match=$location.";".pos($bloc).";".$_;
push(@hits,$match);
}
}
wordthree.txt is a list of ~270,000 English words separated by new lines, and $bloc is 3200 characters of text. Perl performs these searches in about one second. You can see it in play here if you like: http://libraryofbabel.info/anglishize.cgi?05y-w1-s3-v20:1
With C++ I have tried the following:
typedef std::map<std::string::difference_type, std::string> hitmap;
hitmap hits;
void regres(const boost::match_results<std::string::const_iterator>& what) {
hits[what.position()]=what[0].str();
}
words.open ("/file/path/wordthree.txt");
std::string wordlist[274784];
unsigned i = 0;
while (words >> wordlist[i]) {i++;}
words.close();
for (unsigned i=0;i<274783;i++) {
boost::regex word(wordlist[i]);
boost::sregex_iterator lex(book.begin(),book.end(), word);
boost::sregex_iterator end;
std::for_each(lex, end, ®res);
}
The C++ version takes about 12 seconds to read the same amount of text the same number of times. Any advice on how to make it competitive with the perl script is greatly appreciated.
Upvotes: 0
Views: 404
Reputation: 392954
Firstly I'd cut down on the number of allocations:
string_ref
instead of std::string
where possibleconst char*
instead std::string::const_iterator
to navigate the bookHere is a sample that uses Boost Spirit Qi to parse the wordlist (I don't have yours, so I assume line-separated words).
std::vector<sref> wordlist;
io::mapped_file_source mapped("/etc/dictionaries-common/words");
qi::parse(mapped.begin(), mapped.end(), qi::raw[+(qi::char_ - qi::eol)] % qi::eol, wordlist);
In full Live On Coliru¹
#include <boost/regex.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
namespace qi = boost::spirit::qi;
namespace io = boost::iostreams;
using sref = boost::string_ref;
using regex = boost::regex;
namespace boost { namespace spirit { namespace traits {
template <typename It>
struct assign_to_attribute_from_iterators<sref, It, void> {
static void call(It f, It l, sref& attr) { attr = { f, size_t(std::distance(f,l)) }; }
};
} } }
typedef std::map<std::string::difference_type, sref> hitmap;
hitmap hits;
void regres(const boost::match_results<const char*>& what) {
hits[what.position()] = sref(what[0].first, what[0].length());
}
int main() {
std::vector<sref> wordlist;
io::mapped_file_source mapped("/etc/dictionaries-common/words");
qi::parse(mapped.begin(), mapped.end(), qi::raw[+(qi::char_ - qi::eol)] % qi::eol, wordlist);
std::cout << "Wordlist contains " << wordlist.size() << " entries\n";
io::mapped_file_source book("/etc/dictionaries-common/words");
for (auto const& s: wordlist) {
regex word(s.to_string());
boost::cregex_iterator lex(book.begin(), book.end(), word), end;
std::for_each(lex, end, ®res);
}
}
This still creates a regex each iteration. I have a suspicion it will be a lot more efficient if you combine it all into a single pattern. You'll spend more memory/CPU creating the regex, but you'll reduce the power of the loop by the number of entries in the word list.
Because the regex library might not have been designed for this scale, you could have better results with a custom search and a trie implementation.
Here's a simple attempt (that is indeed much faster for my /etc/dictionaries-common/words
file of 99171 lines):
#include <boost/regex.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
namespace io = boost::iostreams;
using sref = boost::string_ref;
using regex = boost::regex;
typedef std::map<std::string::difference_type, sref> hitmap;
hitmap hits;
void regres(const boost::match_results<const char*>& what) {
hits[what.position()] = sref(what[0].first, what[0].length());
}
int main() {
io::mapped_file_params params("/etc/dictionaries-common/words");
params.flags = io::mapped_file::mapmode::priv;
io::mapped_file mapped(params);
std::replace(mapped.data(), mapped.end(), '\n', '|');
regex const wordlist(mapped.begin(), mapped.end() - 1);
io::mapped_file_source book("/etc/dictionaries-common/words");
boost::cregex_iterator lex(book.begin(), book.end(), wordlist), end;
std::for_each(lex, end, ®res);
}
¹ of course coliru doesn't have a suitable wordlist
Upvotes: 1
Reputation: 118330
It looks to me like Perl is smart enough to figure out that you're abusing regular expressions to do an ordinary linear search, a straightforward lookup. You are looking up straight text, and none of your search patterns appear to be, well, a pattern. Based on your description, all your search patterns look like ordinary strings, so Perl is likely optimizing it down to a linear string search.
I am not familiar with Boost's internal implementation of regular expression matching, but it's likely that it's compiling each search string into a state machine, and then executing the state machine for each search. That's the usual approach used with generic regular expression implementations. And that's a lot of work. A lot of completely needless work, in this specific case.
What you should do is as follows:
1) You are reading wordthree.txt
into an array of strings. Instead of doing it, read it into a std::set<std::string>
instead.
2) You are reading the entire text to search into a single book
container. It's not clear, based on your code, whether book
is a single std::string
, or a std::vector<char>
. But whatever the case, don't do that. Read the text to search iteratively, one word at a time. For each word, look it up in the std::set
, and go from there.
This is, after all, what you're trying to do directly, and you should do that instead of taking a grand detour through the wonders of regular expressions, which accomplishes very little other than wasting a lot of time.
If you implement this correctly, you'll likely see C++ being just as fast, if not faster, than Perl.
I could also think of several other, more aggresively optimized approaches which also leverage std::set
, but with custom classes and comparators, that seek to avoid all the heap allocations inherent with using a bunch of std::string
, but it probably won't be necessary. A basic approach using a std::set
-based lookup should be fast enough.
Upvotes: 0