I am currently working as a school project in developing a spelling checker in C++. For the part which consists in checking if a word exists, I currently do the following : I found online a .txt file with all the english existing words My script starts by going through this text files and placing each of its entry in a map object, for an easy access. The problem with this approach is that when the program starts, the step 2) takes approximately 20 secs. This is not a big deal in itself but I was wondering if any of you had an idea of an other approach to have my database of words quickly available. For instance, would there be a way to store the map object in a file, so that I don't need to build it from the text file every time ?

Reputation: 977

How can I easily and quickly store a big word database?

I am currently working as a school project in developing a spelling checker in C++. For the part which consists in checking if a word exists, I currently do the following :

I found online a .txt file with all the english existing words
My script starts by going through this text files and placing each of its entry in a map object, for an easy access.

The problem with this approach is that when the program starts, the step 2) takes approximately 20 secs. This is not a big deal in itself but I was wondering if any of you had an idea of an other approach to have my database of words quickly available. For instance, would there be a way to store the map object in a file, so that I don't need to build it from the text file every time ?

Upvotes: 0

Answers (3)

mindthegap

Reputation: 363

I'm a little bit surprised that nobody came up with the idea of serialization yet. Boost provides great support for such a solution. If I understood you correctly, the problem is that it takes too long to read in your list of words (and put them into a data structure that hopefully provides fast look-up operations), whenever you use your application. Building up such a structure, then saving it into a binary file for later reuse should improve the performance of your application (based on the results presented below).

Here is a piece of code (and a minimal working example, at the same time) that might help you out on this.

#include <chrono>
#include <fstream>
#include <iostream>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/set.hpp> 

#include "prettyprint.hpp"

class Dictionary {
public:
  Dictionary() = default;
  Dictionary(std::string const& file_)
    : _file(file_)
  {}

  inline size_t size() const { return _words.size(); }

  void build_wordset()
  {
    if (!_file.size()) { throw std::runtime_error("No file to read!"); }

    std::ifstream infile(_file);
    std::string line;

    while (std::getline(infile, line)) {
      _words.insert(line);
    }
  }

  friend std::ostream& operator<<(std::ostream& os, Dictionary const& d)
  {
    os << d._words;  // cxx-prettyprint used here
    return os;
  }

  int save(std::string const& out_file) 
  { 
    std::ofstream ofs(out_file.c_str(), std::ios::binary);
    if (ofs.fail()) { return -1; }

    boost::archive::binary_oarchive oa(ofs); 
    oa << _words;
    return 0;
  }

  int load(std::string const& in_file)
  {
    _words.clear();

    std::ifstream ifs(in_file);
    if (ifs.fail()) { return -1; }

    boost::archive::binary_iarchive ia(ifs);
    ia >> _words;
    return 0;
  }

private:
  friend class boost::serialization::access;

  template <typename Archive>
  void serialize(Archive& ar, const unsigned int version)
  {
    ar & _words;
  }

private:
  std::string           _file;
  std::set<std::string> _words;
};

void create_new_dict()
{
  std::string const in_file("words.txt");
  std::string const ser_dict("words.set");

  Dictionary d(in_file);

  auto start = std::chrono::system_clock::now();
  d.build_wordset();
  auto end = std::chrono::system_clock::now();
  auto elapsed =
    std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

  std::cout << "Building up the dictionary took: " << elapsed.count()
            << " (ms)" << std::endl
            << "Size of the dictionary: " << d.size() << std::endl;

  d.save(ser_dict);
}

void use_existing_dict()
{
  std::string const ser_dict("words.set");

  Dictionary d;

  auto start = std::chrono::system_clock::now();
  d.load(ser_dict);
  auto end = std::chrono::system_clock::now();
  auto elapsed =
    std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

  std::cout << "Loading in the dictionary took: " << elapsed.count()
            << " (ms)" << std::endl
            << "Size of the dictionary: " << d.size() << std::endl;
}

int main()
{
  create_new_dict();
  use_existing_dict();
  return 0;
}

Sorry for not putting the code into separated files and for the poor design; however, for demonstrating purposes it should be just enough.

Note that I didn't use a map: I just don't see the point of storing a lot of zeros or anything else unnecessarily. AFAIK, a std::set is backed by the same powerful RB-Tree as std::maps.

For the data set available here (it contains around 466k words), I got the following results:

Building up the dictionary took: 810 (ms)
Size of the dictionary: 466544
Loading in the dictionary took: 271 (ms)
Size of the dictionary: 466544

Dependencies:

Boost's Serialization component (however, I used version 1.58).
louisdx/cxx-prettyprint.

Hope this helps. :) Cheers.

Upvotes: 0

Andrew Dwojc

Reputation: 119

First things first. Do not use a map (or a set) for storing a word list. Use a vector of strings, make sure its contents are sorted (I would believe your word list is already sorted), and then use binary_find from the <algorithm> header to check if a word is already in the dictionary.

Although this may still be highly suboptimal (depending on whether your compiler does a small string optimisation), your load times will improve by at least an order of magnitude. Do a benchmark and if you want to make it faster still, post another question on the vector of strings.

Upvotes: 0

FrankS101

Reputation: 2135

If your file with all English words is not dynamic, you can just store it in a static map. To do so, you need to parse your .txt file, something like:

alpha

beta

gamma

...

to convert it to something like this:

static std::map<std::string,int> wordDictionary = {
                { "alpha", 0 },
                { "beta", 0 },
                { "gamma", 0 } 
                   ... };

You can do that programatically or simply with find and replace in your favourite text editor.

Your .exe is going to be much heavier than before, but it will also start much faster than reading this info from file.

Upvotes: 1

How can I easily and quickly store a big word database?

Answers (3)

Related Questions