Reputation: 389
I have a dictionary .txt file with probably over a thousand words and their definitions. I've already written a program to take the first word of each line from this file and check it against a string input by the user:
#include <algorithm>
#include <fstream>
#include <sstream>
#include <string>

std::string checkWord(string input) //returns a status string, so not void
{
    std::ifstream inFile;
    inFile.open("Oxford.txt");
    if (inFile.is_open())
    {
        string line; //there is a "using std::string" in another file
        while (getline(inFile, line))
        {
            //read the first word from each line
            std::istringstream iss(line);
            string word;
            iss >> word;
            //make sure the strings being compared are the same case
            std::transform(word.begin(), word.end(), word.begin(), ::tolower);
            std::transform(input.begin(), input.end(), input.begin(), ::tolower);
            if (word == input)
            {
                //Do a thing with word
            }
        }
        inFile.close();
        return "End of file";
    }
    else
    {
        return "Unable to open file";
    }
}
But if I'm checking more than a sentence, the time it takes to process becomes noticeable. I've thought about a few ways of making this time shorter.
Given that the data is already "sorted", what kind of data structure or method should I employ in order to (if possible) reduce time complexity? Also, are there any issues with the function I am using to compare strings? (For example, would string::compare() be quicker than "=="?)
Upvotes: 2
Views: 1830
Reputation: 415
You can benefit from using a prefix tree, also known as a trie data structure, as it fits the use case of having a dictionary and frequently looking up words in it.
The simplest model of a trie is a tree where each node holds a letter and a flag to tell whether the current letter is the end of a word (and, additionally, pointers to other data about the word).
[Image: a trie containing the dictionary words aback, abate, bid, bird, birth, black, blast]
To search for a word, start from the root, and for each letter of your word follow the node containing the current letter (halt if it isn't present as a child of the current node). The search time is proportional to the length of the word you look up, instead of to the size of your dictionary.
A trie also allows you to easily get the alphabetic (lexicographical) order of words in a dictionary: just do a pre-order traversal of it.
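A minimal sketch of such a trie in C++ (the node layout and names here are illustrative assumptions, restricted to lowercase a-z):

#include <memory>
#include <string>

//one node per letter; isWord marks the end of a dictionary word
struct TrieNode
{
    std::unique_ptr<TrieNode> children[26]; //assumes lowercase 'a'..'z' only
    bool isWord = false;
};

struct Trie
{
    TrieNode root;

    void insert(const std::string& word)
    {
        TrieNode* node = &root;
        for (char c : word)
        {
            int i = c - 'a';
            if (!node->children[i])
                node->children[i] = std::make_unique<TrieNode>();
            node = node->children[i].get();
        }
        node->isWord = true;
    }

    //O(word length), independent of dictionary size
    bool contains(const std::string& word) const
    {
        const TrieNode* node = &root;
        for (char c : word)
        {
            int i = c - 'a';
            if (!node->children[i])
                return false; //letter not present as a child: halt
            node = node->children[i].get();
        }
        return node->isWord;
    }
};

The 26-way child array trades memory for speed; a map-based child container would shrink it at some cost per step.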
Upvotes: 1
Reputation: 129314
So, there are "simple" fixes, and there are some more complex ones.
The first step is to move all unnecessary things out of the search loop: lowercase input once, before the loop, rather than every time - after all, it's not changing. If possible, lowercase Oxford.txt too, so you don't have to lowercase word for every line.
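As a sketch, applied to the loop from the question:

//lowercase the input once, outside the loop - it never changes
std::transform(input.begin(), input.end(), input.begin(), ::tolower);
while (getline(inFile, line))
{
    std::istringstream iss(line);
    string word;
    iss >> word;
    //if Oxford.txt is already lowercased, this transform can go too
    std::transform(word.begin(), word.end(), word.begin(), ::tolower);
    if (word == input)
    {
        //Do a thing with word
    }
}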
If you are searching multiple times, re-reading the file for every search is definitely not a great solution - even if it's cached by the filesystem after the first time.
So read it once into some container - a really simple one would be std::vector<std::string> [and lower-case the strings at the same time] - and just iterate over it. The next improvement would be to sort the vector and use a binary search (std::lower_bound from <algorithm> will do the searching, or you can write it yourself - it's not terribly hard).
A slightly more complex solution [but faster to search] would be to use std::map<std::string, std::string> wordlist (but that also takes a bit more space), then use auto pos = wordlist.find(input); if (pos != wordlist.end()) { /* found word */ }.
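A minimal sketch of the map version, assuming each line of Oxford.txt is a word followed by its definition:

#include <algorithm>
#include <fstream>
#include <map>
#include <sstream>
#include <string>

//read the dictionary once into word -> definition
std::map<std::string, std::string> loadDictionary(const std::string& path)
{
    std::map<std::string, std::string> wordlist;
    std::ifstream inFile(path);
    std::string line;
    while (std::getline(inFile, line))
    {
        std::istringstream iss(line);
        std::string word, definition;
        iss >> word;
        std::getline(iss, definition); //the rest of the line
        std::transform(word.begin(), word.end(), word.begin(), ::tolower);
        wordlist[word] = definition;
    }
    return wordlist;
}

//each lookup is now O(log n) instead of a full file scan
bool checkWord(const std::map<std::string, std::string>& wordlist, std::string input)
{
    std::transform(input.begin(), input.end(), input.begin(), ::tolower);
    return wordlist.find(input) != wordlist.end();
}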
Upvotes: 1
Reputation: 16046
Instead of storing everything in a .txt file, store it in a real database.
SQLite3 is a good choice for simple tasks, since it is in-process instead of requiring an external server.
For a task this simple, the C API and SQL statements should be very easy to learn.
Something like:
-- Only do this once, for setup, not each time you run your program.
sqlite> CREATE TABLE dictionary (word TEXT PRIMARY KEY);
sqlite> .import /usr/share/dict/words dictionary
-- Do this every time you run your program.
sqlite> select count(*) from dictionary where word = 'a';
1
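A minimal sketch of the per-run lookup through the SQLite C API (the dictionary.db filename is an assumption; error handling is mostly omitted):

#include <sqlite3.h>
#include <stdio.h>

//returns 1 if word is in the dictionary table, 0 otherwise
int checkWord(sqlite3* db, const char* word)
{
    sqlite3_stmt* stmt = NULL;
    int found = 0;
    //prepared statement with a bound parameter; reuse it for many lookups
    if (sqlite3_prepare_v2(db,
            "SELECT 1 FROM dictionary WHERE word = ?;",
            -1, &stmt, NULL) == SQLITE_OK)
    {
        sqlite3_bind_text(stmt, 1, word, -1, SQLITE_TRANSIENT);
        found = (sqlite3_step(stmt) == SQLITE_ROW);
    }
    sqlite3_finalize(stmt);
    return found;
}

int main(void)
{
    sqlite3* db = NULL;
    if (sqlite3_open("dictionary.db", &db) != SQLITE_OK) //assumed filename
    {
        fprintf(stderr, "Unable to open database\n");
        return 1;
    }
    printf("%s\n", checkWord(db, "a") ? "found" : "not found");
    sqlite3_close(db);
    return 0;
}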
Upvotes: 0
Reputation: 385104
A tree (std::map)? Or a hashmap (std::unordered_map)? Your linear search is obviously a brute-force solution! Both of the above will be substantially superior for multiple searches.
Of course, that only really helps if you are going to use this data multiple times per program run, which you didn't specify in your question. If not, there's not really much benefit in loading and parsing and storing all the data only to perform a single lookup then quit. Just put a break in on success, at least.
You imply that your input file is sorted. You could hack together a binary search solution with file seeking (which is really cheap) and snapping to the nearest newline on each iteration to determine roughly where all the words with the same leading (say) three characters are in your file. For a thousand entries, though, this is probably overkill.
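For completeness, a rough sketch of that seek-and-snap binary search, simplified to an exact lookup and assuming the file is sorted by first word and already lowercase (helper names are mine):

#include <fstream>
#include <sstream>
#include <string>

//first word of the first complete line starting strictly after pos,
//or "" if no complete line starts there
static std::string firstWordAfter(std::ifstream& in, std::streamoff pos)
{
    in.clear();
    in.seekg(pos);
    std::string line, word;
    std::getline(in, line); //discard the partial line we landed in
    if (!std::getline(in, line))
        return "";
    std::istringstream(line) >> word;
    return word;
}

bool containsWord(const std::string& path, const std::string& input)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return false;

    //probes always skip the line they land in, so check line 1 directly
    std::string line, word;
    std::getline(in, line);
    std::istringstream(line) >> word;
    if (word == input)
        return true;

    in.seekg(0, std::ios::end);
    std::streamoff lo = 0, hi = in.tellg();

    //binary search over byte offsets: narrow to the lowest offset whose
    //following line is >= input, then compare that line's first word
    while (lo < hi)
    {
        std::streamoff mid = lo + (hi - lo) / 2;
        std::string w = firstWordAfter(in, mid);
        if (!w.empty() && w < input)
            lo = mid + 1;
        else
            hi = mid;
    }
    return firstWordAfter(in, lo) == input;
}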
Upvotes: 6