SvaLopLop
SvaLopLop

Reputation: 979

C++, Algorithm for searching for a word on a line in a large file?

I'm trying to find the best method to look up a which lines in a large file contain a certain word.

For example if you had the following file:

cat dog monkey 
banana chair elephant 
monkey phone platypus cat

I'd like it to be able to return 0, 2 for "cat"

I'd expect the function prototype to look something like this:

std::vector<int> FindWords(std::string word);

I'd like to pre-process the file into some data structure, so that lockups can be done quickly giving the line numbers that word is contained on. I'm aware std::map could do this if there was only one instance of the word but there are more.

What is the most appropriate algorithm for this?

Upvotes: 0

Views: 904

Answers (3)

Cybercartel
Cybercartel

Reputation: 12592

Boyer–Moore string search algorithm is faster then a trie when you are searching for a single string. Most likely you can modify it for multiple string.

Upvotes: 0

sray
sray

Reputation: 584

Build a trie data structure for all the unique words in the file.

For each word in the trie, store the list of line numbers where the word is present in the file. This can be done in a single pass through the file.

You could use a map also to store the list of line numbers for each word, but a trie would be more compact.

C declarations for trie data structure added below. This should give you an idea of how to get started if you wanted to implement yourself.

/*
 * TRIE data structure defined for lower-case letters(a-z)
 */
typedef struct trie {
  char c;                           /* Letter represented by the trie node */
  struct trie *child[26];           /* Child pointers, one for each of the 26 letters of the alphabet */
  bool isTerminal;                  /* If any word ends at that node, TRUE, else FALSE */
  int counts;                       /* Number of lines the word ending at node occurs in the text */
  int lines[MAX_NUM];               /* Line numbers of the word occurences in the text */
} trie;

/*
 * Insert a word into the trie.
 * word - Word which is being inserted
 * line - Line number of word in the text.
 */
void insertToTrie(trie *node, const char *word, int line);

Upvotes: 2

W.F.
W.F.

Reputation: 13988

You could also use std::multimap or if even better std::unordered_multimap as you don't need to iterate through entire map collection only on the elements of a certain value.

Edit: Simple example:

#include <iostream>
#include <unordered_map>

int main() {
   std::unordered_multimap<std::string, int> mymap;
   mymap.insert(std::pair<std::string, int>("word", 1));
   mymap.insert(std::pair<std::string, int>("anotherword", 2));
   mymap.insert(std::pair<std::string, int>("word", 10));
   for (auto it = mymap.find("word"); it != mymap.end() && it->first == "word"; it++) {
      std::cout << it->second << std::endl;
   }

}

Upvotes: 0

Related Questions