Ivan
Ivan

Reputation: 776

Scan text and check whether it contains word from specified list

Requirement

  1. currently we have a list containing more than ten thousands keywords or sentences(the number is N)
  2. Input is a long character string , the length is L

Question: Check whether the character string contains keywords or sentences in list given

The question can be described as word filter on wikipedia, but I didn't find any algorithm on that page. The simplest way to fix this problem is iterator all the keywords or sentences and each time check whether the long text contains such substring. As we have a number of keywords, also considering the long text, the performance is very bad. It uses O(NL) time

Seems the better solution should be finished in O(L). Could anyone give some suggestion about this?

Upvotes: 2

Views: 1162

Answers (1)

Evgeny Kluev
Evgeny Kluev

Reputation: 24647

There are several approaches to this problem having time complexity O(M + L), where L is length of the string and M is combined length of all patterns:

  1. Aho–Corasick string matching algorithm.
  2. Construct a suffix tree for the string using Ukkonen's algorithm, then find the match for each pattern to this suffix tree.
  3. Construct a generalized suffix tree for set of patterns, then find the match between input string and this suffix tree.
  4. Construct a Suffix array for the string, and use it to search each pattern. This approach has time complexity O(M + L + N log L), where N is the number of patterns.
  5. Commentz-Walter algorithm.

You can find details for all these algorithms (except Commentz-Walter algorithm) in this book: Algorithms on Strings, Trees and Sequences by Dan Gusfield.


Several different (simpler) approaches may be used if you can unambiguously extract separate words/sentences from input string.

  1. Prepare a Bloom filter with size, large enough to guarantee low probability of false positives for N patterns. Add to Bloom filter bits, determined by hash functions of keywords/sentences. Then scan the string, extracting consecutive words/sentences, and check if these words/sentences can be found in the Bloom filter. Only if a word/sentence is present in the Bloom filter, search it in the list of patterns. This algorithm has expected time complexity O(M + L) and is space-efficient.
  2. Insert all patterns into a hash set. Then scan the string, extracting consecutive words/sentences, and check if any of them is in the hash set. This has the same expected time complexity O(M + L), is simpler than the Bloom filter approach, but is not space-efficient.
  3. Insert all patterns into a radix tree (trie). Then use it to search words/sentences from the string. This is not very different from the generalized suffix tree approach, but is simpler and has better performance. It has worst-case time complexity O(M + L), is probably somewhat slower than Bloom filter or hash set approaches, and may be made very space-efficient if necessary.

Upvotes: 4

Related Questions