Antonio Leitao


Is there an efficient algorithm to perform inverted full text search?

I have a limited list of thousands of keywords (each made up of one or more words) in a database. I want to efficiently find which of those keywords appear in a given input text, without having to test each keyword one by one (a full table scan). Being able to match some misspelled words in the text would be even better, but it is not essential. Any algorithm/article suggestions to solve this?

Upvotes: 5

Views: 1362

Answers (6)

user448810

Reputation: 17866

Search the internet for the Aho-Corasick algorithm.
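
For example, a minimal Python sketch, assuming the third-party pyahocorasick package is available (the keyword list and input text here are made up):

import ahocorasick

keywords = ["oracle", "market share", "full text search"]   # made-up keyword list

# Build the automaton once from the keyword list...
automaton = ahocorasick.Automaton()
for kw in keywords:
    automaton.add_word(kw, kw)
automaton.make_automaton()

# ...then scan any input text in a single linear pass.
text = "Oracle announced that its market share increased.".lower()
found = {kw for _, kw in automaton.iter(text)}
print(found)   # {'oracle', 'market share'}

The automaton is built once and reused for every input text, so each lookup costs roughly the length of the text rather than the size of the keyword list.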

Upvotes: 0

Matthieu M.

Reputation: 300349

The right way to do so would be to build an Automaton.

An Automaton is (by definition) built to recognize words of a language, which in this case is your collection of keywords.

Of course, in your case, it's a bit more complicated, because you may have overlapping keywords, but it's not that different either.

So, from a pure theoretical point of view:

  • Build an Automaton from your set of keywords
  • Scan the body of text using that Automaton

The latter part is O(n), the former may be a bit trickier.
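
A minimal sketch of that idea in Python, essentially Aho-Corasick: a trie over the keywords plus failure links computed breadth-first, then one linear scan of the text (the keyword list below is made up):

from collections import deque

def build_automaton(keywords):
    # One state per trie node: outgoing edges, failure link, keywords ending here.
    states = [{"next": {}, "fail": 0, "out": []}]
    for kw in keywords:
        s = 0
        for ch in kw:
            if ch not in states[s]["next"]:
                states.append({"next": {}, "fail": 0, "out": []})
                states[s]["next"][ch] = len(states) - 1
            s = states[s]["next"][ch]
        states[s]["out"].append(kw)
    # Breadth-first pass to fill in failure links (parents before children).
    queue = deque(states[0]["next"].values())
    while queue:
        s = queue.popleft()
        for ch, child in states[s]["next"].items():
            queue.append(child)
            f = states[s]["fail"]
            while f and ch not in states[f]["next"]:
                f = states[f]["fail"]
            states[child]["fail"] = states[f]["next"].get(ch, 0)
            states[child]["out"] += states[states[child]["fail"]]["out"]
    return states

def scan(states, text):
    # Single O(n) pass over the text, yielding every keyword that occurs.
    s = 0
    for ch in text:
        while s and ch not in states[s]["next"]:
            s = states[s]["fail"]
        s = states[s]["next"].get(ch, 0)
        for kw in states[s]["out"]:
            yield kw

keywords = ["oracle", "market share", "text"]        # made-up keyword list
states = build_automaton(keywords)
sample = "Oracle announced that its market share in text search grew.".lower()
print(set(scan(states, sample)))                     # {'oracle', 'market share', 'text'}

Note that this matches at the character level, so a keyword can also match inside a longer word; checking word boundaries around each hit is a straightforward refinement.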

Upvotes: 0

Jalayn

Reputation: 9284

I don't know which database you are using, but if it is Oracle you should have access to the Oracle Text functionality (or you can ask your DBA to enable it). With the help of additional functions such as CONTAINS and proper use of Oracle Text indexes, you can achieve exactly what you want, even looking up "misspelled" words. This is done by combining CONTAINS with a function that computes the Levenshtein distance between two strings; in Oracle Text, this function is FUZZY.

There is a perfect example for you in this Oracle documentation.

I don't know other databases well enough, but I am pretty sure the other major vendors have something in place for searching text as well; a quick web search turns up full-text search features in most of them.

Anyway, it is a lot quicker to use built-in DBMS functions/procedures than to create your own custom functions or to search from your programming language (though thousands of keywords is not that much).

Edit: After reading your question again, along with Dean Harding's answer, I felt I did not answer the question properly. With Oracle Text, instead of the CONTAINS function you can use the MATCHES function (see Paragraph 4.1.3), which does exactly that: it queries a list of keywords stored in a table against some text and returns the ids of the keywords that were found. Below is the example from the documentation (with my own comments added):

create table queries (
  query_id      number,
  query_string  varchar2(80)
);

-- Here we populate the table with the keywords
insert into queries values (1, 'oracle');
insert into queries values (2, 'larry or ellison');
insert into queries values (3, 'oracle and text');
insert into queries values (4, 'market share');

create index queryx on queries(query_string)
  indextype is ctxsys.ctxrule;

-- This query returns the ids of the matched keywords
select query_id from queries
 where matches(query_string, 
               'Oracle announced that its market share in databases 
                increased over the last year.')>0;

I hope that it helps a little bit more than my first attempt.

Edit 2: just to add that this method does not perform a full table scan, since it uses a domain index.

Upvotes: 1

Dean Harding

Reputation: 72678

I think some of the answers so far are misunderstanding the problem that is being asked. My understanding is that you've got a (largish) list of words and a (largish) body of text. You want to know which words both lists have in common, is that correct?

If so, this isn't really a full-text problem at all. Basically, you just have two lists of words (your original keywords, and a list of words from the input text). If you sort both lists you can scan through both lists simultaneously and extract the words that are in common.

Assuming the list of keywords is already sorted, you can extract and sort the words from the body of text in O(n log n) time, and then scanning through both lists simultaneously is O(n + m) (where n is the number of words in the body of text and m is the number of words in the keyword list).
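
A rough Python sketch of that idea (single-word keywords only; the lists below are made up):

import re

def common_words(sorted_keywords, text):
    # Extract and sort the distinct words of the text: O(n log n).
    words = sorted(set(re.findall(r"[a-z0-9']+", text.lower())))
    # Then scan both sorted lists simultaneously: O(n + m).
    found, i, j = [], 0, 0
    while i < len(words) and j < len(sorted_keywords):
        if words[i] == sorted_keywords[j]:
            found.append(words[i])
            i += 1
            j += 1
        elif words[i] < sorted_keywords[j]:
            i += 1
        else:
            j += 1
    return found

# The keyword list is assumed to already be sorted (e.g. by the database).
print(common_words(["fox", "oracle", "text"],
                   "Oracle announced new text features."))   # ['oracle', 'text']

Multi-word keywords would need extra handling, for example by checking the positions of their component words after this step.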

Upvotes: 4

Jonno

Reputation: 799

You can use any of the solutions described here; however, they all fundamentally do the same thing (and then add bells and whistles).

So the basic approach is:

  1. Split your search input into words
  2. Remove noise (this might be very common words, punctuation, etc)
  3. Your database should contain your keyphrases (you said a keyword could have multiple words, so it's really a keyphrase)
  4. Keep a map of the words in each keyphrase

e.g.

keyphraseID: 1  keyphrase: "the quick brown fox"

wordID: 1  word: the
wordID: 2  word: quick

mapID: 1  keyphraseID: 1  wordID: 1  position: 1
mapID: 2  keyphraseID: 1  wordID: 2  position: 2

Now you have a basic inverted index, and by putting a unique index on the word column you avoid a full table scan.
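
A rough Python sketch of the lookup side, assuming the mapping above has been loaded into memory (word positions are ignored here for brevity; the position column is what would let you verify that the words are adjacent and in order):

import re
from collections import defaultdict

keyphrases = {1: "the quick brown fox", 2: "lazy dog"}   # made-up keyphrase table

# Build the word -> keyphrase-id map (the inverted index).
index = defaultdict(set)
for kid, phrase in keyphrases.items():
    for word in phrase.split():
        index[word].add(kid)

def candidate_keyphrases(text):
    # Look up only the words that actually occur in the input text,
    # then keep the keyphrases whose every word was seen.
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = set()
    for w in words:
        hits |= index.get(w, set())
    return [kid for kid in hits
            if all(w in words for w in keyphrases[kid].split())]

print(candidate_keyphrases("The quick brown fox jumps over the fence."))   # [1]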

Fuzziness/spelling adjustments can be added by including things like Levenshtein distance, Soundex or latent semantic indexing; a lot depends on what you are doing, and there is plenty of research available on this.
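
For example, a plain-Python Levenshtein distance (just a sketch; in practice you would usually rely on a library or on the database's built-in support):

def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("keyword", "keywrod"))   # 2

A small distance threshold (say 1 or 2) is usually enough to catch common misspellings without producing too many false matches.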

Edit:

You could also look at stemming, where you reduce words back to a common root, e.g. 'fixing' and 'fixed' both stem to 'fix'. A common stemming algorithm is the Porter stemmer.
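
For example, using the Porter stemmer from the third-party NLTK package (an assumption; any stemming library would do):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["fixing", "fixed", "fixes"]])   # ['fix', 'fix', 'fix']

Stemming both the keyphrase words and the input words before the lookup makes the index tolerant of simple inflections.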

Upvotes: 1

Samuel García

Reputation: 2225

You can use solutions like Apache Lucene or Apache Solr (a Lucene-based HTTP full-text service) to achieve this.

Apache Lucene provides an API to index whatever data you can imagine.

Based on your needs, you can choose one or the other:

  • Lucene: an index embedded in your application. Lucene offers file-system or in-memory index storage. One important consideration is that Lucene is not available for every programming language.
  • Solr: a centralized index service. If you need to access the index in a distributed way, Solr is the way to go. It also abstracts away some of Lucene's harder concepts, making it easier to use. Because it exchanges information over HTTP using XML, it is a good fit for heterogeneous platforms.

About the spellchecking feature, Solr includes spellcheck "suggestions" via a built-in component.

Some tutorials/info:

  • Solr: the wiki is the only resource you need.
  • Lucene: on Stack Overflow you can find questions asking for tutorials, like this one

Upvotes: 1
