Reputation:
I have a list of several thousand keywords (each consisting of one or more words) in a database. I want to efficiently find which of those keywords appear in a given input text, without testing the keywords one by one (a full table scan). Being able to also match some misspelled words in the text would be even better, but it is not essential. Any algorithm/article suggestions to solve this?
Upvotes: 5
Views: 1362
Reputation: 300349
The right way to do so would be to build an Automaton.
An Automaton is (by definition) built to recognize words of a language, which in this case is your collection of keywords.
Of course, in your case, it's a bit more complicated, because you may have overlapping keywords, but it's not that different either.
So, from a pure theoretical point of view, there are two steps: build the automaton from your collection of keywords, then run the input text through it. The latter part is O(n) in the length of the text; the former may be a bit trickier.
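The classic construction for this is the Aho-Corasick automaton, which handles overlapping keywords with failure links. Below is a minimal Python sketch under my own assumptions (the answer itself stays abstract): build_automaton and search are hypothetical names, matching is case-sensitive, and word boundaries are not checked, so 'oracle' would also match inside 'oracles'.

from collections import deque

def build_automaton(keywords):
    # Trie: goto[node] maps a character to the next node; node 0 is the root.
    goto, fail, out = [{}], [0], [set()]
    for word in keywords:
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(word)
    # BFS to compute failure links (longest proper suffix that is also
    # a prefix of some keyword) and merge their output sets.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def search(text, automaton):
    # One pass over the text: each character is examined a bounded
    # number of times, which is where the O(n) scan comes from.
    goto, fail, out = automaton
    node, found = 0, set()
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        found |= out[node]
    return found

automaton = build_automaton(["market share", "oracle", "text"])
print(search("oracle announced that its market share grew", automaton))
# {'oracle', 'market share'}  (set order may vary)

Building the automaton is linear in the total length of the keywords; after that, scanning any input text costs O(n) regardless of how many keywords there are.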
Upvotes: 0
Reputation: 9284
I don't know which database you are using, but if it is Oracle you should have access to the Oracle Text functionality (or you can ask your DBA to enable it). With the help of functions such as CONTAINS and proper usage of Oracle Text indexes, you may achieve exactly what you want, even looking up "misspelled" words. This is done by combining CONTAINS with a function that computes the Levenshtein distance between two strings; in Oracle Text, this function is FUZZY.
There is a perfect example for you in this Oracle documentation.
I don't know the other databases well enough, but I am pretty sure the major vendors have something in place for searching text; a quick search shows full-text search support in most of them. In any case, it is a lot quicker to use built-in DBMS functions/procedures than to write your own custom functions, or even to search in your programming language (though thousands of keywords is not that much).
Edit: After reading your question again and Dean Harding's answer, I felt I did not answer the question properly. With Oracle Text, instead of the CONTAINS function you may use the MATCHES function (see paragraph 4.1.3), which does exactly that: it queries a list of keywords stored in a table against some text and returns the ids of the keywords that were found. I'll copy below the example found in the documentation (with my own comments added):
create table queries (
  query_id     number,
  query_string varchar2(80)
);

-- Here we populate the table with the keywords
insert into queries values (1, 'oracle');
insert into queries values (2, 'larry or ellison');
insert into queries values (3, 'oracle and text');
insert into queries values (4, 'market share');

create index queryx on queries(query_string)
  indextype is ctxsys.ctxrule;

-- This query will return the ids of the matched keywords
select query_id from queries
 where matches(query_string,
   'Oracle announced that its market share in databases
    increased over the last year.') > 0;
I hope that it helps a little bit more than my first attempt.
Edit 2: Just to add that you don't perform a full table scan with this method, since you are using a domain index.
Upvotes: 1
Reputation: 72678
I think some of the answers so far are misunderstanding the problem that is being asked. My understanding is that you've got a (largish) list of words and a (largish) body of text. You want to know which words both lists have in common, is that correct?
If so, this isn't really a full-text problem at all. Basically, you just have two lists of words (your original keywords, and a list of words from the input text). If you sort both lists you can scan through both lists simultaneously and extract the words that are in common.
Assuming the list of keywords is already sorted, you can extract and sort the words from the body of text in O(n log n) time, and then scanning through both lists simultaneously is O(n + m) (where n is the number of words in the body of text and m is the number of words in the keyword list).
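As a minimal sketch of that sort-and-merge scan (in Python for concreteness; the regex tokenization and lowercasing are my assumptions, not part of the answer):

import re

def common_words(sorted_keywords, text):
    # Tokenize, dedupe, and sort the words from the body of text: O(n log n).
    words = sorted(set(re.findall(r"\w+", text.lower())))
    # Scan both sorted lists in lockstep: O(n + m).
    found, i, j = [], 0, 0
    while i < len(sorted_keywords) and j < len(words):
        if sorted_keywords[i] == words[j]:
            found.append(sorted_keywords[i])
            i += 1
            j += 1
        elif sorted_keywords[i] < words[j]:
            i += 1
        else:
            j += 1
    return found

keywords = sorted(["dog", "fox", "zebra"])
print(common_words(keywords, "The quick brown fox jumps over the lazy dog"))
# ['dog', 'fox']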
Upvotes: 4
Reputation: 799
You can use any of the solutions described here; however, they all fundamentally do the same thing (and then add bells and whistles). So the basic approach is: split each keyphrase into its individual words, store each distinct word once in a word table, and keep a map table that records which word appears at which position in which keyphrase. For example:
keyphrase table:
  keyphraseID: 1, keyphrase: "the quick brown fox"

word table:
  wordID: 1, word: the
  wordID: 2, word: quick

map table:
  mapID: 1, keyphraseID: 1, wordID: 1, position: 1
  mapID: 2, keyphraseID: 1, wordID: 2, position: 2
Now you have a basic inverted index, and by putting a unique index on the word column you avoid a full table scan.
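Here is a minimal in-memory sketch of that lookup, in Python rather than SQL for brevity; the dictionaries stand in for the word and map tables above, and all names are my own:

from collections import defaultdict

keyphrases = {1: "the quick brown fox", 2: "lazy dog"}

# word -> list of (keyphraseID, position): the "map" table above.
word_map = defaultdict(list)
for kid, phrase in keyphrases.items():
    for pos, word in enumerate(phrase.split(), 1):
        word_map[word].append((kid, pos))

def find_keyphrases(text):
    lowered = " ".join(text.lower().split())
    # Word lookups replace the full table scan: only keyphrases that
    # share at least one word with the text become candidates.
    candidates = {kid for word in set(lowered.split())
                  for kid, _pos in word_map.get(word, [])}
    # Verify order and adjacency against the full phrase (a real
    # version would use the stored positions and word boundaries).
    return sorted(kid for kid in candidates if keyphrases[kid] in lowered)

print(find_keyphrases("The lazy dog watched the quick brown fox"))
# [1, 2]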
Fuzziness/spelling adjustments can be added by including things like Levenshtein distance, Soundex, or latent semantic indexing; a lot depends on what you are doing, and there is plenty of research available on this.
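For instance, Levenshtein distance itself is a short dynamic-programming routine; this sketch is purely illustrative and not tied to any database:

def levenshtein(a, b):
    # Classic DP over a single rolling row: O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3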
Edit: You could also look at stemming, where you reduce words to a common root, e.g. 'fixing' and 'fixed' both stem to 'fix'. A common stemming algorithm is the Porter stemmer.
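As a quick illustration (assuming NLTK is installed; this usage example is mine, not the answer's):

from nltk.stem import PorterStemmer  # assumes the nltk package is installed

stemmer = PorterStemmer()
print(stemmer.stem("fixing"), stemmer.stem("fixed"))  # fix fix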
Upvotes: 1
Reputation: 2225
You can use solutions like Apache Lucene or Apache Solr (a Lucene-based HTTP full-text search service) to achieve this.
Apache Lucene provides an API to index whatever data you can imagine. Based on your needs you can choose one or the other: Lucene if you want to embed the search library directly in your application, or Solr if you prefer a standalone server with an HTTP API.
As for the spellchecking feature, Solr includes spellcheck "suggestions" via a built-in component.
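As a rough sketch of what querying that component over HTTP can look like (the core name, the /spell handler, and its configuration in solrconfig.xml are all assumptions here, so treat this as a shape, not a recipe):

import requests  # third-party HTTP client

# Hypothetical core "mycore" with a spellcheck-enabled /spell handler.
resp = requests.get(
    "http://localhost:8983/solr/mycore/spell",
    params={"q": "oracel", "spellcheck": "true", "wt": "json"},
)
print(resp.json()["spellcheck"]["suggestions"])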
Some tutorials/info:
Upvotes: 1