Reputation: 61
I have a Java-based application and a set of keywords in a MySQL database (about 3M keywords in total; each may consist of more than one word, e.g. “memory”, “old house”, “European Union law”, etc.).
The user interacts with the application by uploading a document with arbitrary text (several pages most of the time). What I want to do is find whether and where any of the 3 million keywords appear in the document.
I have tried looping over the keywords and searching the document for each one, but this is not efficient at all. I am wondering if there is a library that can perform the search in a more time-efficient manner.
I would greatly appreciate any help.
Upvotes: 6
Views: 1205
Reputation: 5414
The Apache Lucene project may be helpful.
Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
You can find some useful tutorials here.
Upvotes: 5
Reputation: 32145
You can also use The Lemur Project, which is available on SourceForge:
The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software, including the Indri search engine and ClueWeb09 dataset.
As Taher recommended, Apache Lucene is also a nice tool. I've used both of them and they're great.
Upvotes: 1
Reputation: 8865
You could try using a Bloom filter: http://en.wikipedia.org/wiki/Bloom_filter. Load all the keywords into the filter once, then check each word (or sequence of consecutive words) from the document against it to find possible matches. Remember that a Bloom filter can return false positives, although it never returns false negatives. So if the filter reports any positives, confirm them with a SQL query like 'select keyword from keywordtable where keyword in (positives from bloom filter)' to identify exactly which keywords are present in the uploaded document.
A Java implementation of a Bloom filter is available in the Guava library: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/hash/BloomFilter.html
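To make the idea concrete, here is a rough sketch of how the check could look. It is not a definitive implementation. To keep it dependency-free it uses a small hand-rolled Bloom filter on top of java.util.BitSet rather than Guava's BloomFilter, and an in-memory Set stands in for the confirming SQL query. The names KeywordScanner, SimpleBloomFilter, findMatches and maxWords are all made up for illustration, and the sketch assumes the keywords are stored lowercase with single spaces between words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class KeywordScanner {

    /** Minimal Bloom filter (illustrative only; Guava's BloomFilter is the better choice in practice). */
    static final class SimpleBloomFilter {
        private final BitSet bits;
        private final int size;
        private final int hashCount;

        SimpleBloomFilter(int size, int hashCount) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashCount = hashCount;
        }

        void put(String s) {
            int h1 = s.hashCode();
            int h2 = secondHash(s);
            for (int i = 0; i < hashCount; i++) {
                bits.set(Math.floorMod(h1 + i * h2, size));
            }
        }

        boolean mightContain(String s) {
            int h1 = s.hashCode();
            int h2 = secondHash(s);
            for (int i = 0; i < hashCount; i++) {
                if (!bits.get(Math.floorMod(h1 + i * h2, size))) {
                    return false; // definitely not a keyword
                }
            }
            return true; // possibly a keyword, must be confirmed
        }

        // FNV-1a style second hash, forced odd so it is never zero.
        private static int secondHash(String s) {
            int h = 0x811C9DC5;
            for (int i = 0; i < s.length(); i++) {
                h ^= s.charAt(i);
                h *= 0x01000193;
            }
            return h | 1;
        }
    }

    /** A confirmed keyword and the index of the document word where it starts. */
    static final class Match {
        final int wordIndex;
        final String keyword;

        Match(int wordIndex, String keyword) {
            this.wordIndex = wordIndex;
            this.keyword = keyword;
        }
    }

    static SimpleBloomFilter buildFilter(Collection<String> keywords, int size, int hashCount) {
        SimpleBloomFilter filter = new SimpleBloomFilter(size, hashCount);
        for (String k : keywords) {
            filter.put(k);
        }
        return filter;
    }

    /**
     * Checks every run of 1..maxWords consecutive words against the filter.
     * exactKeywords is a stand-in for the confirming SQL query.
     */
    static List<Match> findMatches(String document, SimpleBloomFilter filter,
                                   Set<String> exactKeywords, int maxWords) {
        String[] words = document.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        List<Match> matches = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            if (words[start].isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            StringBuilder phrase = new StringBuilder();
            for (int len = 1; len <= maxWords && start + len <= words.length; len++) {
                if (len > 1) {
                    phrase.append(' ');
                }
                phrase.append(words[start + len - 1]);
                String candidate = phrase.toString();
                if (filter.mightContain(candidate) && exactKeywords.contains(candidate)) {
                    matches.add(new Match(start, candidate));
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("memory", "old house", "european union law"));
        SimpleBloomFilter filter = buildFilter(keywords, 1 << 16, 3);
        String doc = "The old house: European Union law and memory.";
        for (Match m : findMatches(doc, filter, keywords, 3)) {
            System.out.println(m.wordIndex + ": " + m.keyword);
        }
    }
}
```

The point of this layout is that the document is scanned once, and the work per word depends on the longest keyword (maxWords) rather than on the 3M keywords. With the real data you would size the filter for the actual keyword count and batch the positives into one query instead of checking them one at a time.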
Upvotes: 1