Anuj Mehta
Anuj Mehta

Reputation: 1138

Java: Very quick way of searching a string in huge dictionary

I had a huge dictionary containing around 1.2 million strings. As an input I will get a sentence and I need to check for each word of input sentence whether it is present in dictionary or not?

Performance is the highest priority for me hence I would like to keep this dictionary in-memory. I want to complete my dictionary lookup in less than a millisecond. Kindly suggest how I can achieve this? Any existing external API which do this?

Upvotes: 2

Views: 1928

Answers (3)

Byungjoon Lee
Byungjoon Lee

Reputation: 913

I think the 1.2 million strings will not fit in memory or easily overflow the size limitation of your memory (consider a bad case where the average string length 256).

If some kind of pre-processing is allowed, I think you'd better first reduce the sequence of strings into a sequence of words. It means that you first convert your data into another set of data that will easily fit in memory and won't overflow.

After that, I think you can depend on the in-memory data structures such as HashMap.

Upvotes: 0

pratZ
pratZ

Reputation: 3346

If you are open to using external apis, I would suggest you go for elastic search's percolate api. Performance being the priority, this exactly fits your requirement.

The java api can be found here.

You can index all the keywords and then provide it a document(in your case the sentence)

Indexing:

for(String obj:keywordLst){
    client.prepareIndex("myindex", ".percolator", obj)
            .setSource(XContentFactory.jsonBuilder()
                .startObject()
                    .field("query", QueryBuilders.matchPhraseQuery("content", obj)) 
                .endObject())
            .setRefresh(true) 
    .execute().actionGet();
}

Searching:

XContentBuilder docBuilder = XContentFactory.jsonBuilder().startObject();
docBuilder.field("doc").startObject(); 
docBuilder.field("content", text);
docBuilder.endObject(); //End of the doc field
docBuilder.endObject(); //End of the JSON root object

PercolateResponse response = client.preparePercolate().setSource(docBuilder)
            .setIndices("myindex").setDocumentType("type")
            .execute().actionGet();


for(PercolateResponse.Match match : response) {
    //found matches
}

Upvotes: 1

Joop Eggen
Joop Eggen

Reputation: 109547

So you only need a set of words from the dictionary and see whether it contains the set of words of the sentence.

Set<String> dictionaryIndex = new HashSet<>();
Set<String> sentence = new HashSet<>();

if (!dictionaryIndex.containsAll(sentence)) {
    ...

However if you want to do more, consider a database, maybe an embedded in-memory database, like H2 or Derby. You can then do more, and a query would be:

SELECT COUNT(*) FROM dictionary WHERE word IN('think', 'possitive', 'human')

You might even consider a spelling library. They keep smaller dictionary and use stemming: 'learn' for learning, learner, learned, learns.

Upvotes: 2

Related Questions