Roam

Reputation: 4949

Integrating Elasticsearch & Stanford NLP without re-indexing

We've been using Elasticsearch in our system. I have used its analyzers and queries, but I haven't gone deep into its indexing. As of now, I don't know how far ES lets us work with the Lucene (inverted) indexes it holds in its shards.

We're now looking at a range of NLP features, NER for one, and Stanford NLP appealed to us.

As far as I can tell, there's no plug-in that makes these two packages work together(?)

I haven't had a deep look into Stanford NLP. However, as far as I can see, it works entirely on its own indexes: whichever object or type is passed to it, Stanford NLP indexes it itself and goes from there.

This would make the system maintain two different indexes for the same set of documents, one for ES and one for Stanford NLP, which would be costly.

Is there a way to get around this?

One scenario I have in mind: make Stanford NLP work on the Lucene segments, i.e. the inverted indexes that ES has already built. In this case:

1.) Can Stanford NLP use Lucene indexes without re-indexing anything for itself? I don't know Stanford NLP's indexing structure, or even how far it does or doesn't use Lucene.

2.) Are there any restrictions on using the Lucene indexes in ES shards? Would we hit a wall if we used these Lucene segments directly as they are, bypassing ES?

I'm trying to put things together, and it's all up in the air for now. Sorry for the naive question.

I'm aware of OpenNLP and its plug-in. I haven't checked, but I'm guessing it wouldn't be "double-indexing" and would use ES's indexes(?) However, it's Stanford NLP we're after.

TIA.

Upvotes: 4

Views: 2459

Answers (2)

Steen

Reputation: 6849

There is a repository on GitHub that has been experimenting with NER on Elasticsearch using OpenNLP: github page. It uses the Elasticsearch plugin architecture, so it should be easy to test in an ES instance. I haven't tried the plugin, but I have experience using OpenNLP from previous jobs, and it has a very solid NER parser.

Upvotes: 0

Christopher Manning

Reputation: 9450

Stanford NER neither uses a Lucene/Solr index nor builds its own text index. It maps a piece of text or a token sequence to a sequence of tokens with NER annotations.
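To make the shape of that mapping concrete, here is a minimal sketch in Python. The `tagged` list is hand-written, hypothetical tagger output (it is not produced by Stanford NER here), and `collapse_entities` is an illustrative helper, not part of any library. It assumes the common convention that each token carries one label and that "O" marks tokens outside any entity.

```python
from itertools import groupby

# ASSUMPTION: hand-written example of per-token NER output, where "O"
# means "not part of an entity". A real tagger would produce this list.
tagged = [
    ("Ada", "PERSON"),
    ("Lovelace", "PERSON"),
    ("visited", "O"),
    ("London", "LOCATION"),
    (".", "O"),
]

def collapse_entities(tagged_tokens):
    """Merge runs of adjacent tokens sharing a non-"O" label into entity spans."""
    entities = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label == "O":
            continue  # skip background tokens
        entities.append((label, " ".join(token for token, _ in run)))
    return entities

print(collapse_entities(tagged))
```

Nothing in this step reads or writes an index: the input is text and the output is the same tokens with labels attached.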

Typically, you would run NER on each document on ingestion, around the time of tokenization, prior to indexing, and then index each document for entities as well as words.
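The ingestion flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_tagger` and `TOY_GAZETTEER` are hypothetical stand-ins for a real NER call, the `entities` field layout and the index name `docs` are invented for the example, and the output is just the newline-delimited action/source pair that the Elasticsearch bulk API accepts. No cluster is contacted.

```python
import json
from collections import defaultdict

# ASSUMPTION: toy lookup table standing in for a real NER model.
TOY_GAZETTEER = {"London": "LOCATION", "Stanford": "ORGANIZATION"}

def toy_tagger(text):
    """Hypothetical stand-in for an NER call: returns (token, label) pairs."""
    tokens = [t.strip(".,;:!?") for t in text.split()]
    return [(t, TOY_GAZETTEER.get(t, "O")) for t in tokens if t]

def build_index_doc(text, tagger=toy_tagger):
    """Run NER once at ingestion and store the entities next to the raw text."""
    entities = defaultdict(list)
    for token, label in tagger(text):
        if label != "O" and token not in entities[label]:
            entities[label].append(token)
    return {"text": text, "entities": dict(entities)}

def to_bulk_lines(doc_id, doc, index_name="docs"):
    """Format one document as the action/source line pair used by the bulk API."""
    action = {"index": {"_index": index_name, "_id": doc_id}}
    return json.dumps(action) + "\n" + json.dumps(doc) + "\n"

doc = build_index_doc("Researchers from Stanford met in London.")
print(to_bulk_lines("1", doc))
```

With this layout there is still only one index: the entity values are ordinary fields on the same document, so they are searchable alongside the words.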

I know of no existing Elasticsearch plugin for Stanford NER, but it may be informative to look at how people have done this with Solr: http://www.searchbox.com/named-entity-recognition-ner-in-solr/ . Both Solr and Elasticsearch use Lucene Analyzers and indexes internally.

Upvotes: 6
