Denis Kulagin

Reputation: 8906

Full text search by summaries

Is it possible to create a summary of a large document using an out-of-the-box search engine such as Lucene, Solr, or Sphinx, and then search for the documents most relevant to a query?

I don't need to search inside the document or create a snippet. I just need the 5 documents that best match the query.

Update: more specifically, I don't want the engine to keep the whole document, only its "summary" (you may call it index information or a TF-IDF representation).

Upvotes: 0

Views: 715

Answers (3)

XL Zheng

Reputation: 363

Update: more specifically, I don't want the engine to keep the whole document, only its "summary" (you may call it index information or a TF-IDF representation).

To answer your updated question: Lucene/Solr fit your needs. For the 'summary', you have the option of not storing the original text by specifying:

 org.apache.lucene.document.Field.Store.NO

If you add the 'summary' as an org.apache.lucene.document.TextField, it will be tokenized and indexed. The index then keeps the term statistics (the TF-IDF information) needed to search it, without the original text.

Upvotes: 1

Alessandro Benedetti

Reputation: 1114

only its "summary" (you may call it index information or a TF-IDF representation).

What you are looking for seems quite standard:

  • Apache Lucene [1], if you are looking for a library
  • Apache Solr or Elasticsearch, if you are looking for a production-ready enterprise search server.

A Lucene-based search engine works [2] by building an inverted index for each field in your documents (plus a set of additional data structures required by other features).

What you apparently don't want is to store the content of a field, which means taking the text content and storing it in full (compressed) in the index so that it can be retrieved later.
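To make the difference between indexing and storing concrete, here is a minimal sketch in plain Java. It is not how Lucene is implemented; the class `SummaryIndex`, its methods, the naive tokenizer, and the simple TF-IDF formula are all assumptions chosen for illustration. It keeps only term-to-document postings, drops the text once it has been tokenized, and can still rank documents for a query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Lucene or Solr.
class SummaryIndex {
    // term -> (docId -> term frequency). This is the only per-document data kept.
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();
    private int docCount = 0;

    // Tokenize the text and update the postings; the text itself is not retained.
    void add(String docId, String text) {
        docCount++;
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Score each document by the summed tf * idf of the query terms
    // and return at most topN document ids, best first.
    List<String> search(String query, int topN) {
        Map<String, Double> scores = new HashMap<>();
        for (String term : tokenize(query)) {
            Map<String, Integer> docs = postings.get(term);
            if (docs == null) {
                continue; // term never indexed
            }
            // Simplified idf (an assumption, not Lucene's exact formula).
            double idf = Math.log(1.0 + (double) docCount / docs.size());
            for (Map.Entry<String, Integer> e : docs.entrySet()) {
                scores.merge(e.getKey(), e.getValue() * idf, Double::sum);
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return new ArrayList<>(ranked.subList(0, Math.min(topN, ranked.size())));
    }

    // Naive tokenizer: lowercase, split on anything that is not a letter or digit.
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }
}
```

Calling `add(id, text)` for each document and then `search(query, 5)` returns the ids of up to five best-matching documents. Because the text is not kept, the index cannot return the document body or a snippet, which matches the requirement in the question.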

In Lucene and Solr this is a matter of configuration.

Summarisation is a completely different NLP task and is probably not what you need.

Cheers

[1] http://lucene.apache.org/index.html

[2] https://sease.io/2015/07/26/exploring-solr-internals-the-lucene-inverted-index/

Upvotes: 1

Mysterion

Reputation: 9320

Basically, if you want a summarization feature, there are plenty of ways to do it, for example TextRank (there is a big article on the wiki), with tons of implementations available in NLTK and other libraries. However, it will not help you with querying; you will still need to index the result somewhere.

I think you could achieve something like this using a feature called More Like This. It exists in Lucene, Solr, and Elasticsearch. The idea behind it is that if you send a query (the raw text of a document), the search engine extracts the most relevant words from it (which reminds me of summarization) and then looks in the inverted index to find the top N similar documents. It will not discard the text, though; it performs a "like" operation based on TF-IDF metrics.
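The term-extraction step described above can be sketched roughly as follows. This is plain Java written for illustration, not the actual More Like This implementation of any of these engines; the class `InterestingTerms`, its `select` method, and the weighting formula are assumptions. Given the text of a document and the document frequencies from an index, it picks the terms with the highest tf * idf, which would then be sent as an ordinary query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not the Lucene MoreLikeThis class.
class InterestingTerms {
    // Return up to maxTerms terms of the text, ordered by descending tf * idf.
    // docFreq maps a term to the number of indexed documents that contain it;
    // numDocs is the total number of indexed documents.
    static List<String> select(String text, Map<String, Integer> docFreq,
                               int numDocs, int maxTerms) {
        // Count term frequencies in the input text.
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tf.merge(t, 1, Integer::sum);
            }
        }
        // Weight each term; rare terms in the index get a higher idf.
        Map<String, Double> weight = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer df = docFreq.get(e.getKey());
            if (df == null) {
                continue; // term is not in the index, so it cannot match anything
            }
            double idf = Math.log(1.0 + (double) numDocs / df);
            weight.put(e.getKey(), e.getValue() * idf);
        }
        List<String> terms = new ArrayList<>(weight.keySet());
        terms.sort((a, b) -> Double.compare(weight.get(b), weight.get(a)));
        return new ArrayList<>(terms.subList(0, Math.min(maxTerms, terms.size())));
    }
}
```

Very common words score low because of their small idf, so they drop out of the selection even when they occur often in the input text.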

References for MLT in Elasticsearch, Lucene, Solr

Upvotes: 1
