yogesh

Reputation: 1

Index file content and custom metadata separately with Solr3.3

I am doing a POC on content/text search using Solr 3.3. My requirement is that documents are initially indexed along with their content and their custom metadata. After the documents are indexed and available for searching, users can change a document's custom metadata. However, once a document has been added to the index, its content cannot be updated. When a user updates the custom metadata, the index has to be updated so that searches reflect the change. But during the index update the file content is indexed again even though it has not changed, which delays the metadata update.

So I wanted to check whether there is a way to skip content indexing and update just the metadata. Or do I have to store the content and the metadata in separate indexes, i.e. documentId and content in one index, and documentId and custom metadata in another? In that case, how can I query across these two indexes and return the result?

Upvotes: 0

Views: 1045

Answers (2)

Tirthankar Chatterjee

Reputation: 11

We tried this and it should work. Take a snapshot of what you have, basically the SolrInputDocument object, before you send it to Lucene. Serialize and compress the object, then assign it to an additional field in your schema. Make that field a binary field.

When you want to update one of the fields, fetch the binary field, deserialize it, append or update the values of the fields you are interested in, and re-feed the document to Lucene.

Remember to store, as one of the fields inside the SolrInputDocument, the XML containing the text extracted by Tika, since that text is what is used for search and indexing.

The only downside: your index will grow a little, but you will get what you want without re-feeding the data.
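The snapshot round trip described above can be sketched roughly as follows. This is only an illustration, not the answerer's code: it uses plain Python dicts and JSON in place of a serialized SolrInputDocument, and the field name `snapshot` and the helper names are assumptions. The base64 step is there because a binary field is typically sent as a base64 string.

```python
import base64
import json
import zlib


def snapshot_document(doc):
    """Serialize, compress, and base64-encode a document dict."""
    raw = json.dumps(doc, sort_keys=True).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def restore_document(snapshot):
    """Reverse snapshot_document: decode, decompress, and parse the dict."""
    raw = zlib.decompress(base64.b64decode(snapshot))
    return json.loads(raw.decode("utf-8"))


def apply_metadata_update(stored_doc, updates):
    """Rebuild the full document from its snapshot and overlay new metadata.

    stored_doc is what a search would return; it must carry the
    hypothetical 'snapshot' field. The result is a complete document
    that can be sent to the indexer again.
    """
    doc = restore_document(stored_doc["snapshot"])
    doc.update(updates)
    doc["snapshot"] = snapshot_document(doc)  # keep the snapshot current
    return doc


# Initial indexing: keep the extracted text inside the snapshot.
original = {"id": "watson", "text": "extracted text", "author": "A J Crown"}
indexed = dict(original, snapshot=snapshot_document(original))

# Later: change only the metadata, without going back to the source file.
updated = apply_metadata_update(indexed, {"year": "1984"})
```

With this approach the whole document is still sent to the index again on every update; what is avoided is fetching the source file and extracting its text a second time.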

Upvotes: 1

Jesvin Jose

Reputation: 23088

"if there is a way to avoid content indexing and update just the metadata" This has been covered in solr indexing and reindexing and the answer is no.

Do remember that Solr uses a very loose schema. It's like a database where everything is put into a single table. Think sparse matrices, think Amazon SimpleDB. Two Solr indexes are treated as two databases, not two tables, if you had DB-like joins in mind. I just answered a question on this in "How to start and Stop SOLR from A user created windows service".

I would enter each file as two documents (a Solr document = a DB row). Hence, for a file on "watson":

id: docs_contents_watson
type:contents
text: text of the file

and the metadata as

id:docs_metadata_watson
type:metadata
author:A J Crown
year:1984
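As a rough sketch of that split, assuming documents are represented as plain dicts before being handed to whatever client posts them to Solr (the helper name `split_into_solr_docs` is made up for illustration):

```python
def split_into_solr_docs(name, text, metadata):
    """Return (contents_doc, metadata_doc) for one file.

    The ids follow the docs_contents_<name> / docs_metadata_<name>
    pattern from the example, so each half can be replaced on its own.
    """
    contents_doc = {
        "id": "docs_contents_" + name,
        "type": "contents",
        "text": text,
    }
    metadata_doc = {"id": "docs_metadata_" + name, "type": "metadata"}
    metadata_doc.update(metadata)
    return contents_doc, metadata_doc


contents, meta = split_into_solr_docs(
    "watson", "text of the file", {"author": "A J Crown", "year": "1984"}
)
```

A metadata change then only requires re-adding the small `docs_metadata_watson` document; the contents document is left alone.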

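To search the contents of a document, put both clauses inside the q parameter (URL-encode the spaces and quotes when sending the request): http://localhost:8080/app/select?q=type:contents AND text:"on a dark lonely night"

To do metadata searches: http://localhost:8080/app/select?q=type:metadata AND year:1984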

Note the type:xx.
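For illustration, here is one way such request URLs could be built so that the type clause and the field clause end up in a single, properly encoded `q` parameter. The base URL is the one from the examples; the helper name `typed_query_url` is an assumption, and the value is always sent as a quoted phrase.

```python
from urllib.parse import urlencode

SELECT_URL = "http://localhost:8080/app/select"


def typed_query_url(doc_type, field, value):
    """Build a select URL restricted to one document type."""
    # Escape backslashes and double quotes so the value stays one phrase.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    query = 'type:{} AND {}:"{}"'.format(doc_type, field, escaped)
    # urlencode percent-encodes the colons and quotes and turns spaces into '+'.
    return SELECT_URL + "?" + urlencode({"q": query})


contents_url = typed_query_url("contents", "text", "on a dark lonely night")
metadata_url = typed_query_url("metadata", "year", "1984")
```

Restricting by type could also be done with a separate filter query parameter (`fq=type:metadata`), which keeps the type restriction out of relevance scoring.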

This may be a kludge (an implementation that can cause headaches in the long run). Fellow SO'ers, please critique this.

Upvotes: 1
