How to update Solr documents on the Solr server side with custom handler / plugin

Question

I have a core with millions of records.
I want to add a custom handler that scans the existing documents and updates one of the fields based on a condition (age>12, for example).
I would prefer to do this on the Solr server side, to avoid sending millions of documents to the client and back.
I was thinking of writing a Solr plugin that receives a query and updates some fields on the matching documents (like the delete-by-query handler).
I was wondering whether there are existing solutions or better alternatives.
I searched the web for a while and couldn't find examples of Solr plugins that update documents (I don't need to extend the update handler).
I've written a plugin that uses the following code, which works fine but isn't as fast as I need.
Currently I do:

// One command object is reused for every matching document.
AddUpdateCommand addUpdateCommand = new AddUpdateCommand(solrQueryRequest);
DocIterator iterator = docList.iterator();
SolrIndexSearcher indexReader = solrQueryRequest.getSearcher();
while (iterator.hasNext()) {
   // Load the stored fields of the next matching document.
   Document document = indexReader.doc(iterator.nextDoc());
   SolrInputDocument solrInputDocument = new SolrInputDocument();
   addUpdateCommand.clear();
   addUpdateCommand.solrDoc = solrInputDocument;
   // Only the id and the changed field are set; new_value is not shown here.
   addUpdateCommand.solrDoc.setField("id", document.get("id"));
   addUpdateCommand.solrDoc.setField("my_updated_field", new_value);
   updateRequestProcessor.processAdd(addUpdateCommand);
}

But this is very expensive, since the update handler fetches the document again even though I already have it at hand.
Is there a safe way to update the Lucene document and write it back while taking into account all the Solr-related code, such as caches, extra Solr logic, etc.?
I was thinking of converting it to a SolrInputDocument and then just adding the document through Solr, but I would first need to convert all the fields.
Thanks in advance, Avner

phanin · Accepted Answer

I'm not sure whether the following will improve performance, but I thought it might help you.

Look at SolrEntityProcessor

Its description sounds very relevant to what you are looking for.

This EntityProcessor imports data from different Solr instances and cores. 
The data is retrieved based on a specified (filter) query. 
This EntityProcessor is useful in cases you want to copy your Solr index 
and slightly want to modify the data in the target index. 
In some cases Solr might be the only place where all data is available.

However, I couldn't find an out-of-the-box feature to embed your logic. So, you may have to extend the following class.

SolrEntityProcessor and the link to sourcecode
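
As a rough illustration of the kind of logic you would embed, here is a minimal sketch of the per-row rule from the question (age>12). It uses only plain Java maps, on the assumption that an entity processor hands each document to you as a Map<String, Object> row, so a subclass could call something like this from its row-producing method. The class name RowUpdateSketch, the method apply and the field names are made up for the example and are not part of Solr.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: the per-row rule an extended entity processor could apply. */
final class RowUpdateSketch {

    // Field names follow the question's example; adjust them to your schema.
    static final String AGE_FIELD = "age";
    static final String TARGET_FIELD = "my_updated_field";

    /**
     * Returns a copy of the row with TARGET_FIELD set to newValue when age > 12.
     * Rows that do not match are copied unchanged. A null row is passed through,
     * on the assumption that null signals the end of the data.
     */
    static Map<String, Object> apply(Map<String, Object> row, Object newValue) {
        if (row == null) {
            return null;
        }
        Map<String, Object> copy = new HashMap<>(row);
        Object age = copy.get(AGE_FIELD);
        if (age instanceof Number && ((Number) age).doubleValue() > 12) {
            copy.put(TARGET_FIELD, newValue);
        }
        return copy;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("id", "42");
        row.put("age", 30);
        // The returned map contains the original fields plus my_updated_field.
        System.out.println(apply(row, "adult"));
    }
}
```

Keeping the rule in a small static method like this makes it easy to unit test without a running Solr instance.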

You probably know these already, but here are a few other points.

1) Make the process multi-threaded so that it uses all the available CPU cores.

2) Use the latest version of Solr.

3) Experiment with two Solr apps on different machines with minimal network delay. This is a tough call: same machine with two processes vs. two machines with more cores but network overhead.

4) Tweak the Solr caches in a way that suits your use case and particular implementation.

5) A couple more resources: Solr Performance Problems and SolrPerformanceFactors
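
To make point 1 a bit more concrete, here is a minimal sketch of splitting the work into batches and running them on a fixed thread pool. It is plain Java and does not touch Solr: the updateBatch callback stands in for whatever code actually updates a batch of documents, and whether that code can safely be called from several threads is something you would have to verify for your setup. ParallelBatchSketch and its method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical helper: runs a per-batch update on a fixed-size thread pool. */
final class ParallelBatchSketch {

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    /**
     * Calls updateBatch once per batch using the given number of threads and
     * returns the number of batches processed. The first worker failure is
     * rethrown as an IllegalStateException.
     */
    static <T> int processInParallel(List<T> items, int batchSize, int threads,
                                     Consumer<List<T>> updateBatch) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<T> batch : partition(items, batchSize)) {
                futures.add(pool.submit(() -> updateBatch.accept(batch)));
            }
            for (Future<?> future : futures) {
                try {
                    future.get(); // wait for the batch and surface its failure, if any
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for a batch", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("batch update failed", e.getCause());
                }
            }
            return futures.size();
        } finally {
            pool.shutdown();
        }
    }
}
```

A starting point for the thread count could be Runtime.getRuntime().availableProcessors(), with the batch size tuned by measurement.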

Hope it helps. Let me know the stats, whether or not this answer helps. I'm curious, and your info might help somebody later.
