Reputation: 3752
I've been analyzing the best way to improve the performance of our Solr index and will likely shard the current index so that searches can be distributed.
However, given that our index is over 400GB and contains about 700MM documents, reindexing the data seems burdensome. I've been toying with the idea of duplicating the index and deleting documents from each copy as a more efficient way to create the sharded environment.
Unfortunately, it seems that a modulus operation isn't available for querying against the document's internal numeric ID. What other partitioning strategies could I use to delete by query rather than doing a full reindex?
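To make the intent concrete, the kind of command I'd hope to run against each copy is sketched below in SolrJ. This is only an illustration: it assumes a hypothetical indexed numeric field (doc_id_num) to bucket on, since the internal Lucene doc ID can't be referenced in a query, and it relies on Solr's {!frange} parser and mod() function query; the SolrJ client class name varies by version.

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class PruneToShard {
    public static void main(String[] args) throws Exception {
        int numShards = 4;                    // assumed target shard count
        int keep = Integer.parseInt(args[1]); // the remainder this copy keeps
        // args[0] is the core URL, e.g. http://localhost:8983/solr/core0
        SolrServer solr = new HttpSolrServer(args[0]);
        for (int r = 0; r < numShards; r++) {
            if (r == keep) continue;
            // {!frange l=r u=r} matches docs whose mod() value equals r;
            // deleting every other remainder leaves only this shard's docs
            solr.deleteByQuery("{!frange l=" + r + " u=" + r
                + "}mod(doc_id_num," + numShards + ")");
        }
        solr.commit();
        solr.shutdown();
    }
}
```

The deletes only mark documents; the space is reclaimed as segments merge (or on an optimize, which is itself expensive at this scale).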
Upvotes: 1
Views: 2209
Reputation: 11620
I answered this in another Stack Overflow question. There's a command-line utility I wrote (Hash-Based Index Splitter) that splits a Lucene index based on each document's ID hash.
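The core idea is small enough to sketch directly against Lucene. Below is a minimal, illustrative version, not the utility's actual code: it assumes Lucene 4.x APIs and a unique-key field named id, and prunes one copy of the index down to the documents belonging to a single shard.

```java
import java.io.File;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.index.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class HashShardPrune {
    public static void main(String[] args) throws Exception {
        int numShards = Integer.parseInt(args[1]);
        int shard = Integer.parseInt(args[2]);

        Directory dir = FSDirectory.open(new File(args[0]));
        IndexWriter writer = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_40, new KeywordAnalyzer()));

        // Point-in-time reader used to enumerate every unique-key term
        IndexReader reader = DirectoryReader.open(dir);
        Terms terms = MultiFields.getTerms(reader, "id");
        if (terms != null) {
            TermsEnum it = terms.iterator(null);
            for (BytesRef t = it.next(); t != null; t = it.next()) {
                String id = t.utf8ToString();
                // stable non-negative hash decides shard membership
                int bucket = (id.hashCode() & Integer.MAX_VALUE) % numShards;
                if (bucket != shard) {
                    writer.deleteDocuments(new Term("id", id));
                }
            }
        }
        reader.close();
        writer.close();  // deletes are flushed; merges reclaim the space later
        dir.close();
    }
}
```

Run it once per copy of the index, passing a different shard number each time.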
Upvotes: 0
Reputation: 1768
If you can find a logical key to partition the data, it will help in more than one way. For example, could you split these documents across shards in some chronological order?
We have a similar situation: an index of 250M docs split across shards by created date. A major use case involves searching across these shards over a range of created dates, so a search is only submitted to the shards that contain docs in the given date range. Logically partitioned data has other benefits too, e.g. separate capacity planning per shard, or applying different qualities of service to certain search terms.
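As a rough illustration of the routing side (host names, core names, and the created field are placeholders), a SolrJ query that fans out only to the shards covering the requested date range might look like this:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DateRoutedSearch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://host1:8983/solr/docs-2012");
        SolrQuery q = new SolrQuery("title:solr");
        q.addFilterQuery("created:[2012-01-01T00:00:00Z TO 2012-06-30T23:59:59Z]");
        // only fan the distributed search out to shards that can hold this range
        q.set("shards", "host1:8983/solr/docs-2012");
        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}
```

The shards parameter lists the cores to query; the caller (or a thin routing layer) derives that list from the date range in the request.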
Upvotes: 0
Reputation: 15791
A Lucene tool, IndexSplitter, would do the job; it's mentioned here with a link to an article (Japanese, translate it with Google...).
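For what it's worth, IndexSplitter can only split at existing segment boundaries, so a fully optimized index won't divide evenly; its sibling in the same contrib/misc package, MultiPassIndexSplitter, can split an index into an arbitrary number of parts. A rough driver (paths are placeholders, and the flags follow the usage message in the Lucene 3.x sources, so check your version):

```java
public class SplitDriver {
    public static void main(String[] args) throws Exception {
        // Delegates to the contrib/misc tool's own CLI entry point
        org.apache.lucene.index.MultiPassIndexSplitter.main(new String[] {
            "-out", "/path/to/split-output",  // directory for the parts
            "-num", "4",                      // number of parts to produce
            "/path/to/source-index"           // index to split
        });
    }
}
```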
Upvotes: 1