Nitin

Reputation: 21

How can we implement archiving in SOLR?

I am new to Apache SOLR and I want to implement archiving because my data is growing day by day. I am not sure whether SOLR supports data archiving or not. If anybody has any suggestions on this, please share them.

Upvotes: 2

Views: 891

Answers (1)

Gus

Reputation: 6881

This question is pretty general so it's a bit hard to give a cut and dried answer, but if one thinks about archiving for a moment, there are two parts to it.

  1. Removing old data
  2. Storing the old data in an alternate location.

The first part is fairly easy in solr so long as you can identify a query that will select the "old" documents. For example, if you have a field called 'indexed_date' that records when you sent the data to solr, and you want to delete everything before Jan 1, 2014, you might do this:

curl http://localhost:8983/solr/update --data '<delete><query>indexed_date:[* TO 2014-01-01T00:00:00Z]</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
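
The same delete can also be sent through the JSON update API. Here's a rough sketch in Python using the requests library; the collection name 'mycollection' is just a placeholder for your own:

import requests

# "mycollection" is a placeholder for your collection/core name.
resp = requests.post(
    "http://localhost:8983/solr/mycollection/update",
    json={"delete": {"query": "indexed_date:[* TO 2014-01-01T00:00:00Z]"}},
    params={"commit": "true"},    # commit so the deletes become visible to searches
)
resp.raise_for_status()
print(resp.json())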

The second part requires more thought. The first question is: why would you want to move the data in solr to some other location? The answer, more or less, has to be that you think you might need it again. But ask yourself what the use case for that is, and how you might service that use case. Are you planning on putting the data back into solr at some later point? Is solr the only place where this data was stored, and you need it for record keeping/audit only?

You will have to determine the second half of "archiving" based on your needs, but here are some things to think about: The data behind fields in solr that are stored="false" is already lost; you cannot completely reconstruct the data that went into creating them. Fields for which stored="true" can be retrieved in xml/json/csv with a regular query and then written out to the long-term storage of your choice. Many systems use solr as an index into the primary sources rather than using solr as a primary source itself. In that case there may be no need to archive the data at all, simply remove the data that is too old to be relevant in the search results, but of course make sure that your business team understands and agrees with this strategy before you do it! :)
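
If you do need to keep the old documents, one rough approach is to pull the stored fields out with a regular query before deleting. Here's a sketch using cursorMark deep paging; the collection name, field list, and the 'id' uniqueKey are assumptions about your schema:

import json
import requests

# Only stored="true" fields can be exported this way.
SOLR_QUERY_URL = "http://localhost:8983/solr/mycollection/select"
params = {
    "q": "indexed_date:[* TO 2014-01-01T00:00:00Z]",
    "fl": "id,title,indexed_date",
    "sort": "id asc",              # cursorMark requires a sort that includes the uniqueKey
    "rows": 500,
    "wt": "json",
    "cursorMark": "*",
}

with open("solr-archive.jsonl", "w") as out:
    while True:
        resp = requests.get(SOLR_QUERY_URL, params=params).json()
        for doc in resp["response"]["docs"]:
            out.write(json.dumps(doc) + "\n")
        next_cursor = resp["nextCursorMark"]
        if next_cursor == params["cursorMark"]:   # cursor stopped moving: no more results
            break
        params["cursorMark"] = next_cursor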

EDIT: I happened to look back at this and when I re-read it I realized I left something out and there's a new development.

What I Left Out

The above delete-by-query strategy has the drawback that deleted documents remain in the index (just marked deleted), potentially wasting as much as 50% of your space (or more if you've run "optimize" in the past!). Here's a good article by Erick Erickson about deleting and the space consequences:

https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
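
If you want to see how much of your index is currently taken up by deleted-but-not-yet-merged documents, the Luke request handler reports live vs. total document counts. A small sketch (the collection name is a placeholder):

import requests

# numTerms=0 keeps the response small; we only want the index-level stats.
resp = requests.get(
    "http://localhost:8983/solr/mycollection/admin/luke",
    params={"numTerms": 0, "wt": "json"},
).json()

index_info = resp["index"]
num_docs = index_info["numDocs"]     # live documents
max_doc = index_info["maxDoc"]       # live + deleted-but-not-merged documents
deleted = index_info.get("deletedDocs", max_doc - num_docs)
print(f"{deleted} deleted docs still taking up space ({deleted / max_doc:.0%} of maxDoc)")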

New Development

If time is the criterion for deletion, and you followed the best practice I mentioned above of not having solr be the single source of truth (i.e. solr is just an index into a primary source, not the data store), then you may very well want to use the new Time Routed Aliases feature, which maintains a set of temporally bounded collections and deletes the oldest ones. The great thing about deleting a collection rather than deleting by query is that there is no merging to do. The segments for the index disappear as a whole, so there are no deleted documents hanging around wasting space.

http://lucene.apache.org/solr/guide/7_4/time-routed-aliases.html
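
A Time Routed Alias is created through the Collections API's CREATEALIAS action. Here's a sketch of what that call can look like; the router.* and create-collection.* parameter names come from the reference guide linked above, while the alias name, routing field, interval, and retention window are illustrative values:

import requests

# "logs_tra" and "timestamp_dt" are placeholders for your alias name and date field.
resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={
        "action": "CREATEALIAS",
        "name": "logs_tra",                 # the alias applications index into and query
        "router.name": "time",
        "router.field": "timestamp_dt",     # date field used to route each document
        "router.start": "NOW/DAY",          # when the first underlying collection begins
        "router.interval": "+1DAY",         # one underlying collection per day
        "router.autoDeleteAge": "-30DAYS",  # drop whole collections older than ~30 days
        "create-collection.collection.configName": "_default",
        "create-collection.numShards": "1",
        "wt": "json",
    },
)
resp.raise_for_status()
print(resp.json())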

Self Promotion Disclaimer: Along with David Smiley, I helped write this feature.

Upvotes: 2
