Reputation: 235
I have multiple sources of data from which I want to produce Solr documents. One source is a filesystem, so I plan to iterate through a set of (potentially many) files to collect one portion of the data in each resulting Solr doc. The second source is another Solr index, from which I'd like to pull just a few fields. This second source also could have many (~millions) of records. If it matters, source 1 provides the bulk of the content (the size of each record there is several orders of magnitude greater than that from source 2).
My issue is how best to design this workflow. A few high-level choices include:

1. Index the documents from source 1 first, then update each one with the fields from source 2.
2. Index the records from source 2 first, then update each one with the content from source 1.
3. Combine the records from both sources first, then index each complete document once.
Some of the factors affecting the decision are the size of the data (I can't afford to be too inefficient in computational time or memory) and Solr's performance when replacing records (does the original record size matter much?).
Any ideas would be greatly appreciated.
Upvotes: 1
Views: 383
Reputation: 7944
Go with option 3 — combine the records before updating.
Presumably you would be using a script to iterate over the files and process them before sending them to your final Solr index. Within that script, query the alternate Solr index to fetch any supplemental field information that it might have, using your shared identifier. Combine that as appropriate with the contents of your file, then send the resulting record to Solr for indexing.
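A minimal sketch of that script in Python, using only the standard library's `urllib` against Solr's HTTP API. The host/core URLs and the supplemental field names (`extra_field_a`, `extra_field_b`) are assumptions; substitute your own:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoints -- adjust to your own Solr hosts/cores.
SUPPLEMENTAL_SOLR = "http://localhost:8983/solr/supplemental"
TARGET_SOLR = "http://localhost:8983/solr/combined"

def fetch_supplemental(doc_id):
    """Query the alternate Solr index for the few extra fields, keyed on the shared id."""
    params = urllib.parse.urlencode({
        "q": "id:%s" % doc_id,
        "fl": "id,extra_field_a,extra_field_b",  # assumed field names
        "wt": "json",
    })
    with urllib.request.urlopen("%s/select?%s" % (SUPPLEMENTAL_SOLR, params)) as resp:
        docs = json.load(resp)["response"]["docs"]
    return docs[0] if docs else {}

def merge_record(file_record, supplemental):
    """Combine before updating: the file record (source 1) wins on any conflicting field."""
    merged = dict(supplemental)
    merged.update(file_record)
    return merged

def index_record(record):
    """Send the combined record to the final Solr index via the JSON update handler."""
    body = json.dumps([record]).encode("utf-8")
    req = urllib.request.Request(
        TARGET_SOLR + "/update?commit=false",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()
```

The loop over your files would then be: parse the file into a dict, call `merge_record(parsed, fetch_supplemental(parsed["id"]))`, and pass the result to `index_record`, committing in batches rather than per document.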
By combining before you update, you don't have to worry about records overwriting each other, and you keep more control over which source takes priority. So long as you're not querying a server on the other side of the country, the request time to the alternate Solr index should be negligible.
Upvotes: 1
Reputation: 305
If you're not concerned about merging the data from the two sources before indexing, then option 1 or 2 would work fine. I would probably index the larger source first, then "update" with the second.
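The "update" step can be done with Solr atomic updates, which replace only the named fields rather than the whole document (this requires the untouched fields to be stored, or the full document gets rewritten). A hedged sketch; the endpoint URL and field name are hypothetical:

```python
import json
import urllib.request

# Hypothetical endpoint -- adjust to your own Solr host/core.
TARGET_SOLR = "http://localhost:8983/solr/combined"

def build_atomic_doc(doc_id, fields):
    """Wrap each value in Solr's {"set": ...} modifier so the update only
    replaces the named fields and leaves the rest of the stored doc intact."""
    doc = {"id": doc_id}
    for name, value in fields.items():
        doc[name] = {"set": value}
    return doc

def send_update(doc):
    """POST the atomic-update document to the final index's JSON update handler."""
    body = json.dumps([doc]).encode("utf-8")
    req = urllib.request.Request(
        TARGET_SOLR + "/update?commit=false",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()
```

For example, `send_update(build_atomic_doc("doc1", {"category": "news"}))` would set `category` on the existing document `doc1` without resending its large body from source 1.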
Upvotes: 1