Reputation: 1730
I have to ingest data into Solr every day, around 10-12 GB per day, and I also have to run a catch-up job for the last year, where each day is again around 10-12 GB.
I am using Java. I need to score my data by doing a partial update whenever the same unique key arrives again, so I used docValues with TextField, via this plugin:
https://github.com/grossws/solr-dvtf
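To make the partial-update part concrete, this is roughly what I do per document with SolrJ atomic updates (the URL, collection, and field names below are placeholders, not my real schema):

```java
import java.util.Collections;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder Solr URL and collection name
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.setField("id", "doc-123"); // unique key; if it already exists, the document is updated in place
            // Atomic update: "inc" adds to the existing numeric value, "set" would overwrite a field
            doc.setField("score", Collections.singletonMap("inc", 5));
            client.add("my_collection", doc);
            client.commit("my_collection");
        }
    }
}
```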
Initially, I used a sequential approach, which took a lot of time (reading from S3 and adding to Solr in batches of 60k).
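That sequential version was essentially a loop like the following (simplified sketch; the S3 reading code is omitted and the collection name is a placeholder):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SequentialBatchIndexer {
    private static final int BATCH_SIZE = 60_000; // the batch size mentioned above

    // 'records' stands in for whatever the S3 reader produces
    static void index(SolrClient client, Iterable<SolrInputDocument> records) throws Exception {
        List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
        for (SolrInputDocument doc : records) {
            batch.add(doc);
            if (batch.size() >= BATCH_SIZE) {
                client.add("my_collection", batch); // one request per 60k documents
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            client.add("my_collection", batch);
        }
        client.commit("my_collection");
    }
}
```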
I found this repo:
https://github.com/lucidworks/spark-solr,
but I couldn't work out from its implementation how to modify field data for my scoring logic, so I wrote custom Spark code instead.
Then I created 4 Solr nodes (on the same IP) and used Spark to insert the data. Initially, the partitions created by Spark far outnumbered the Solr nodes, and the number of executors specified was also higher than the number of nodes, so the insertion took much longer.
Then I repartitioned the RDD into 4 partitions (the number of Solr nodes) and specified 4 executors; the insertion took less time and succeeded. But when I ran the same job for a month's worth of data, one or more Solr nodes kept going down. I have enough free disk space, and my RAM usage only rarely fills up.
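For reference, my Spark write is along these lines (simplified sketch; the ZooKeeper address, collection name, and batch size are placeholders, and error handling is left out):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.spark.api.java.JavaRDD;

public class SparkSolrWriter {
    // Placeholder ZooKeeper address, collection name, and batch size
    private static final String ZK_HOST = "localhost:9983";
    private static final String COLLECTION = "my_collection";
    private static final int BATCH_SIZE = 10_000;

    static void write(JavaRDD<Map<String, Object>> rows, int numSolrNodes) {
        rows.repartition(numSolrNodes) // one partition per Solr node, as described above
            .foreachPartition(partition -> {
                // Each partition opens its own client, so there are at most numSolrNodes concurrent writers
                try (CloudSolrClient client = new CloudSolrClient.Builder(
                        Collections.singletonList(ZK_HOST), Optional.empty()).build()) {
                    client.setDefaultCollection(COLLECTION);
                    List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
                    while (partition.hasNext()) {
                        SolrInputDocument doc = new SolrInputDocument();
                        partition.next().forEach(doc::setField);
                        batch.add(doc);
                        if (batch.size() >= BATCH_SIZE) {
                            client.add(batch);
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        client.add(batch);
                    }
                    client.commit();
                }
            });
    }
}
```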
Please suggest a way to solve this problem (I have an 8-core CPU), or should I run the Solr nodes on separate machines?
Thanks!
Upvotes: 2
Views: 1741
Reputation: 1008
I am not sure Spark would be the best way to load that much data into Solr.
Your possible options for loading data into Solr are:
Let me know if you want more information on any of these.
Upvotes: 1