Reputation: 1730
I have to ingest data into Solr every day, around 10-12 GB per day, and I also have to run a catch-up job for the last year, where each day is again around 10-12 GB.
I am using Java. I need to score my data by doing a partial update whenever the same unique key arrives again, so I used docValues with TextField, via this plugin:
https://github.com/grossws/solr-dvtf
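To make the partial-update part concrete, this is roughly what I do per document with SolrJ atomic updates (the URL, collection, and field names below are placeholders, not my real schema):

```java
import java.util.Collections;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder Solr URL and collection name
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.setField("id", "doc-123"); // unique key; if it already exists, the document is updated in place
            // Atomic update: "inc" adds to the existing numeric value, "set" would overwrite a field
            doc.setField("score", Collections.singletonMap("inc", 5));
            client.add("my_collection", doc);
            client.commit("my_collection");
        }
    }
}
```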
Initially, I used a sequential approach, which took a lot of time (reading from S3 and adding to Solr in batches of 60k).
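That sequential version was essentially a loop like the following (simplified sketch; the S3 reading code is omitted and the collection name is a placeholder):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SequentialBatchIndexer {
    private static final int BATCH_SIZE = 60_000; // the batch size mentioned above

    // 'records' stands in for whatever the S3 reader produces
    static void index(SolrClient client, Iterable<SolrInputDocument> records) throws Exception {
        List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
        for (SolrInputDocument doc : records) {
            batch.add(doc);
            if (batch.size() >= BATCH_SIZE) {
                client.add("my_collection", batch); // one request per 60k documents
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            client.add("my_collection", batch);
        }
        client.commit("my_collection");
    }
}
```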
I found this repo:
https://github.com/lucidworks/spark-solr,
but I couldn't work out from its implementation how to modify field data for my scoring logic, so I wrote custom Spark code instead.
Then I created 4 Solr nodes (on the same IP) and used Spark to insert the data. Initially, the partitions created by Spark far outnumbered the Solr nodes, and the number of executors specified was also higher than the number of nodes, so the insertion took much longer.
Then I repartitioned the RDD into 4 partitions (the number of Solr nodes) and specified 4 executors; the insertion took less time and succeeded. But when I ran the same job for a month's worth of data, one or more Solr nodes kept going down. I have enough free disk space, and my RAM usage only rarely fills up.
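For reference, my Spark write is along these lines (simplified sketch; the ZooKeeper address, collection name, and batch size are placeholders, and error handling is left out):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.spark.api.java.JavaRDD;

public class SparkSolrWriter {
    // Placeholder ZooKeeper address, collection name, and batch size
    private static final String ZK_HOST = "localhost:9983";
    private static final String COLLECTION = "my_collection";
    private static final int BATCH_SIZE = 10_000;

    static void write(JavaRDD<Map<String, Object>> rows, int numSolrNodes) {
        rows.repartition(numSolrNodes) // one partition per Solr node, as described above
            .foreachPartition(partition -> {
                // Each partition opens its own client, so there are at most numSolrNodes concurrent writers
                try (CloudSolrClient client = new CloudSolrClient.Builder(
                        Collections.singletonList(ZK_HOST), Optional.empty()).build()) {
                    client.setDefaultCollection(COLLECTION);
                    List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
                    while (partition.hasNext()) {
                        SolrInputDocument doc = new SolrInputDocument();
                        partition.next().forEach(doc::setField);
                        batch.add(doc);
                        if (batch.size() >= BATCH_SIZE) {
                            client.add(batch);
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        client.add(batch);
                    }
                    client.commit();
                }
            });
    }
}
```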
Please suggest a way to solve this problem (I have an 8-core CPU), or should I run the Solr nodes on separate machines?
Thanks!
Upvotes: 2
Views: 1741
Reputation: 1008
I am not sure Spark would be the best way to load that much data into Solr.
Your possible options for loading data into Solr are:
Let me know if you want more information on any of these.
Upvotes: 1