Nikhil Verma

Reputation: 1730

Best way to insert a lot of Data in Solr

I have to ingest data into Solr every day. Each day's data is around 10-12 GB, and I also have to run a catch-up job for the last year at the same 10-12 GB per day.

I am using Java. I need scoring in my data, done through a partial update whenever the same unique key arrives again, so I used docValues with TextField:

https://github.com/grossws/solr-dvtf
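To make the partial-update requirement concrete, here is a minimal sketch of the document shape that Solr's atomic updates expect: the unique key plus a nested map naming the operation (`inc` here). The class name `AtomicUpdateSketch` and the field names `id` and `score` are hypothetical, and whether `inc` matches the actual scoring logic is an assumption. With SolrJ, the same nested map would be passed as the field value of a `SolrInputDocument`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AtomicUpdateSketch {
    /**
     * Builds the map form of an atomic "inc" update:
     * { <idField>: <id>, <field>: { "inc": <delta> } }.
     * Field names are supplied by the caller; none are taken from a real schema.
     */
    public static Map<String, Object> incrementUpdate(String idField, String id,
                                                      String field, int delta) {
        Map<String, Object> op = new LinkedHashMap<>();
        op.put("inc", delta);          // atomic-update operation name and its argument

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put(idField, id);          // unique key identifies the document to update
        doc.put(field, op);            // nested map marks this field as a partial update
        return doc;
    }
}
```

Calling `incrementUpdate("id", "doc-1", "score", 5)` yields a map that serializes to the JSON `{"id":"doc-1","score":{"inc":5}}`.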

Initially, I used a sequential approach (reading from S3 and adding to Solr in batches of 60k), which took a lot of time.
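A minimal sketch of the batching step in that sequential approach, for reference. `BatchSender` and the `sink` callback are hypothetical names; the sink stands in for whatever call actually adds a batch to Solr, so nothing here depends on a Solr client.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class BatchSender {
    /**
     * Drains the iterator in fixed-size batches and hands each batch to the sink.
     * The last batch may be smaller. Returns the number of batches sent.
     */
    public static <T> int sendInBatches(Iterator<T> docs, int batchSize,
                                        Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int sent = 0;
        while (docs.hasNext()) {
            batch.add(docs.next());
            if (batch.size() == batchSize) {
                sink.accept(batch);                 // e.g. one add request per batch
                batch = new ArrayList<>(batchSize); // fresh list so the sink may keep the old one
                sent++;
            }
        }
        if (!batch.isEmpty()) {                     // flush the final partial batch
            sink.accept(batch);
            sent++;
        }
        return sent;
    }
}
```

With 10 items and a batch size of 4, the sink receives batches of 4, 4 and 2 items.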

I found this repo:

https://github.com/lucidworks/spark-solr,

but I couldn't understand the implementation, and I needed to modify field data for some scoring logic, so I wrote custom Spark code.

Then I created 4 Solr nodes (on the same IP) and used Spark to insert the data. Initially, Spark created far more partitions than there were Solr nodes, and the number of executors I specified was also higher than the number of nodes, so it took much more time.

Then I repartitioned the RDD into 4 partitions (the number of Solr nodes) and specified 4 executors. Insertion then took less time and succeeded. But when I ran the same job for a month of data, one or more Solr nodes kept going down. I have enough free disk space, and my RAM usage rarely fills up.

Please suggest a way to solve this problem. I have an 8-core CPU; should I use a separate machine for each Solr node?
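The idea of capping the number of concurrent writers at the number of Solr nodes can be illustrated without Spark. The sketch below is not the Spark job itself: it swaps in a plain Java thread pool, splits the input into as many slices as there are writers, and runs one task per slice. `BoundedWriters` and `sink` are hypothetical names, and the sink stands in for the code that sends one slice to Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

public class BoundedWriters {
    /** Splits items round-robin into `parts` slices of roughly equal size. */
    public static <T> List<List<T>> partition(List<T> items, int parts) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            out.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            out.get(i % parts).add(items.get(i));
        }
        return out;
    }

    /** Runs one writer task per slice on a pool of `writers` threads and waits for all of them. */
    public static <T> void writeAll(List<T> items, int writers, Consumer<List<T>> sink) {
        ExecutorService pool = Executors.newFixedThreadPool(writers);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<T> part : partition(items, writers)) {
                futures.add(pool.submit(() -> sink.accept(part)));
            }
            for (Future<?> f : futures) {
                try {
                    f.get(); // blocks until the task finishes; surfaces any failure
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, `BoundedWriters.writeAll(docs, 4, slice -> send(slice))` never has more than 4 slices in flight, where `send` is whatever method pushes a slice to Solr.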

Thanks!

Upvotes: 2

Views: 1741

Answers (1)

Sanchit Grover

Reputation: 1008

I am not sure Spark is the best way to load that much data into Solr.

Your possible options for loading data into Solr are:

  1. Use hbase-indexer, also called the batch indexer, which syncs data between your HBase table and your Solr index.
  2. Implement an hbase-lily-indexer, which works in almost real time.
  3. Use Solr's JDBC utility, which is the best option in my opinion. You can read the data from S3 and load it into a Hive table through Spark, then set up a Solr JDBC connection to your Hive table. In my experience it is very fast.

Let me know if you want more information on any of these.

Upvotes: 1
