Reputation: 1916
I have a MySQL database with 4 million products, which I am importing into Solr using DIH so that I may perform elaborate searches. However the data relationships mean that I actually request a lot more than four million records (e.g. one product may have many colours, etc), and it takes over 8 hours to build the index.
Is there a way to improve the performance of the indexing without using delta-queries? For example, is the performance bottleneck due to the multiple "join" conditions I am using? There are no indexing performance statistics available in Solr that I can see so it is very hard to diagnose where the performance bottleneck is.
This is my data-config.xml file:
Thanks,
<document>
<entity name="A" pk="id" query="SELECT id AS id_productByStore, id_product, id_store, ... FROM A">
<entity name="B" pk="id" query="SELECT id, cleanTitle, id_brand, ... FROM B WHERE id='${A.id_product}'">
<entity name="C" pk="id" query="SELECT name, alias FROM C WHERE id ='${B.id_brand}'"></entity>
<entity name="D" pk="id" query="SELECT name FROM D WHERE id ='${B.id_category}'"></entity>
<entity name="E" pk="id" query="SELECT gender FROM E WHERE id='${B.id_gender}'" > </entity>
<entity name="F" pk="id" query="SELECT id_colour FROM F WHERE id_colourSet='${B.id_colourSet}'">
<entity name="G" pk="id" query="SELECT title FROM G WHERE id='${F.id_colour}'" > </entity>
</entity>
</entity>
<entity name="H" pk="id" query="SELECT name FROM H WHERE id = '${A.id_store}'"></entity>
</entity>
</document>
Upvotes: 2
Views: 1881
Reputation: 1183
If your MySQL DB and Solr server are not on the same machine, you could have a network issue on your hands. The DB and Solr server at my shop aren't on the same machine and sometimes imports slow down by a lot, depending on what's going on that day.
The thing that is probably your biggest contributor is your nested entities. When Solr imports documents, it appears that Solr behaves as if nested entities are nested loops. You would probably be much better off if you could use a series of inner or right joins to bring your columns together in one query.
We used to use nested entities where I work, and imports could take hours. We were able to write a fairly complex MySQL join to replace those nested entities. Our full imports typically are around the 10 - 15 minute range, and we're pulling in about 3 or 4 million records. Deltas are in the 5 - 10 minute range. Even if you can't join all your columns, joining as many as possible and using nested entities for the rest should help your indexing time.
Upvotes: 4