Reputation: 3277
I'm working on incremental data import into Solr from an existing, normalized MSSQL database. I'm unable to decide on the strategy I need to implement, and I don't know whether there are existing tools that already do this, so that I don't need to reinvent the wheel.
I need to import documents into Solr 3.6 to build the index; the source data is saved in MSSQL in a heavily normalized fashion. Retrieving the data for a single document requires many joins, which is killing performance. I have approximately 1 million such documents in the database, so a full import into Solr every time is not an option for me.
While deciding on the approach, I have two issues to consider: picking up newly inserted records incrementally, and tracking the records that have changed since the last import so they can be re-indexed.
I am looking for your help in deciding on the strategy and tool for incremental data import into Solr. I think I have the following options:
Custom-develop an application to fetch data from MSSQL and push it to Solr. I would need to keep track of which records have already been inserted into Solr and which are still pending. On top of that, about 2% of the records in MSSQL are updated daily, so I would also need to track what data has changed since the last import and push those records into Solr again at some point (I sketch a possible change-tracking scheme further below).
Use an existing tool or utility in Solr, such as the DataImportHandler (DIH). I'm not sure how this addresses both issues, i.e. incremental data retrieval and tracking what has changed in SQL Server (see the delta-import sketch after this list). I'm also not sure how DIH will handle the complex joins required to fetch the data from the database.
Or use something like LuSQL together with DIH, but I'm still not sure how that addresses both issues either. LuSQL does give the ability to do complex joins in the database, so I hope it might fit my purpose.
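From what I've read of DIH so far, it does have a delta-import mode that covers at least part of the change-tracking problem: it records the time of the last run in dataimport.properties and exposes it as ${dataimporter.last_index_time}, so a deltaQuery can select the IDs of rows modified since then. A minimal sketch of the data-config.xml I have in mind, assuming the heavy joins are wrapped in a view (dbo.documents_view, last_modified and the credentials are all made-up names for illustration):

```xml
<dataConfig>
  <!-- JDBC data source pointing at the MSSQL database -->
  <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost;databaseName=mydb"
              user="solr" password="secret"/>
  <document>
    <entity name="doc" pk="id"
            query="SELECT id, title, body FROM dbo.documents_view"
            deltaQuery="SELECT id FROM dbo.documents_view
                        WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, title, body FROM dbo.documents_view
                              WHERE id = '${dataimporter.delta.id}'">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="body" name="body"/>
    </entity>
  </document>
</dataConfig>
```

If I understand correctly, deltaQuery collects the changed IDs, deltaImportQuery re-fetches each full row, and the run is triggered with /dataimport?command=delta-import. The joins themselves could live in the view or a stored procedure, so DIH would only ever run flat SQL; what I can't judge is whether this performs acceptably in my case.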
I'm in favor of using LuSQL with DIH in Solr, if it can fit the purpose, but I'm still not sure how LuSQL itself would keep track of what data has changed. Or do I have to manage that part manually, by maintaining the IDs of the documents where changes were made and then supplying them to LuSQL to fetch the data from SQL Server and import it into Solr?
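If it does come down to manual tracking, my current idea is a change-log table that triggers on the source tables keep up to date, and that the importer (LuSQL or a custom job) drains after each successful commit. A rough sketch, with made-up table and column names:

```sql
-- Change-log table: one row per document that needs (re-)indexing.
CREATE TABLE dbo.solr_pending (
    doc_id     INT      NOT NULL PRIMARY KEY,
    changed_at DATETIME NOT NULL DEFAULT GETDATE()
);

-- Trigger on one of the source tables (dbo.documents is hypothetical);
-- MERGE keeps a single row per document even if it changes repeatedly.
CREATE TRIGGER trg_documents_touch ON dbo.documents
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    MERGE dbo.solr_pending AS p
    USING (SELECT id FROM inserted) AS i ON p.doc_id = i.id
    WHEN MATCHED THEN UPDATE SET changed_at = GETDATE()
    WHEN NOT MATCHED THEN INSERT (doc_id) VALUES (i.id);
END;
```

The importer would then select the IDs from dbo.solr_pending, fetch the full documents, and delete the processed rows only after Solr confirms the commit.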
I am also looking forward to your suggestions beyond these options for handling this kind of situation.
Upvotes: 1
Views: 2606
Reputation: 1768
We had to index from a heavily normalized schema with 25+ tables, half of which contain over 5M records each; the largest has ~20M.
We use Informatica to load these records from Oracle into Solr. ETL tools like Informatica provide ways to join tables or query results outside the relational database: there is a sorter transformation to sort outside the database, an aggregator transformation to group records outside the database, and a lookup transformation as well.
Essentially, our data is de-normalized in stages, and the loading/indexing process is distributed.
There are open-source ETL tools as well, of course, and Microsoft has its own ETL tool (SQL Server Integration Services).
Indexing into Solr happens through the update handler. Delta indexing is very similar to full indexing, with additional logic for change data capture. The ETL activity is scheduled.
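To illustrate, a delta batch ends up as ordinary update-handler traffic: an XML message like the one below (field names are placeholders) POSTed to /solr/update, followed by a commit (either a <commit/> message or ?commit=true on the request):

```xml
<add>
  <doc>
    <field name="id">12345</field>
    <field name="title">Example document</field>
  </doc>
</add>
```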
Upvotes: 1
Reputation: 2542
I will share with you the way I do this.
Mainly I have the same requirements, and until this week I used the Solr DataImportHandler with delta imports. I have a program that regularly updates the status of new items from 0 to 1 and then calls the Solr data import to get all the documents with status 1. The DataImportHandler uses a stored procedure to do the joins and fetch the documents with status 1 from the database. If the import finishes successfully, I then update the status to 2, so I know those documents are in Solr. If a document gets changed, I simply change its status from 2 back to 0, and the next import run updates the document in Solr.
Everything works fine for me using this process. I always get the new documents into Solr without having to fetch all the data from the database.
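To make the lifecycle concrete, this is roughly what the status handling looks like on the database side (table and column names are simplified; the real join logic lives in the stored procedure):

```sql
-- Status values: 0 = new/changed, 1 = selected for this run, 2 = in Solr

-- 1) Mark the pending documents for the current import run
UPDATE dbo.documents SET status = 1 WHERE status = 0;

-- 2) The stored procedure behind DIH returns only the marked rows:
--      SELECT ... (joins) ... WHERE d.status = 1
--    e.g. triggered with /dataimport?command=full-import&clean=false
--    so the rest of the index is left untouched.

-- 3) After DIH reports success, confirm the batch
UPDATE dbo.documents SET status = 2 WHERE status = 1;

-- 4) When a document changes, flag it for re-import
--    (@changed_id is supplied by the application)
UPDATE dbo.documents SET status = 0 WHERE id = @changed_id;
```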
Now my requirements have changed, because we decided to keep the data archived in the database, as we only need it in Solr. So I need a program that deserializes the data and then sends it to Solr.
My approach now is to add all the new/updated documents via the update handler, commit them after everything has been added, and only update the status in the database if the commit succeeds. I have no experience with this approach yet, so I don't know whether it will work, but I will just try it and see what happens.
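Since I haven't tried it yet, this is only a sketch of what I have in mind, using the SolrJ 3.x client (the URL and field names are placeholders):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IncrementalIndexer {
    public static void main(String[] args) throws Exception {
        // SolrJ 3.x HTTP client (renamed HttpSolrServer in 4.x)
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Build the batch from whatever was deserialized from the database
        List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "12345");
        doc.addField("title", "Example document");
        docs.add(doc);

        server.add(docs);  // send the batch to the update handler
        server.commit();   // make it visible; throws on failure

        // Only here, after a successful commit, would I update the
        // status flags in the database (JDBC code omitted).
    }
}
```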
In the past I researched a better way to do this, but I couldn't find anything, so if you find a better solution, please share it with me.
Good luck :)
Upvotes: 2