Reputation: 4000
I am currently doing performance and memory tuning in our Hibernate-based app for large bulk/batch imports. We are basically importing a CSV file with product data, where some products are new (insert) and some already exist (update).
My focus now is on choosing a strategy to find out which entities to UPDATE and which ones to INSERT, without doing a check (Select if exists) for each row in the CSV file.
My current approach is like this:
This approach works well, and tests have proved it is orders of magnitude faster than doing such a single IF EXISTS check for every row.
My concern is memory size if there are LOTS of entities in the DB.
Right now I am thinking about using a slight variation of the approach above, and I would like to hear opinions.
Basically, what I want to do is run multiple batched IF EXISTS checks covering several rows at once, e.g. SELECT FROM table WHERE sku IN (sku1, sku2, sku3).
Here is some pseudo code:
1. Database contains: db {sku1, sku2, sku3, sku5}
2. File contains: file {sku1, sku2, sku3, sku6}
3. Expected result:
updates: {sku1, sku2, sku3}
inserts: {sku6}
4. Algorithm:
Have a map to keep the database entities which need updates:
updatemap {}
Now iterate over the file in batches of e.g. 2 rows (for demo purposes):
1st iteration: foreach (select where sku IN (sku1, sku2) limit 2) as elem
-> updatemap.add(elem) -> elem is assumed to be a persistent entity here
-> myDAO.update(elem) -> executes Spring's getHibernateTemplate().update() under the hood
-> updatemap contents after 1st loop: {sku1, sku2}
2nd iteration: foreach (select where sku IN (sku3, sku6) limit 2) as elem
-> updatemap.add(elem)
-> myDAO.update(elem)
-> updatemap contents after 2nd loop: {sku1, sku2, sku3}
BTW: I also already use the usual batching idiom, i.e. if (i % 30 == 0) { session.flush(); session.clear(); }
Now we know all the elements which were updated. All SKUs not in updatemap are inserts, and we can use simple set arithmetic to determine them:
file {sku1, sku2, sku3, sku6} - updatemap {sku1, sku2, sku3} = newinserts {sku6}
Now we can go ahead and do inserts for the remaining CSV rows.
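To make the walkthrough concrete, here is a minimal Java sketch of this chunked approach, assuming a plain Hibernate 5.x Session rather than Spring's HibernateTemplate; the Product and CsvRow types, the field names and the chunk size are placeholders of mine, not taken from the question. It also only remembers the SKUs it has seen (not the loaded entities), which is all the final set difference needs.

import java.math.BigDecimal;
import java.util.*;

import javax.persistence.*;

import org.hibernate.Session;

// Illustrative sketch only: the entity and CSV row types are assumptions.
@Entity
class Product {
    @Id @GeneratedValue Long id;
    @Column(unique = true) String sku;
    BigDecimal price;
    String getSku() { return sku; }
}

class CsvRow {
    String sku;
    BigDecimal price;
}

public class ChunkedImporter {

    private static final int CHUNK_SIZE = 50;

    public void importRows(Session session, Map<String, CsvRow> csvRows) {
        List<String> allSkus = new ArrayList<>(csvRows.keySet());
        Set<String> updatedSkus = new HashSet<>();   // the "updatemap": SKUs only, not entities

        for (int i = 0; i < allSkus.size(); i += CHUNK_SIZE) {
            List<String> chunk = allSkus.subList(i, Math.min(i + CHUNK_SIZE, allSkus.size()));

            // One SELECT ... WHERE sku IN (...) per chunk instead of one SELECT per CSV row.
            List<Product> existing = session
                    .createQuery("from Product p where p.sku in (:skus)", Product.class)
                    .setParameterList("skus", chunk)
                    .list();

            for (Product p : existing) {
                p.price = csvRows.get(p.getSku()).price;  // managed entity: dirty-checked, flushed as an UPDATE
                updatedSkus.add(p.getSku());
            }

            // The flush/clear idiom from above, applied per chunk to cap session size.
            session.flush();
            session.clear();
        }

        // Set arithmetic: file SKUs minus updated SKUs = rows to INSERT.
        for (String sku : allSkus) {
            if (!updatedSkus.contains(sku)) {
                Product fresh = new Product();
                fresh.sku = sku;
                fresh.price = csvRows.get(sku).price;
                session.persist(fresh);
            }
        }
        session.flush();
        session.clear();
    }
}

Because the entities returned by each chunk query are managed, changing their fields is enough; the flush at the end of the chunk turns the dirty entities into UPDATE statements without an explicit update() call.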
Conclusion: My assumption is that, because of the chunking of the file contents, I can limit the amount of memory used. I end up with more SELECT statements than in my initial approach, but I have more control over memory usage in case there are already thousands of entities in the DB.
What are your ideas on this? What other efficient approaches exist to find out which entities to update and which to insert in bulk?
Upvotes: 5
Views: 2118
Reputation: 1235
My thoughts
1) When you do this:
SELECT FROM table WHERE sku IN (sku1, sku2, sku3)
each query might do a full table scan when a sku is not found, and if you do this for the remaining entities in n passes, the worst case may require n full table scans.
Perhaps a simpler approach would be to create a duplicate table for all the entities in the CSV (maybe with only one column for the SKUs) and perform a MINUS to get the new SKUs to be inserted:
select sku from dup_table
MINUS -- (EXCEPT for MySQL)
select sku from table
You may save these records into a new table (dup_table2), and performing another MINUS against dup_table will give the SKUs to be updated. These operators are DB-specific, and I am not sure how much performance gain you would see, but IMHO it looks like a much better option than a WHERE IN clause (especially when the CSV list gets big).
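For what it's worth, a rough JDBC sketch of that staging-table idea could look like the following; the table names, the TEMPORARY TABLE syntax and the EXCEPT/MINUS keyword are database-specific assumptions of mine, not something from the answer:

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the staging-table + MINUS/EXCEPT idea.
public class StagingTableDiff {

    public List<String> findNewSkus(Connection con, List<String> csvSkus) throws SQLException {
        try (Statement ddl = con.createStatement()) {
            // A throwaway table holding just the SKUs from the CSV.
            ddl.execute("CREATE TEMPORARY TABLE dup_table (sku VARCHAR(64) PRIMARY KEY)");
        }

        try (PreparedStatement ins = con.prepareStatement("INSERT INTO dup_table (sku) VALUES (?)")) {
            for (String sku : csvSkus) {
                ins.setString(1, sku);
                ins.addBatch();          // JDBC batching keeps the round trips down
            }
            ins.executeBatch();
        }

        // SKUs present in the CSV but not in the product table -> rows to INSERT.
        // Use MINUS on Oracle, EXCEPT on most other databases.
        List<String> newSkus = new ArrayList<>();
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT sku FROM dup_table EXCEPT SELECT sku FROM product")) {
            while (rs.next()) {
                newSkus.add(rs.getString(1));
            }
        }
        return newSkus;
    }
}

The SKUs to update are then simply the CSV SKUs minus the returned new SKUs, or a second set-difference query run the other way around, as suggested above.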
Upvotes: 0
Reputation: 200296
I had the exact same problem, involving millions of records, and solved it pretty much exactly as you did. A constraint that may not be obvious to an outside observer is that we cannot use the regular Hibernate way of load-mutate-update, since that would create an inordinate amount of redundant traffic.
On closer reading, my approach differs from yours in that I don't retain any information beyond the processing of a single chunk. I process each chunk in full, including all inserts and updates, before proceeding to the next one. Only that way do you get a scalable solution.
The weakest point for me is the use of executeUpdate, which will not use the JDBC batching API. I was planning to make a custom implementation, but for my specific use case it turned out I didn't need more than one executeUpdate per chunk.
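Since the missing piece mentioned here is the JDBC batching API, a minimal hedged sketch of batching the per-chunk UPDATEs directly over JDBC might look like this; the table and column names are assumptions, not from the answer:

import java.math.BigDecimal;
import java.sql.*;
import java.util.Map;

// Hedged sketch: queues one UPDATE per changed row of a chunk through the
// JDBC batching API instead of issuing individual executeUpdate calls.
public class ChunkUpdater {

    public void updateChunk(Connection con, Map<String, BigDecimal> priceBySku) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE product SET price = ? WHERE sku = ?")) {
            for (Map.Entry<String, BigDecimal> e : priceBySku.entrySet()) {
                ps.setBigDecimal(1, e.getValue());
                ps.setString(2, e.getKey());
                ps.addBatch();           // queued, not executed yet
            }
            ps.executeBatch();           // sent in one batch per chunk, driver permitting
        }
    }
}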
Upvotes: 2