Reputation: 1601
Interview question
Say we have an Employee table with 2 million records, and we need to cut each employee's salary by 10% (some processing is required) and then save it back. How can we do this efficiently?
I told him we could use the executor framework to create multiple threads that fetch values from the table, process them, and save them to a list.
Then he asked how I would check whether a record had already been processed or not, and there I was clueless (how do you do that?).
I am not even sure whether my approach is good or not.
Please help.
Upvotes: 1
Views: 2038
Reputation: 425073
The best approach given the question as stated is to use pure SQL, something like:
update employees
set salary = salary * 0.9
It is very hard to imagine needing to do something to employee data that SQL could not handle.
If by some quirk of bad design you really needed to do something to employee data that SQL absolutely could not do, you would open a cursor over the rowset and iterate through it, making each update synchronously so that you only do one pass over the data.
In pseudo code:
cursor = forUpdate("select * from employees for update")
while (cursor.next()) {
    cursor.salary = cursor.salary * 0.9
}
This is the simplest and likely fastest executing approach.
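In Java/JDBC terms, that single-pass cursor loop might look roughly like the sketch below. This is only an illustration: the connection URL is a placeholder, and updatable result sets depend on driver support.
import java.math.BigDecimal;
import java.sql.*;

public class SalaryCut {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details; adjust for your database and driver.
        try (Connection con = DriverManager.getConnection("jdbc:yourdb://host/db", "user", "pass")) {
            con.setAutoCommit(false);
            try (Statement st = con.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
                 ResultSet rs = st.executeQuery("select id, salary from employees for update")) {
                while (rs.next()) {
                    // Row-by-row processing that plain SQL supposedly cannot express.
                    BigDecimal cut = rs.getBigDecimal("salary").multiply(new BigDecimal("0.9"));
                    rs.updateBigDecimal("salary", cut);
                    rs.updateRow();
                }
            }
            con.commit();
        }
    }
}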
---
It's only 2M rows, which is a "small" quantity, so most databases could handle it in a single transaction. If not, add a where clause, e.g. where id between <start> and <end>, to the query to chunk the process into loggable amounts if using the shell-script approach.
If using the code approach, most databases allow you to commit while holding the cursor open, so just commit every 10K rows or so.
Similar considerations apply to locking: all rows touched by such a query are locked for the duration of the transaction. Given how long it would take to run, pick a quiet time. If it's really a big deal, chunk it up, but realise that locking is unavoidable.
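A rough sketch of that chunk-and-commit idea in JDBC, assuming a numeric id column and a batch size of 10K (both are assumptions, not part of the original question):
import java.sql.*;

public class ChunkedSalaryCut {
    public static void main(String[] args) throws SQLException {
        final long chunkSize = 10_000;   // commit every 10K rows, as suggested above
        final long maxId = 2_000_000;    // assumed upper bound on employee ids
        // Placeholder connection details; adjust for your database.
        try (Connection con = DriverManager.getConnection("jdbc:yourdb://host/db", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                 "update employees set salary = salary * 0.9 where id between ? and ?")) {
            con.setAutoCommit(false);
            for (long start = 1; start <= maxId; start += chunkSize) {
                ps.setLong(1, start);
                ps.setLong(2, start + chunkSize - 1);
                int updated = ps.executeUpdate();
                con.commit();            // keep each transaction, and its locks, small
                System.out.println("ids " + start + "-" + (start + chunkSize - 1)
                        + ": " + updated + " rows updated");
            }
        }
    }
}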
Upvotes: 1
Reputation: 4443
One thing you could do is use a producer/consumer model, where you have one thread feeding the others the records to update. That way you would not have to worry as much about duplicate processing.
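A minimal sketch of that idea using a BlockingQueue, where a single producer hands out employee ids and a pool of consumers does the updates. The id range and the processAndSave step are placeholders for whatever fetching and processing you actually need:
import java.util.concurrent.*;

public class ProducerConsumerUpdate {
    private static final long POISON_PILL = -1L;

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Long> queue = new ArrayBlockingQueue<>(1_000);
        int workers = 4;
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Consumers: each id is taken from the queue exactly once, so no duplicate processing.
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    for (Long id = queue.take(); id != POISON_PILL; id = queue.take()) {
                        processAndSave(id);   // fetch the row, cut salary by 10%, write it back
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer: the single thread that decides which records need updating.
        for (long id = 1; id <= 2_000_000; id++) {
            queue.put(id);
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON_PILL);           // one shutdown signal per consumer
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void processAndSave(long employeeId) {
        // Placeholder for the real work: read the employee, apply the cut, update the row.
    }
}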
Upvotes: 1
Reputation: 731
I would add a column to this table for the processing state. By default, this column would be set to "Not Processed". Once a thread starts processing an employee it would change the state to "Processing", and when finished it would switch it to "Processed".
Having three states like this also lets you use the column as a lock, preventing the same record from being processed twice.
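A minimal sketch of that idea in JDBC, assuming a status column with values NOT_PROCESSED / PROCESSING / PROCESSED (the column and value names are just illustrative). The conditional update is what makes the column act as a lock: only the thread whose update actually changes the row wins the claim.
import java.sql.*;

public class ClaimAndProcess {
    // Returns true only if this thread won the claim, so each row is processed exactly once.
    static boolean claim(Connection con, long employeeId) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "update employees set status = 'PROCESSING' " +
                "where id = ? and status = 'NOT_PROCESSED'")) {
            ps.setLong(1, employeeId);
            return ps.executeUpdate() == 1;   // 0 means another thread claimed it first
        }
    }

    // Apply the 10% cut and mark the row as done.
    static void process(Connection con, long employeeId) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "update employees set salary = salary * 0.9, status = 'PROCESSED' " +
                "where id = ?")) {
            ps.setLong(1, employeeId);
            ps.executeUpdate();
        }
    }
}
Each worker would call claim(...) for an id and only run process(...) when it returns true.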
Upvotes: 0