Reputation: 2434
I am using Django with Python 2.7. I have an Excel sheet with millions of rows. I have to manipulate each row's data and save it back to the database (PostgreSQL). I want to do this efficiently. Here are the approaches I am considering:
1.) Enqueue all the rows in a queue (preferably RabbitMQ), fetch them in batches of 100, process each batch, and save the results to the database.
2.) Run background threads, each handling 100 rows and saving the results back to the database. I'm not sure how many database connections would be opened in this scenario.
Can you please suggest an efficient way to achieve this? It would be very helpful.
Upvotes: 0
Views: 2283
Reputation: 5867
You can create additional threads within your Django request handler, and each one will have its own database connection. But the question then becomes: how much will your database insert performance improve with multiple threads?
For example, if your table has unique constraints, those can slow down concurrent writes. You may also find that your real bottleneck is disk bandwidth, and that adding connections won't let you scale much.
So you might want to write some quick-and-dirty test code to try multi-threaded concurrent writes to the table/database in question before you go too far with optimization.
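For instance, a quick-and-dirty concurrency test could look something like the sketch below (Python 2.7 / Django ORM). MyRow is a hypothetical model with a single "value" field; swap in your real model, then run it with 1, 2, 4... threads and compare the wall-clock times.

```python
# Quick-and-dirty concurrent-write test (Python 2.7 / Django ORM).
# MyRow is a hypothetical model with a single "value" field.
import threading
import time

from django.db import connection
from myapp.models import MyRow  # hypothetical model


def writer(thread_id, row_count):
    for i in xrange(row_count):
        MyRow.objects.create(value='thread %d row %d' % (thread_id, i))
    connection.close()  # each thread gets its own connection; close it when done


def run_test(num_threads=4, rows_per_thread=1000):
    threads = [threading.Thread(target=writer, args=(t, rows_per_thread))
               for t in xrange(num_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print 'wrote %d rows in %.2f seconds' % (num_threads * rows_per_thread,
                                             time.time() - start)
```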
As far as how you get the data efficiently divided between threads, that depends upon the input format.
If you are dealing with a Microsoft format file such as .xls, then you will want to use a library to parse it; I have used xlrd with success. But this puts all of your spreadsheet data in memory at once; you can't read it one row at a time. Assuming you can handle the memory usage, though, it makes your threading simple: once you've read all the data into memory, start multiple writer threads, telling each one the range of row numbers it is responsible for writing (see the sketch below). The main request thread can then just join the writer threads, and when they are all finished, it can return a response to the user. However, keep in mind that if your request takes too long, the browser will time out. That's a different problem, and for that I suggest looking at another answer I wrote the other day about how to use StreamingHttpResponse: Right way to delay file download in Django
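Here is a minimal sketch of that threading layout, assuming xlrd, a hypothetical MyRow model, and a two-column sheet (adapt all of those to your data):

```python
# Sketch: read the whole .xls with xlrd, then split the row range
# across writer threads.  MyRow and the column layout are assumptions.
import threading

import xlrd
from django.db import connection
from myapp.models import MyRow  # hypothetical model

NUM_THREADS = 4


def write_rows(sheet, start, end):
    # Each thread writes its own slice of row numbers.
    for r in xrange(start, end):
        values = sheet.row_values(r)
        MyRow.objects.create(col_a=values[0], col_b=values[1])
    connection.close()  # Django connections are per-thread


def import_xls(path):
    book = xlrd.open_workbook(path)   # loads the whole file into memory
    sheet = book.sheet_by_index(0)
    chunk = (sheet.nrows + NUM_THREADS - 1) // NUM_THREADS
    threads = []
    for i in xrange(NUM_THREADS):
        start = i * chunk
        end = min(start + chunk, sheet.nrows)
        if start >= end:
            break
        threads.append(threading.Thread(target=write_rows,
                                        args=(sheet, start, end)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()   # the main request thread waits for all writers
```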
Now, if your input format is something like a .csv file that you can read one record at a time, a possible way to handle this would be to create an in-memory queue using Python's Queue class ( https://docs.python.org/2/library/queue.html ). Start your writer threads and have them listen on the queue for records to write to the database, then have your main thread read the .csv file one record at a time and put those records in the queue, as sketched below.
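A sketch of that queue-based layout follows; note that on Python 2.7 the module is spelled Queue, and the model and column indexes are again placeholders:

```python
# Sketch: one reader (the main thread) feeding several writer threads
# through a bounded in-memory queue.  MyRow and the columns are assumptions.
import csv
import threading
import Queue

from django.db import connection
from myapp.models import MyRow  # hypothetical model

SENTINEL = None  # tells a writer thread there is nothing left to do


def writer(queue):
    while True:
        row = queue.get()
        if row is SENTINEL:
            break
        MyRow.objects.create(col_a=row[0], col_b=row[1])
    connection.close()


def import_csv(path, num_writers=4):
    queue = Queue.Queue(maxsize=1000)  # bounded, so memory stays flat
    threads = [threading.Thread(target=writer, args=(queue,))
               for _ in xrange(num_writers)]
    for t in threads:
        t.start()
    with open(path, 'rb') as f:        # 'rb' for the csv module on 2.7
        for row in csv.reader(f):
            queue.put(row)             # blocks if the queue is full
    for _ in threads:
        queue.put(SENTINEL)            # one sentinel per writer
    for t in threads:
        t.join()
```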
These suggestions all involve processing your data right within the Django request. But if you don't want to do that, yes, you can offload the processing in various ways. You can certainly use rabbitmq as you mentioned, and have multiple listener processes doing the writing. This will work, but I'm not sure it will be optimally efficient. You need to write ALL the records, so it's not necessarily helpful to break them up, ship them to another process, then have them shipped from there to yet another process... unless those other processes are running on other machines.
If you're reading the data from a file that's already written to disk, and it's a file format that's easily divisible (again, such as CSV), an easy (and very classic) way to do this is to divide the size of the file by the number of writer threads you have. Tell each writer the start and end offset of the portion of the file you'd like it to process. Each writer (except the one starting at offset 0) can seek forward until it finds a record delimiter (e.g. \r\n). Then it begins reading and processing input records one at a time until it reads to a position equal to or past its ending offset.
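Here is a sketch of that offset scheme, assuming newline-terminated records and a hypothetical process_record() callback that does the per-record parsing and database write:

```python
# Sketch: split a file into contiguous byte ranges and let each worker
# process only the records that start inside its range.
# process_record() is a hypothetical per-record callback.
import os


def split_offsets(path, num_workers):
    # Contiguous (start, end) byte ranges, one per worker; assumes the
    # file is much larger than the number of workers.
    size = os.path.getsize(path)
    chunk = size // num_workers
    bounds = [i * chunk for i in xrange(num_workers)] + [size]
    return [(bounds[i], bounds[i + 1]) for i in xrange(num_workers)]


def process_slice(path, start, end, process_record):
    with open(path, 'rb') as f:
        f.seek(start)
        if start != 0:
            f.readline()          # skip the (possibly partial) record we landed in
        while f.tell() <= end:    # records starting at or before 'end' are ours
            line = f.readline()
            if not line:
                break             # end of file
            process_record(line)
```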
If you're reading the file from the Request, and you want to process it outside the request handler, you might find that it's more efficient to just write it to disk as one big file as it comes in, then process it afterward as described above.
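If you go that route, streaming the upload to a temporary file chunk by chunk keeps memory usage flat; something like this sketch, where the form field name is an assumption:

```python
# Sketch: save an uploaded file to disk as it arrives, for later processing.
import os
import tempfile


def save_upload_to_disk(request):
    uploaded = request.FILES['data_file']   # hypothetical form field name
    fd, path = tempfile.mkstemp(suffix='.csv')
    with os.fdopen(fd, 'wb') as out:
        for chunk in uploaded.chunks():     # stream; avoids loading it all into RAM
            out.write(chunk)
    return path   # hand this path to your writer threads or worker processes
```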
In summary, try to find ways to handle the data that involve the fewest reads and writes. Process data in place if you can, and don't move it around if you don't have to. Avoid shipping the data to other processes if they're on the same machine, since the task at hand is disk-intensive. If you do want the ability to scale easily to more machines, then sure, try rabbitmq. I use it; it's good, and fast. But it will add overhead, so unless you are getting some real benefit from breaking things up, it might slow you down. It does make interprocess communication super easy, though.
Upvotes: 2