Dinakar Ullas

Reputation: 325

Python cx_oracle concurrent fetch

When using Python cx_Oracle, we set cursor.arraysize = 10000. I am assuming this means the Python client receives data sequentially in batches of 10K from the Oracle database. Say we are pulling 500 million records, the database can handle the load, the server running Python has enough resources, and network bandwidth is not an issue. How can we keep the batch size at 10K but parallelize the fetch, with a concurrency of, say, 15? After pulling 10K records, we create files and FTP them to object storage. All I am trying to figure out is how to implement concurrency while fetching the data from the database.

Upvotes: 0

Views: 65

Answers (1)

Paul W

Reputation: 11538

The arraysize sets the fetch size - the number of rows in each fetch operation. Every fetch requires a round trip to the database (several TCP packets back and forth) which accumulates network latency. If the fetch size is too small and your network is rather slow, you can end up spending most of your time on network latency rather than on useful work.
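As a concrete baseline, the single-process flow described in the question might look like the sketch below. The connection object, the SQL, and the upload step are placeholders; fetchmany() is used so that each loop iteration handles exactly one arraysize-sized batch (one round trip):

```python
import csv

BATCH_SIZE = 10_000  # matches cursor.arraysize in the question

def export_batches(connection, sql, path_prefix="batch"):
    """Sequential baseline: one fetchmany() round trip per 10k-row file."""
    cursor = connection.cursor()
    cursor.arraysize = BATCH_SIZE      # rows fetched per network round trip
    cursor.execute(sql)
    batch_no = 0
    while True:
        rows = cursor.fetchmany(BATCH_SIZE)
        if not rows:
            break
        path = f"{path_prefix}_{batch_no:06d}.csv"
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(rows)
        # placeholder: FTP/upload `path` to object storage here
        batch_no += 1
    return batch_no
```

With a cx_Oracle connection passed in, this fetches the whole result set one 10K batch at a time; the sections below are about running several of these loops at once.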

However, you are still limited by the speed of a single process consuming the data stream on the client side. If the data volume being fetched is truly massive, concurrent fetches may be the answer to improving throughput. Doing so requires multiple Python processes running concurrently, each pulling a portion of the data. You will, however, need to instrument this with your own programmatic controls: forking the child processes, assigning a thread number to each, watching them from a parent to see when they are done, checking status, and handling errors in the children. It requires some pipework, but it can definitely be done.
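A minimal sketch of that pipework, using the standard multiprocessing module: each worker gets a thread number 0..N-1, and the Pool takes care of forking, waiting, and surfacing errors. The worker body here is a placeholder for the real cx_Oracle fetch/file/FTP work:

```python
import multiprocessing as mp

CONCURRENCY = 15  # number of concurrent fetch processes

def fetch_slice(thread_no):
    # Placeholder: in the real job this would connect with cx_Oracle,
    # run this thread's query, write files, and FTP them to object storage.
    return (thread_no, "done")

def run_all(concurrency=CONCURRENCY):
    # Pool forks the children, watches them, and re-raises child errors
    # in the parent; map() blocks until every slice is finished.
    with mp.Pool(processes=concurrency) as pool:
        results = pool.map(fetch_slice, range(concurrency))
    return results
```

Each child must open its own database connection inside fetch_slice; cx_Oracle connections can't be shared across forked processes.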

Folks often don't give enough thought to the database side of things, though. If you are single-threaded, then once your SQL execution is complete and the client begins to issue fetch calls, the database is fairly idle: all the time is spent on the network and in the client process. With concurrent processes, this changes. The bigger issue, however, isn't the fetching but the SQL execution itself. You don't want 15 separate database sessions all scanning the same tables and doing the same joins, each thread inefficiently repeating work that all the other threads are doing already. This can easily put too much load on the database, especially if each session is granted parallel query (PQ) in the database.

The best way to handle this is to partition your table(s) by thread number, or by some value (like a virtual column defined as a MOD on a numeric PK column) that each thread can associate itself with. Then have each thread issue its SQL and isolate (partition prune) its table scans to the thread value in question. If each session is reading a separate partition there's no duplicate work occurring, and you can ramp up the concurrency quite a bit to get maximum throughput without overloading the database.
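One way to sketch the per-thread slicing described above (table and column names are placeholders): each worker's query carries a MOD predicate on a numeric PK, and if the table is partitioned on a matching MOD-based virtual column, the optimizer can prune each session's scan to a single partition:

```python
CONCURRENCY = 15  # one database session per slice

def slice_sql(thread_no, concurrency=CONCURRENCY):
    """Build a disjoint-slice query for one worker via MOD on a numeric PK.

    If big_table is list-partitioned on a virtual column defined as
    MOD(pk_col, <concurrency>), this predicate allows partition pruning,
    so no two sessions scan the same data.
    """
    sql = ("SELECT * FROM big_table "
           "WHERE MOD(pk_col, :n_threads) = :thread_no")
    return sql, {"n_threads": concurrency, "thread_no": thread_no}
```

A worker would then pass both values to cursor.execute(sql, binds) on its own connection, with arraysize still set to 10K.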

If you can't do that, the next best thing is to use a create-table-as-select (CTAS) to populate a work table with your SQL (assuming it has joins and does work beyond simply dumping one table). Define the partitioning to align with your threads in the CTAS operation itself so that it creates a properly partitioned table and loads it in one operation. Also be sure to use the parallel (degree ...) property to leverage PX in the database. Do this up front in your parent process before you fork the child processes. Then disconnect and unleash all your children, each hitting one partition of the work table. The downside is that you write the data twice (once in the database, again on your client), but at least you aren't duplicating SQL work, which means you can ramp up the concurrency without putting the database at risk.
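The parent-side CTAS step could be generated along these lines: a work table list-partitioned on a MOD-based thread key and loaded in parallel, built before the children are forked. Table names, the degree, and the source query are all placeholders, and the exact clause ordering should be checked against your Oracle version:

```python
CONCURRENCY = 15

def ctas_sql(concurrency=CONCURRENCY, degree=8):
    # One list partition per worker, keyed on MOD(pk_col, concurrency);
    # PARALLEL (DEGREE n) lets PX do the heavy SQL work once, up front.
    parts = ", ".join(f"PARTITION p{i} VALUES ({i})"
                      for i in range(concurrency))
    return (
        f"CREATE TABLE work_extract\n"
        f"PARTITION BY LIST (thread_no) ({parts})\n"
        f"PARALLEL (DEGREE {degree})\n"
        f"AS SELECT MOD(t.pk_col, {concurrency}) AS thread_no, t.*\n"
        f"FROM big_table t  -- placeholder for the real join/SQL"
    )

def build_work_table(connection):
    # Runs in the parent, before forking; children each scan one partition.
    connection.cursor().execute(ctas_sql())
```

After the children finish and the files are uploaded, the parent can simply drop work_extract.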

If you do this right, you can obtain higher throughput than a single process can give you. But you have to do it right: if you simply repeat the same heavy SQL work in each thread, your DBA will shut you down and tell you to limit your concurrency. Either of the techniques above avoids that and keeps everyone happy.

Upvotes: 2
