Juan Carlos Oropeza

Reputation: 48197

PostgreSQL: how to split a query across multiple CPUs

I have a stored procedure

DO_STUFF(obj rowFromMyTable) 

This takes obj, processes some data, and saves the result to an independent table, so the order in which I process the objects isn't important:

DO_STUFF(objA); DO_STUFF(objB); < == >  DO_STUFF(objB); DO_STUFF(objA);

The thing is, I want to create a stored procedure that processes all the objects, but it uses only a single CPU:

for obj in SELECT obj from tblSOURCE
loop
    perform DO_STUFF(obj);
end loop;

I want to split the processing across multiple CPUs so things finish faster.
The only thing I could think of was to use two pgAdmin windows and run two different stored procedure calls, one in each window.

-- one window runs with the filter
(SELECT obj from tblSOURCE where id between 1 and 100000)

-- and the other with
(SELECT obj from tblSOURCE where id between 100001 and 200000)

Any ideas on how I could do this within a single stored procedure?

Upvotes: 2

Views: 5826

Answers (2)

Robins Tharakan

Reputation: 2473

Two ways to do this (both work on Windows / Linux / Mac):

  • PostgreSQL 9.6+ should now be able to parallelize your queries (automatically) to some extent, so you may first want to check whether you need to take the pain of splitting the queries yourself at all.

  • Use dblink and connect to the database via multiple callbacks. The best part about dblink is that these can be fire-and-forget (i.e. asynchronous) calls, so they can be issued in quick succession and you can then wait until they all complete (although you'd need to weave the wait-for-result logic yourself; see the sketch below the example). The drawback (as compared to synchronous calls) is that unless you keep track of things like process failures / timeouts etc., you may wrongly assume that since the calls went through (successfully) all data was processed, when it's actually possible that some calls failed (asynchronously).

Example

-- open a named dblink connection first (the connection string here is illustrative)
SELECT dblink_connect('testconn', 'dbname=mydb');
SELECT * FROM dblink_send_query('testconn', 'SELECT do_stuff_wrapper(0, 5000)') AS t1;
SELECT dblink_is_busy('testconn');
SELECT * FROM dblink_get_result('testconn') AS t1(c1 TEXT, c2 TEXT, ....);

Update: Exemplify using dblink's asynchronous functions.
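
Putting it together, a rough sketch of the full fire-and-forget pattern could look like the block below. It assumes the dblink extension is installed and that a do_stuff_wrapper(_offset, _limit) function exists (as in the other answer); the connection string, the worker count (4), the slice size (5000), and the single text result column are placeholders you would adapt.

DO $$
BEGIN
    -- fire one asynchronous call per worker connection
    FOR i IN 0..3 LOOP
        PERFORM dblink_connect('conn' || i, 'dbname=mydb');  -- illustrative connection string
        PERFORM dblink_send_query('conn' || i,
            format('SELECT do_stuff_wrapper(%s, 5000)', i * 5000));
    END LOOP;
    -- wait for each call to finish, fetch its result, then disconnect
    FOR i IN 0..3 LOOP
        WHILE dblink_is_busy('conn' || i) = 1 LOOP
            PERFORM pg_sleep(0.5);
        END LOOP;
        PERFORM * FROM dblink_get_result('conn' || i) AS t(res text);  -- assumes a single text-compatible result column
        PERFORM dblink_disconnect('conn' || i);
    END LOOP;
END
$$;

Inspecting each result instead of discarding it is where you would add the failure/timeout tracking mentioned above.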

Upvotes: 2

khampson

Reputation: 15306

A technique I like to use to get quick multi-threading for queries is to combine psql with GNU Parallel (http://www.gnu.org/software/parallel/parallel_tutorial.html), which lets multiple psql commands run at once.

If you create a wrapper stored procedure containing the loop and add arguments to it to take an offset and a limit, you can then create a quick bash script (or Python, Perl, et al) to generate the series of psql commands that are needed.

The file containing the commands can be piped into parallel and either take all the CPUs available, or a number you determine (I often like to use 4 CPUs, so as to also keep a lid on I/O on the box, but it would depend on the hardware you have).

Let's say the wrapper is called do_stuff_wrapper(_offset, _limit). The offset and limit would apply to the select:

select obj from tblSOURCE order by id offset _offset limit _limit
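
For illustration, a minimal sketch of that wrapper might look like the following (assuming tblSOURCE has a stable id key to order by, as in the question; names and types are guesses to adapt):

CREATE OR REPLACE FUNCTION do_stuff_wrapper(_offset bigint, _limit bigint)
RETURNS void AS $$
DECLARE
    r record;
BEGIN
    -- order by a stable key so the offset/limit slices neither overlap nor skip rows
    FOR r IN
        SELECT obj FROM tblSOURCE ORDER BY id OFFSET _offset LIMIT _limit
    LOOP
        PERFORM DO_STUFF(r.obj);
    END LOOP;
END;
$$ LANGUAGE plpgsql;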

Your generated psql command file (let's call it parallel.dat) might look something like this:

psql -X -h HOST -U user database -c "select do_stuff_wrapper(0, 5000);"
psql -X -h HOST -U user database -c "select do_stuff_wrapper(5000, 5000);"
psql -X -h HOST -U user database -c "select do_stuff_wrapper(10000, 5000);"

and so on.

Then you can run the commands like this:

cat parallel.dat | parallel -j 4 {}

This gets multiple psql commands running in concert. Parallel will also group the output for you (if any, such as NOTICEs) so that it ends up in command order.

Edit: If you're running on Windows, you could perhaps install Cygwin, and then use parallel from there. Another, pure-Windows option would be to look into Powershell to accomplish something akin to parallel (see Can Powershell Run Commands in Parallel?).

Upvotes: 2
