MikeBeaton

Reputation: 3892

Correct use of cursors for very large result sets in Postgres

Short version of my question:

If I hold a cursor reference to an astronomically huge result set in my client code, would it be ridiculous (i.e. would it completely defeat the point of cursors) to issue "FETCH ALL FROM cursorname" as my next command? Or would this slowly stream the data back to me as I consume it (at least in principle, assuming that I have a well-written driver sitting between me and Postgres)?

More detail

If I understand things at all correctly, then Postgres cursors are REALLY for dealing with the following problem [even though they can be used (abused?) for other things, such as returning multiple different result sets from one function]:

Note: The current implementation of RETURN NEXT and RETURN QUERY stores the entire result set before returning from the function, as discussed above. That means that if a PL/pgSQL function produces a very large result set, performance might be poor: data will be written to disk to avoid memory exhaustion, but the function itself will not return until the entire result set has been generated.

(ref: https://www.postgresql.org/docs/9.6/static/plpgsql-control-structures.html)
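
To make that concrete, the note is talking about functions of roughly this shape (a sketch of my own, reusing the AstronomicallyLargeTable name from my questions below):

    -- Sketch: RETURN QUERY accumulates the ENTIRE result set (spilling
    -- to disk beyond work_mem) before the function returns anything.
    CREATE OR REPLACE FUNCTION all_rows()
       RETURNS SETOF AstronomicallyLargeTable
       LANGUAGE plpgsql AS
    $$BEGIN
       RETURN QUERY SELECT * FROM AstronomicallyLargeTable;
    END;$$;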

But (again, if I understand correctly) when you write a function which returns a cursor, the whole result set is NOT buffered into memory (or to disk) before the user of the function can start to consume anything; instead, the results can be consumed bit by bit. (There is more overhead setting up and using the cursor, but it's worth it to avoid massive buffer allocation for very large result sets.)

(ref: https://www.postgresql.org/docs/9.6/static/plpgsql-cursors.html#AEN66551)

I would like to understand how this relates to SELECTs and FETCHes over the wire to a Postgres server.

In all cases, I'm talking about consuming results from client code which is communicating with Postgres on a socket behind the scenes (using the Npgsql library in my case, actually).

Q1: What if I try to execute "SELECT * FROM AstronomicallyLargeTable" as my only command over the wire to Postgres? Will that allocate all the memory for the entire result set and then start to send data back to me? Or will it (effectively) generate its own cursor and stream the data back a little at a time (with no huge additional buffer allocation on the server)?

Q2: What if I already have a cursor reference to an astronomically large result set (say because I've already done one round trip, and got back the cursor reference from some function), and then I execute "FETCH ALL FROM cursorname" over the wire to Postgres? Is that stupid, because it will allocate ALL the memory for all the results on the Postgres server before sending anything back to me? Or will "FETCH ALL FROM cursorname" actually work as I'd like it to, streaming the data back slowly as I consume it, without any massive buffer allocation happening on the Postgres server?

EDIT: Further clarification

I'm asking about a case where I know that my data access layer streams the data from the server to me one row at a time (so no large client-side buffers involved there, however long the data streams for) and where I also know that my own application consumes the data one row at a time and then discards it (so no client-side buffers there, either). I definitely DON'T want to fetch all these rows into client-side memory and then do something with them. I see that that would be completely daft!

So I think all the issues (for the just-described use case) are about how long PostgreSQL would take to start streaming, and how much of a memory buffer it would allocate, for a FETCH ALL. IF (and it's a big 'IF'...) PostgreSQL doesn't allocate a huge buffer of all rows before starting, and if it streams the rows back to Npgsql one at a time, starting quickly, then I believe (but please tell me why/if I'm wrong) that there still IS a clear use case for FETCH ALL FROM cursorname!
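
For comparison, the batched alternative I'd be weighing FETCH ALL against is something like this (a sketch; cursorname stands for whatever refcursor the function handed back):

    BEGIN;
    -- ...call the function that opens and returns the cursor, then:
    FETCH FORWARD 1000 FROM cursorname;   -- repeat until it returns no rows
    FETCH FORWARD 1000 FROM cursorname;
    -- ...
    COMMIT;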

Upvotes: 21

Views: 20921

Answers (2)

Laurenz Albe

Reputation: 247950

After some experimenting, it seems that PostgreSQL behaves like this:

  • Fetching many rows with SELECT * FROM large will not create a temporary file on the server side; the data are streamed to the client as they are scanned.

  • If you create a server-side cursor with a function that returns a refcursor and then fetch rows from that cursor, the returned rows are collected on the server first. With FETCH ALL, this leads to the creation of a temporary file.

Here are my experiments with a table that contains 1000000 rows. work_mem is set to 64kB (the minimum), and log_temp_files is set to 0 so that temporary files are reported in the server log.
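
The setup can be reproduced roughly like this (the two SET commands match the settings just described; the CREATE TABLE is only one plausible way to generate suitable test data):

    -- One million integer ids (just one way to build such a table)
    CREATE TABLE large AS
       SELECT generate_series(1, 1000000) AS id;

    SET work_mem = '64kB';     -- the minimum allowed value
    SET log_temp_files = 0;    -- log every temporary file that gets created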

  • First attempt:

    SELECT id FROM large;
    

    Result: no temporary file is created.

  • Second attempt:

    CREATE OR REPLACE FUNCTION lump() RETURNS refcursor
       LANGUAGE plpgsql AS
    $$DECLARE
       c CURSOR FOR SELECT id FROM large;
    BEGIN
       c := 'c';  -- name the portal explicitly, so the client can FETCH ... FROM c
       OPEN c;
       RETURN c;
    END;$$;
    
    BEGIN;
    SELECT lump();
     lump
    ------
     c
    (1 row)
    
    FETCH NEXT FROM c;
     id
    ----
      1
    (1 row)
    
    FETCH NEXT FROM c;
     id
    ----
      2
    (1 row)
    
    COMMIT;
    

    Result: no temporary file is created.

  • Third attempt:

    BEGIN;
    SELECT lump();
     lump
    ------
     c
    (1 row)
    
    FETCH ALL FROM c;
       id
    ---------
           1
           2
           3
    ...
      999999
     1000000
    (1000000 rows)
    
    COMMIT;
    

    Result: a temporary file of about 140MB is created.
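
With log_temp_files = 0, the third attempt is also reported in the server log with an entry along these lines (the path and size shown here are illustrative, not copied from a real log):

    LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp12345.0", size 146800640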

I don't really know why PostgreSQL behaves that way.

Upvotes: 10

DB140141

Reputation: 471

One thing that is missing from your question is whether you really need a PL/pgSQL function as opposed to an inlinable SQL function. I only bring it up because your description is a simple scenario: SELECT * FROM hugetable. So I am going to answer the question based on that information.

In that case, your problem is not really a problem, because the function call can be made invisible to the planner. My point is that if you can write the function as an inlinable SQL function (you don't indicate one way or the other whether you can), then you don't need to worry about this particular limitation of PL/pgSQL's RETURN QUERY.

-- A simple SQL function; the planner can inline its body into the calling query
CREATE OR REPLACE FUNCTION foo()
RETURNS TABLE (id INT)
AS
$BODY$
SELECT * FROM bar;
$BODY$
LANGUAGE SQL STABLE;

Look at the plan:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM foo() LIMIT 1;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.01 rows=1 width=4) (actual time=0.017..0.017 rows=1 loops=1)
   Buffers: shared hit=1
   ->  Seq Scan on bar  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.014..0.014 rows=1 loops=1)
         Buffers: shared hit=1
 Planning time: 0.082 ms
 Execution time: 0.031 ms
(6 rows)

There is no sign of the entire result set being materialized and then returned: only a single buffer is touched, and the LIMIT is applied directly to the sequential scan, because the function body has been inlined into the calling query.

https://wiki.postgresql.org/wiki/Inlining_of_SQL_functions
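
For contrast, a PL/pgSQL version of the same function (a hypothetical foo_plpgsql, shown only to illustrate the difference) cannot be inlined; its plan would show a Function Scan instead, and the RETURN QUERY materialization discussed in the question applies:

-- Hypothetical contrast: NOT inlinable, so the whole result set
-- is materialized inside the function before the LIMIT can apply
CREATE OR REPLACE FUNCTION foo_plpgsql()
RETURNS TABLE (id INT)
AS
$BODY$
BEGIN
    RETURN QUERY SELECT * FROM bar;
END;
$BODY$
LANGUAGE plpgsql STABLE;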

I will defer to the other answers if you really need PL/pgSQL to do some non-SQL work, but this really needed to be said here.

Upvotes: 1
