Reputation: 116
I'm stuck with a (Postgres 9.4.6) query that a) uses too much memory (most likely due to array_agg()) and b) does not return exactly what I need, making post-processing necessary. Any input (especially regarding memory consumption) is highly appreciated.
Explanation:
The table token_groups holds all words used in tweets I've parsed, with their respective occurrence frequency in an hstore, one row per 10 minutes (for the last 7 days, so 7*24*6 rows in total). These rows are inserted in order of tweeted_at, so I can simply order by id. I'm using row_number to identify when a word occurred.
# \d token_groups
                             Table "public.token_groups"
   Column   |            Type             |                         Modifiers
------------+-----------------------------+-----------------------------------------------------------
 id         | integer                     | not null default nextval('token_groups_id_seq'::regclass)
 tweeted_at | timestamp without time zone | not null
 tokens     | hstore                      | not null default ''::hstore
Indexes:
    "token_groups_pkey" PRIMARY KEY, btree (id)
    "index_token_groups_on_tweeted_at" btree (tweeted_at)
What I'd ideally want is a list of words, each with the relative distances between its row numbers. So if e.g. the word 'hello' appears once in row 5, twice in row 8 and once in row 20, I'd want a column with the word and an array column returning {5,3,0,12} (meaning: first occurrence in the fifth row, next occurrence 3 rows later, next occurrence 0 rows later, next one 12 rows later). If anyone wonders why: 'relevant' words occur in clusters, so (simplified) the higher the standard deviation of these distances in time, the more likely a word is a keyword. See more here: http://bioinfo2.ugr.es/Publicaciones/PRE09.pdf
For now, I return an array with positions and an array with frequencies, and use this info to calculate the distances in Ruby.
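Just for context, the scoring those distances eventually feed into is basically a standard deviation over the array; in SQL terms it would be roughly this (using the example {5,3,0,12} from above):

SELECT stddev_samp(d) AS score
FROM   unnest('{5,3,0,12}'::int[]) AS t(d);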
Currently the primary problem is a high memory spike, which seems to be caused by array_agg(). The (very helpful) Heroku staff tell me that some of my connections use 500-700 MB with very little shared memory, causing out-of-memory errors (I'm running Standard-0, which gives me 1 GB total for all connections), so I need to find an optimization.
The total number of hstore entries is ~100k, which is then aggregated (after skipping words with very low frequency):
SELECT COUNT(*)
FROM  (SELECT row_number() OVER (ORDER BY id ASC) AS position,
              (each(tokens)).key, (each(tokens)).value::integer
       FROM   token_groups) subquery;

 count
--------
 106632
Here is the query causing the memory load:
SELECT key, array_agg(pos) AS positions, array_agg(value) AS frequencies
FROM  (
   SELECT row_number() OVER (ORDER BY id ASC) AS pos,
          (each(tokens)).key,
          (each(tokens)).value::integer
   FROM   token_groups
   ) subquery
GROUP BY key
HAVING SUM(value) > 10;
The output is:
     key     |                        positions                         |          frequencies
-------------+---------------------------------------------------------+-------------------------------
 hello       | {172,185,188,210,349,427,434,467,479}                   | {1,2,1,1,2,1,2,1,4}
 world       | {166,218,265,343,415,431,436,493}                       | {1,1,2,1,2,1,2,1}
 some        | {35,65,101,180,193,198,223,227,420,424,427,428,439,444} | {1,1,1,1,1,1,1,2,1,1,1,1,1,1}
 other       | {77,111,233,416,421,494}                                | {1,1,4,1,2,2}
 word        | {170,179,182,184,185,186,187,188,189,190,196}           | {3,1,1,2,1,1,1,2,5,3,1}
 (...)
Here's what explain says:
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=12789.00..12792.50 rows=200 width=44) (actual time=309.692..343.064 rows=2341 loops=1)
   Output: ((each(token_groups.tokens)).key), array_agg((row_number() OVER (?))), array_agg((((each(token_groups.tokens)).value)::integer))
   Group Key: (each(token_groups.tokens)).key
   Filter: (sum((((each(token_groups.tokens)).value)::integer)) > 10)
   Rows Removed by Filter: 33986
   Buffers: shared hit=2176
   ->  WindowAgg  (cost=177.66..2709.00 rows=504000 width=384) (actual time=0.947..108.157 rows=106632 loops=1)
         Output: row_number() OVER (?), (each(token_groups.tokens)).key, ((each(token_groups.tokens)).value)::integer, token_groups.id
         Buffers: shared hit=2176
         ->  Sort  (cost=177.66..178.92 rows=504 width=384) (actual time=0.910..1.119 rows=504 loops=1)
               Output: token_groups.id, token_groups.tokens
               Sort Key: token_groups.id
               Sort Method: quicksort  Memory: 305kB
               Buffers: shared hit=150
               ->  Seq Scan on public.token_groups  (cost=0.00..155.04 rows=504 width=384) (actual time=0.013..0.505 rows=504 loops=1)
                     Output: token_groups.id, token_groups.tokens
                     Buffers: shared hit=150
 Planning time: 0.229 ms
 Execution time: 570.534 ms
PS: if anyone wonders: every 10 minutes I append new data to the token_groups table and remove outdated data, which is easy when storing one row per 10 minutes. I still have to come up with a better data structure that e.g. uses one row per word, but that does not seem to be the main issue; I think it's the array aggregation.
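In case it helps, the maintenance step is roughly this (just a sketch; the hstore literal stands in for whatever my parser produces for the current 10-minute window):

-- append the newest 10-minute window (placeholder hstore value)
INSERT INTO token_groups (tweeted_at, tokens)
VALUES (now(), 'hello=>2, world=>1'::hstore);

-- drop rows older than the 7-day window
DELETE FROM token_groups
WHERE  tweeted_at < now() - interval '7 days';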
Upvotes: 2
Views: 1707
Reputation: 656734
The query you presented can be simpler, evaluating each() only once per row:
SELECT key, array_agg(pos) AS positions, array_agg(value) AS frequencies
FROM  (
   SELECT t.key, pos, t.value::int
   FROM  (SELECT row_number() OVER (ORDER BY id) AS pos, * FROM token_groups) g
        , each(g.tokens) t  -- implicit LATERAL join
   ORDER  BY t.key, pos
   ) sub
GROUP  BY key
HAVING sum(value) > 10;
This also preserves the correct order of elements.
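If you prefer it spelled out, the same query with explicit CROSS JOIN LATERAL (purely a notational variant; it produces the same result):

SELECT key, array_agg(pos) AS positions, array_agg(value) AS frequencies
FROM  (
   SELECT t.key, pos, t.value::int
   FROM  (SELECT row_number() OVER (ORDER BY id) AS pos, * FROM token_groups) g
   CROSS  JOIN LATERAL each(g.tokens) t
   ORDER  BY t.key, pos
   ) sub
GROUP  BY key
HAVING sum(value) > 10;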
What I'd ideally want is a list of words, each with the relative distances between its row numbers.
This would do it:
SELECT key, array_agg(step) AS occurrences
FROM  (
   SELECT key, CASE WHEN g = 1 THEN pos - last_pos ELSE 0 END AS step
   FROM  (
      SELECT key, value::int, pos
           , lag(pos, 1, 0) OVER (PARTITION BY key ORDER BY pos) AS last_pos
      FROM  (SELECT row_number() OVER (ORDER BY id)::int AS pos, * FROM token_groups) g
           , each(g.tokens) t
      ) t1
    , generate_series(1, t1.value) g
   ORDER  BY key, pos, g
   ) sub
GROUP  BY key
HAVING count(*) > 10;
Interpreting each hstore key as a word and the respective value as its number of occurrences in the row (= in the last 10 minutes), I use two cascading LATERAL joins: the 1st step to decompose the hstore, the 2nd step to multiply rows according to value. (If your value (frequency) is mostly just 1, you can simplify.)
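To see the step logic in isolation, here is a toy check with the 'hello' example from the question (once in row 5, twice in row 8, once in row 20), which should return {5,3,0,12}:

SELECT array_agg(CASE WHEN g = 1 THEN pos - last_pos ELSE 0 END
                 ORDER BY pos, g) AS occurrences
FROM  (
   SELECT pos, value
        , lag(pos, 1, 0) OVER (ORDER BY pos) AS last_pos
   FROM  (VALUES (5, 1), (8, 2), (20, 1)) v(pos, value)
   ) t1
 , generate_series(1, t1.value) g;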
Then I ORDER BY key, pos, g in the subquery before aggregating in the outer SELECT. This clause seems redundant, and in fact I see the same result without it in my tests. That's a collateral benefit from the window definition of lag() in the inner query, which is carried over to the next step unless any other step triggers re-ordering. Without the explicit ORDER BY, however, we would depend on an implementation detail that's not guaranteed to work.
Ordering the whole query once should be substantially faster (and easier on the required sort memory) than per-aggregate sorting. This is not strictly according to the SQL standard either, but the simple case is documented for Postgres:
Alternatively, supplying the input values from a sorted subquery will usually work. For example:
SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
But this syntax is not allowed in the SQL standard, and is not portable to other database systems.
Strictly speaking, we only need:
ORDER BY pos, g
You could experiment with that.
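For comparison, per-aggregate sorting applied to the first query would look like this, ordering inside each array_agg() instead of once in the subquery (typically more expensive):

SELECT key
     , array_agg(pos   ORDER BY pos) AS positions
     , array_agg(value ORDER BY pos) AS frequencies
FROM  (
   SELECT t.key, pos, t.value::int
   FROM  (SELECT row_number() OVER (ORDER BY id) AS pos, * FROM token_groups) g
        , each(g.tokens) t
   ) sub
GROUP  BY key
HAVING sum(value) > 10;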
Possible alternative:
SELECT key
     , ('{' || string_agg(step || repeat(',0', value - 1), ',') || '}')::int[] AS occurrences
FROM  (
   SELECT key, pos, value::int
        , (pos - lag(pos, 1, 0) OVER (PARTITION BY key ORDER BY pos))::text AS step
   FROM  (SELECT row_number() OVER (ORDER BY id)::int AS pos, * FROM token_groups) g
        , each(g.tokens) t
   ORDER  BY key, pos
   ) t1
GROUP  BY key;
-- HAVING sum(value) > 10;
Might be cheaper to use text concatenation instead of generate_series().
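Again a toy check with the 'hello' example (steps 5, 3, 12 with frequencies 1, 2, 1), which should also produce {5,3,0,12}:

SELECT ('{' || string_agg(step || repeat(',0', value - 1), ',' ORDER BY pos) || '}')::int[] AS occurrences
FROM  (VALUES (5, '5', 1), (8, '3', 2), (20, '12', 1)) v(pos, step, value);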
Upvotes: 4