Audree Steinberg
Audree Steinberg

Reputation: 23

How can I optimize Postgresql ARRAY_AGG queries for large tables?

I am using PostgreSQL for its array functionality. Here's my schema:

CREATE TABLE questions (
  id INTEGER PRIMARY KEY,
  product_id INTEGER UNIQUE NOT NULL,
  body VARCHAR(1000) NOT NULL,
  date_written DATE NOT NULL DEFAULT current_date,
  asker_name VARCHAR(60) NOT NULL,
  asker_email VARCHAR(60) NOT NULL,
  reported BOOLEAN DEFAULT FALSE,
  helpful INTEGER NOT NULL DEFAULT 0
);

CREATE TABLE answers (
  id PRIMARY KEY NOT NULL,
  question_id INTEGER NOT NULL,
  body VARCHAR(1000) NOT NULL,
  date_written DATE NOT NULL DEFAULT current_date,
  answerer_name VARCHAR(60) NOT NULL,
  answerer_email VARCHAR(60) NOT NULL,
  reported BOOLEAN DEFAULT FALSE,
  helpful INTEGER NOT NULL DEFAULT 0
);

CREATE TABLE photos (
  id INTEGER UNIQUE,
  answer_id INTEGER NOT NULL,
  photo VARCHAR(200)
);

I am trying to query my answers table to get a list of all the answers for a given question id and include an array of all photos that exist for that given answer_id. The results should be sorted in descending order of helpfulness. So far, I have a massive query that displays the results I'm looking for, but the execution time is 729.595 ms. I am trying to optimize to get the query's time down to 200 ms. I have the following indexes to try and optimize my query times:

  indexname    |                                 indexdef                                  
-----------------+---------------------------------------------------------------------------
 answer_id       | CREATE UNIQUE INDEX answer_id ON public.answers USING btree (id)
 question_id     | CREATE INDEX question_id ON public.answers USING btree (question_id)
 idx_reported_id | CREATE INDEX idx_reported_id ON public.answers USING btree (reported, id)
 answers_pkey    | CREATE UNIQUE INDEX answers_pkey ON public.answers USING btree (id)
  indexname    |                                  indexdef                                  
----------------+----------------------------------------------------------------------------
 id             | CREATE UNIQUE INDEX id ON public.questions USING btree (id)
 idx_q_reported | CREATE INDEX idx_q_reported ON public.questions USING btree (id, reported)
 questions_pkey | CREATE UNIQUE INDEX questions_pkey ON public.questions USING btree (id)
   indexname   |                              indexdef                               
---------------+---------------------------------------------------------------------
 photos_id_key | CREATE UNIQUE INDEX photos_id_key ON public.photos USING btree (id)
 p_links       | CREATE INDEX p_links ON public.photos USING btree (photo)

In my analysis, I noticed that the GroupAggregate is time-consuming: GroupAggregate (cost=126222.21..126222.71 rows=25 width=129) (actual time=729.497..729.506 rows=5 loops=1) Group Key: answers.id

Is there a way I can avoid the time-consuming GROUP BY? Am I missing something with the indexes? Here is the query itself:

SELECT answers.id, 
       question_id, 
       body, 
       date_written, 
       answerer_name, 
       answerer_email, 
       reported, 
       helpful, 
       ARRAY_AGG(photo) as photos 
FROM answers 
  LEFT JOIN photos ON answers.id = photos.answer_id 
WHERE reported IS 
false AND answers.id IN (SELECT id 
                         FROM answers 
                         WHERE question_id = 20012) 
GROUP BY answers.id 
ORDER BY helpful DESC;

Thanks!

Upvotes: 1

Views: 802

Answers (2)

user330315
user330315

Reputation:

One way that often works, is to aggregate first, then join on the result (rather than aggregating the full result). And you don't really need the IN condition either

SELECT a.id, 
       a.question_id, 
       a.body, 
       a.date_written, 
       a.answerer_name, 
       a.answerer_email, 
       a.reported, 
       a.helpful, 
       p.photos 
FROM answers a
  LEFT JOIN (
    select answer_id, array_agg(photo) as photos 
    from photos
    group by answer_id
  ) p ON a.id = p.answer_id 
WHERE reported IS false 
  AND a.question_id = 20012
ORDER BY a.helpful DESC;

Upvotes: 1

nachospiu
nachospiu

Reputation: 2039

I think you can skip the subquery:

SELECT answers.id, question_id, body, date_written, answerer_name, answerer_email, reported, helpful, ARRAY_AGG(photo) as photos 
FROM answers 
LEFT JOIN photos ON answers.id = photos.answer_id 
WHERE reported IS false AND question_id = 20012 
GROUP BY answers.id, question_id, body, date_written, answerer_name, answerer_email, reported, helpful
ORDER BY helpful DESC;

You can add a btree index on photos.answer_id because this field is use in the join clause.

You losed same fields on the GROUP BY clause;

Upvotes: 1

Related Questions