Reputation: 28910
I am using postgres 9.2.4.
We have a background job that imports users' emails into our system and stores them in a Postgres database table.
Below is the table:
CREATE TABLE emails
(
id serial NOT NULL,
subject text,
body text,
personal boolean,
sent_at timestamp without time zone NOT NULL,
created_at timestamp without time zone,
updated_at timestamp without time zone,
account_id integer NOT NULL,
sender_user_id integer,
sender_contact_id integer,
html text,
folder text,
draft boolean DEFAULT false,
check_for_response timestamp without time zone,
send_time timestamp without time zone,
CONSTRAINT emails_pkey PRIMARY KEY (id),
CONSTRAINT emails_account_id_fkey FOREIGN KEY (account_id)
REFERENCES accounts (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE,
CONSTRAINT emails_sender_contact_id_fkey FOREIGN KEY (sender_contact_id)
REFERENCES contacts (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE
)
WITH (
OIDS=FALSE
);
ALTER TABLE emails
OWNER TO paulcowan;
-- Index: emails_account_id_index
-- DROP INDEX emails_account_id_index;
CREATE INDEX emails_account_id_index
ON emails
USING btree
(account_id);
-- Index: emails_sender_contact_id_index
-- DROP INDEX emails_sender_contact_id_index;
CREATE INDEX emails_sender_contact_id_index
ON emails
USING btree
(sender_contact_id);
-- Index: emails_sender_user_id_index
-- DROP INDEX emails_sender_user_id_index;
CREATE INDEX emails_sender_user_id_index
ON emails
USING btree
(sender_user_id);
The query is further complicated because I have a view on this table that pulls in other data:
CREATE OR REPLACE VIEW email_graphs AS
SELECT emails.id, emails.subject, emails.body, emails.folder, emails.html,
emails.personal, emails.draft, emails.created_at, emails.updated_at,
emails.sent_at, emails.sender_contact_id, emails.sender_user_id,
emails.addresses, emails.read_by, emails.check_for_response,
emails.send_time, ts.ids AS todo_ids, cs.ids AS call_ids,
ds.ids AS deal_ids, ms.ids AS meeting_ids, c.comments, p.people,
atts.ids AS attachment_ids
FROM emails
LEFT JOIN ( SELECT todos.reference_email_id AS email_id,
array_to_json(array_agg(todos.id)) AS ids
FROM todos
GROUP BY todos.reference_email_id) ts ON ts.email_id = emails.id
LEFT JOIN ( SELECT calls.reference_email_id AS email_id,
array_to_json(array_agg(calls.id)) AS ids
FROM calls
GROUP BY calls.reference_email_id) cs ON cs.email_id = emails.id
LEFT JOIN ( SELECT deals.reference_email_id AS email_id,
array_to_json(array_agg(deals.id)) AS ids
FROM deals
GROUP BY deals.reference_email_id) ds ON ds.email_id = emails.id
LEFT JOIN ( SELECT meetings.reference_email_id AS email_id,
array_to_json(array_agg(meetings.id)) AS ids
FROM meetings
GROUP BY meetings.reference_email_id) ms ON ms.email_id = emails.id
LEFT JOIN ( SELECT comments.email_id,
array_to_json(array_agg(( SELECT row_to_json(r.*) AS row_to_json
FROM ( VALUES (comments.id,comments.text,comments.author_id,comments.created_at,comments.updated_at)) r(id, text, author_id, created_at, updated_at)))) AS comments
FROM comments
WHERE comments.email_id IS NOT NULL
GROUP BY comments.email_id) c ON c.email_id = emails.id
LEFT JOIN ( SELECT email_participants.email_id,
array_to_json(array_agg(( SELECT row_to_json(r.*) AS row_to_json
FROM ( VALUES (email_participants.user_id,email_participants.contact_id,email_participants.kind)) r(user_id, contact_id, kind)))) AS people
FROM email_participants
GROUP BY email_participants.email_id) p ON p.email_id = emails.id
LEFT JOIN ( SELECT attachments.reference_email_id AS email_id,
array_to_json(array_agg(attachments.id)) AS ids
FROM attachments
GROUP BY attachments.reference_email_id) atts ON atts.email_id = emails.id;
ALTER TABLE email_graphs
OWNER TO paulcowan;
We then run paginated queries against this view, e.g.:
SELECT "email_graphs".* FROM "email_graphs" INNER JOIN "email_participants" ON ("email_participants"."email_id" = "email_graphs"."id") WHERE (("user_id" = 75) AND ("folder" = 'INBOX')) ORDER BY "sent_at" DESC LIMIT 5 OFFSET 0
As the table has grown, queries on this table have dramatically slowed down.
If I run the paginated query with EXPLAIN ANALYZE
EXPLAIN ANALYZE SELECT "email_graphs".* FROM "email_graphs" INNER JOIN "email_participants" ON ("email_participants"."email_id" = "email_graphs"."id") WHERE (("user_id" = 75) AND ("folder" = 'INBOX')) ORDER BY "sent_at" DESC LIMIT 5 OFFSET 0;
I get this result (excerpt):
-> Seq Scan on deals (cost=0.00..9.11 rows=36 width=8) (actual time=0.003..0.044 rows=34 loops=1)
-> Sort (cost=5.36..5.43 rows=131 width=36) (actual time=0.416..0.416 rows=1 loops=1)
Sort Key: ms.email_id
Sort Method: quicksort Memory: 26kB
-> Subquery Scan on ms (cost=3.52..4.44 rows=131 width=36) (actual time=0.408..0.411 rows=1 loops=1)
-> HashAggregate (cost=3.52..4.05 rows=131 width=8) (actual time=0.406..0.408 rows=1 loops=1)
-> Seq Scan on meetings (cost=0.00..3.39 rows=131 width=8) (actual time=0.006..0.163 rows=161 loops=1)
-> Sort (cost=18.81..18.91 rows=199 width=36) (actual time=0.012..0.012 rows=0 loops=1)
Sort Key: c.email_id
Sort Method: quicksort Memory: 25kB
-> Subquery Scan on c (cost=15.90..17.29 rows=199 width=36) (actual time=0.007..0.007 rows=0 loops=1)
-> HashAggregate (cost=15.90..16.70 rows=199 width=60) (actual time=0.006..0.006 rows=0 loops=1)
-> Seq Scan on comments (cost=0.00..12.22 rows=736 width=60) (actual time=0.004..0.004 rows=0 loops=1)
Filter: (email_id IS NOT NULL)
Rows Removed by Filter: 2
SubPlan 1
-> Values Scan on "*VALUES*" (cost=0.00..0.00 rows=1 width=56) (never executed)
-> Materialize (cost=4220.14..4883.55 rows=27275 width=36) (actual time=247.720..1189.545 rows=29516 loops=1)
-> GroupAggregate (cost=4220.14..4788.09 rows=27275 width=15) (actual time=247.715..1131.787 rows=29516 loops=1)
-> Sort (cost=4220.14..4261.86 rows=83426 width=15) (actual time=247.634..339.376 rows=82632 loops=1)
Sort Key: public.email_participants.email_id
Sort Method: external sort Disk: 1760kB
-> Seq Scan on email_participants (cost=0.00..2856.28 rows=83426 width=15) (actual time=0.009..88.938 rows=82720 loops=1)
SubPlan 2
-> Values Scan on "*VALUES*" (cost=0.00..0.00 rows=1 width=40) (actual time=0.004..0.005 rows=1 loops=82631)
-> Sort (cost=2.01..2.01 rows=1 width=36) (actual time=0.074..0.077 rows=3 loops=1)
Sort Key: atts.email_id
Sort Method: quicksort Memory: 25kB
-> Subquery Scan on atts (cost=2.00..2.01 rows=1 width=36) (actual time=0.048..0.060 rows=3 loops=1)
-> HashAggregate (cost=2.00..2.01 rows=1 width=8) (actual time=0.045..0.051 rows=3 loops=1)
-> Seq Scan on attachments (cost=0.00..2.00 rows=1 width=8) (actual time=0.013..0.021 rows=5 loops=1)
-> Index Only Scan using email_participants_email_id_user_id_index on email_participants (cost=0.00..990.04 rows=269 width=4) (actual time=1.357..2.886 rows=43 loops=1)
Index Cond: (user_id = 75)
Heap Fetches: 43
Total runtime: 1642.157 ms
(75 rows)
Upvotes: 1
Views: 2659
Reputation: 61686
I think the big view is never likely to perform well and that you should break it into more manageable components, but here are two specific pieces of advice that come to mind:
Schema change
Move the text and html bodies out of the main table. Although large values are automatically stored in TOAST space, mail parts will often be smaller than the TOAST threshold (~2000 bytes), especially plain text, so it won't happen systematically.
Each non-TOASTed body inflates the table in a way that is detrimental to I/O performance and caching, given that the primary purpose of the table is to hold header fields like sender, recipients, date, subject...
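As a rough sketch of that split (the email_bodies table name and layout here are just an example, not a prescription):
CREATE TABLE email_bodies
(
  email_id integer PRIMARY KEY,
  body text,  -- plain-text part, moved out of emails
  html text,  -- html part, moved out of emails
  CONSTRAINT email_bodies_email_id_fkey FOREIGN KEY (email_id)
      REFERENCES emails (id) ON DELETE CASCADE
);
The main emails table then stays narrow for scans and pagination, and the bodies are joined in only when a single message is displayed.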
I can test this with contents I happen to have in a mail database. On a sample of the 55k mails in my inbox:
Average text/plain size: 1511 bytes.
Average text/html size: 11895 bytes (but 42395 messages have no html at all).
Size of the mail table without the bodies: 14MB (no TOAST).
With the bodies added as two more TEXT columns, as in your schema: 59MB in main storage plus 61MB in TOAST.
Despite TOAST, the main storage appears to be about 4 times larger. So when scanning the table without needing the TEXT columns, roughly 80% of the I/O is wasted. Future row updates are likely to make this worse through fragmentation.
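A quick way to compare the main heap against the total including TOAST and indexes on your own table:
SELECT pg_size_pretty(pg_relation_size('emails')) AS main_heap,
       pg_size_pretty(pg_total_relation_size('emails')) AS total_incl_toast_and_indexes;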
The effect in terms of block reads can be spotted through the pg_statio_all_tables view (compare heap_blks_read + heap_blks_hit before and after the query).
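For example:
SELECT heap_blks_read, heap_blks_hit
FROM pg_statio_all_tables
WHERE relname = 'emails';
Run it once before and once after the query and diff the numbers.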
Tuning
This part of the EXPLAIN output:
Sort Method: external sort Disk: 1760kB
suggests that your work_mem is too small. You don't want to hit the disk for such small sorts. Make it at least 10MB unless you're low on free memory. While you're at it, set shared_buffers to a reasonable value if it's still at the default. See http://wiki.postgresql.org/wiki/Performance_Optimization for more.
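For instance, to try a larger value for the current session only:
SET work_mem = '10MB';
-- shared_buffers, by contrast, can only be changed in postgresql.conf
-- and requires a server restart to take effect.
Once you've confirmed the external sort goes away in EXPLAIN ANALYZE, persist the setting in postgresql.conf.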
Upvotes: 1
Reputation: 78561
"I am definitely not looking for fixes :) or a refactored query. Any sort of high-level advice would be most welcome."
Per my comment, the gist of the issue lies in aggregates that get joined with one another. This prevents the use of indexes, and yields a bunch of merge joins (and a materialize) in your query plan…
Put another way, think of it as a plan so convoluted that Postgres proceeds by materializing temporary tables in memory and then sorting them repeatedly until they're all merge-joined as appropriate. From where I'm standing, the whole hogwash seems to amount to selecting all rows from all tables, and all of their possible relationships. Once that has been figured out and conjured up into existence, Postgres proceeds to sort the mess in order to extract the top-n rows.
Anyway, you want to rewrite the query so it can actually use indexes to begin with.
Part of this is simple. This, for instance, is a big no-no:
select …,
ts.ids AS todo_ids, cs.ids AS call_ids,
ds.ids AS deal_ids, ms.ids AS meeting_ids, c.comments, p.people,
atts.ids AS attachment_ids
Fetch the emails in one query, and fetch the related objects in separate queries with a bite-sized email_id in (…) clause, as sketched below. Merely doing that should speed things up quite a bit.
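A sketch of that two-step pattern, using your tables (the literal ids in step 2 are placeholders for whatever step 1 returns):
-- 1) Page through the emails themselves, using the plain indexes.
SELECT e.*
FROM emails e
INNER JOIN email_participants p ON p.email_id = e.id
WHERE p.user_id = 75
  AND e.folder = 'INBOX'
ORDER BY e.sent_at DESC
LIMIT 5 OFFSET 0;
-- 2) Fetch related objects only for those few ids, one query per related table.
SELECT reference_email_id, array_to_json(array_agg(id)) AS ids
FROM todos
WHERE reference_email_id IN (101, 102, 103, 104, 105)  -- ids from step 1
GROUP BY reference_email_id;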
For the rest, it may or may not be simple, and it might involve some re-engineering of your schema. I only scanned through the incomprehensible monster and its gruesome query plan, so I cannot comment for sure.
Upvotes: 1