Reputation: 28910
I am using postgres 9.2.4.
We have a background job that imports users' emails into our system and stores them in a Postgres database table.
Below is the table:
CREATE TABLE emails
(
id serial NOT NULL,
subject text,
body text,
personal boolean,
sent_at timestamp without time zone NOT NULL,
created_at timestamp without time zone,
updated_at timestamp without time zone,
account_id integer NOT NULL,
sender_user_id integer,
sender_contact_id integer,
html text,
folder text,
draft boolean DEFAULT false,
check_for_response timestamp without time zone,
send_time timestamp without time zone,
CONSTRAINT emails_pkey PRIMARY KEY (id),
CONSTRAINT emails_account_id_fkey FOREIGN KEY (account_id)
REFERENCES accounts (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE,
CONSTRAINT emails_sender_contact_id_fkey FOREIGN KEY (sender_contact_id)
REFERENCES contacts (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE
)
WITH (
OIDS=FALSE
);
ALTER TABLE emails
OWNER TO paulcowan;
-- Index: emails_account_id_index
-- DROP INDEX emails_account_id_index;
CREATE INDEX emails_account_id_index
ON emails
USING btree
(account_id);
-- Index: emails_sender_contact_id_index
-- DROP INDEX emails_sender_contact_id_index;
CREATE INDEX emails_sender_contact_id_index
ON emails
USING btree
(sender_contact_id);
-- Index: emails_sender_user_id_index
-- DROP INDEX emails_sender_user_id_index;
CREATE INDEX emails_sender_user_id_index
ON emails
USING btree
(sender_user_id);
The query is further complicated because I have a view on this table that pulls in other data:
CREATE OR REPLACE VIEW email_graphs AS
SELECT emails.id, emails.subject, emails.body, emails.folder, emails.html,
emails.personal, emails.draft, emails.created_at, emails.updated_at,
emails.sent_at, emails.sender_contact_id, emails.sender_user_id,
emails.addresses, emails.read_by, emails.check_for_response,
emails.send_time, ts.ids AS todo_ids, cs.ids AS call_ids,
ds.ids AS deal_ids, ms.ids AS meeting_ids, c.comments, p.people,
atts.ids AS attachment_ids
FROM emails
LEFT JOIN ( SELECT todos.reference_email_id AS email_id,
array_to_json(array_agg(todos.id)) AS ids
FROM todos
GROUP BY todos.reference_email_id) ts ON ts.email_id = emails.id
LEFT JOIN ( SELECT calls.reference_email_id AS email_id,
array_to_json(array_agg(calls.id)) AS ids
FROM calls
GROUP BY calls.reference_email_id) cs ON cs.email_id = emails.id
LEFT JOIN ( SELECT deals.reference_email_id AS email_id,
array_to_json(array_agg(deals.id)) AS ids
FROM deals
GROUP BY deals.reference_email_id) ds ON ds.email_id = emails.id
LEFT JOIN ( SELECT meetings.reference_email_id AS email_id,
array_to_json(array_agg(meetings.id)) AS ids
FROM meetings
GROUP BY meetings.reference_email_id) ms ON ms.email_id = emails.id
LEFT JOIN ( SELECT comments.email_id,
array_to_json(array_agg(( SELECT row_to_json(r.*) AS row_to_json
FROM ( VALUES (comments.id,comments.text,comments.author_id,comments.created_at,comments.updated_at)) r(id, text, author_id, created_at, updated_at)))) AS comments
FROM comments
WHERE comments.email_id IS NOT NULL
GROUP BY comments.email_id) c ON c.email_id = emails.id
LEFT JOIN ( SELECT email_participants.email_id,
array_to_json(array_agg(( SELECT row_to_json(r.*) AS row_to_json
FROM ( VALUES (email_participants.user_id,email_participants.contact_id,email_participants.kind)) r(user_id, contact_id, kind)))) AS people
FROM email_participants
GROUP BY email_participants.email_id) p ON p.email_id = emails.id
LEFT JOIN ( SELECT attachments.reference_email_id AS email_id,
array_to_json(array_agg(attachments.id)) AS ids
FROM attachments
GROUP BY attachments.reference_email_id) atts ON atts.email_id = emails.id;
ALTER TABLE email_graphs
OWNER TO paulcowan;
We then run paginated queries against this view, e.g.:
SELECT "email_graphs".* FROM "email_graphs" INNER JOIN "email_participants" ON ("email_participants"."email_id" = "email_graphs"."id") WHERE (("user_id" = 75) AND ("folder" = 'INBOX')) ORDER BY "sent_at" DESC LIMIT 5 OFFSET 0
As the table has grown, queries on this table have dramatically slowed down.
If I run the paginated query with EXPLAIN ANALYZE
EXPLAIN ANALYZE SELECT "email_graphs".* FROM "email_graphs" INNER JOIN "email_participants" ON ("email_participants"."email_id" = "email_graphs"."id") WHERE (("user_id" = 75) AND ("folder" = 'INBOX')) ORDER BY "sent_at" DESC LIMIT 5 OFFSET 0;
I get this result (excerpt):
-> Seq Scan on deals (cost=0.00..9.11 rows=36 width=8) (actual time=0.003..0.044 rows=34 loops=1)
-> Sort (cost=5.36..5.43 rows=131 width=36) (actual time=0.416..0.416 rows=1 loops=1)
Sort Key: ms.email_id
Sort Method: quicksort Memory: 26kB
-> Subquery Scan on ms (cost=3.52..4.44 rows=131 width=36) (actual time=0.408..0.411 rows=1 loops=1)
-> HashAggregate (cost=3.52..4.05 rows=131 width=8) (actual time=0.406..0.408 rows=1 loops=1)
-> Seq Scan on meetings (cost=0.00..3.39 rows=131 width=8) (actual time=0.006..0.163 rows=161 loops=1)
-> Sort (cost=18.81..18.91 rows=199 width=36) (actual time=0.012..0.012 rows=0 loops=1)
Sort Key: c.email_id
Sort Method: quicksort Memory: 25kB
-> Subquery Scan on c (cost=15.90..17.29 rows=199 width=36) (actual time=0.007..0.007 rows=0 loops=1)
-> HashAggregate (cost=15.90..16.70 rows=199 width=60) (actual time=0.006..0.006 rows=0 loops=1)
-> Seq Scan on comments (cost=0.00..12.22 rows=736 width=60) (actual time=0.004..0.004 rows=0 loops=1)
Filter: (email_id IS NOT NULL)
Rows Removed by Filter: 2
SubPlan 1
-> Values Scan on "*VALUES*" (cost=0.00..0.00 rows=1 width=56) (never executed)
-> Materialize (cost=4220.14..4883.55 rows=27275 width=36) (actual time=247.720..1189.545 rows=29516 loops=1)
-> GroupAggregate (cost=4220.14..4788.09 rows=27275 width=15) (actual time=247.715..1131.787 rows=29516 loops=1)
-> Sort (cost=4220.14..4261.86 rows=83426 width=15) (actual time=247.634..339.376 rows=82632 loops=1)
Sort Key: public.email_participants.email_id
Sort Method: external sort Disk: 1760kB
-> Seq Scan on email_participants (cost=0.00..2856.28 rows=83426 width=15) (actual time=0.009..88.938 rows=82720 loops=1)
SubPlan 2
-> Values Scan on "*VALUES*" (cost=0.00..0.00 rows=1 width=40) (actual time=0.004..0.005 rows=1 loops=82631)
-> Sort (cost=2.01..2.01 rows=1 width=36) (actual time=0.074..0.077 rows=3 loops=1)
Sort Key: atts.email_id
Sort Method: quicksort Memory: 25kB
-> Subquery Scan on atts (cost=2.00..2.01 rows=1 width=36) (actual time=0.048..0.060 rows=3 loops=1)
-> HashAggregate (cost=2.00..2.01 rows=1 width=8) (actual time=0.045..0.051 rows=3 loops=1)
-> Seq Scan on attachments (cost=0.00..2.00 rows=1 width=8) (actual time=0.013..0.021 rows=5 loops=1)
-> Index Only Scan using email_participants_email_id_user_id_index on email_participants (cost=0.00..990.04 rows=269 width=4) (actual time=1.357..2.886 rows=43 loops=1)
Index Cond: (user_id = 75)
Heap Fetches: 43
Total runtime: 1642.157 ms
(75 rows)
Upvotes: 1
Views: 2659
Reputation: 61686
I think the big view is never likely to perform well and that you should break it into more manageable components, but here are two specific pieces of advice that come to mind:
Schema change
Move the text and html bodies out of the main table. Although large values are automatically stored in TOAST space, mail parts will often be smaller than the TOAST threshold (~2000 bytes), especially plain text, so it won't happen systematically.
Each non-TOASTed body inflates the table in a way that is detrimental to I/O performance and caching, given that the primary purpose of the table is to hold header fields like sender, recipients, date, subject...
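As a rough sketch of that split (the email_bodies table name and layout here are just an example, not a prescription):
CREATE TABLE email_bodies
(
  email_id integer PRIMARY KEY,
  body text,  -- plain-text part, moved out of emails
  html text,  -- html part, moved out of emails
  CONSTRAINT email_bodies_email_id_fkey FOREIGN KEY (email_id)
      REFERENCES emails (id) ON DELETE CASCADE
);
The main emails table then stays narrow for scans and pagination, and the bodies are joined in only when a single message is displayed.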
I can test this with contents I happen to have in a mail database. On a sample of the 55k mails in my inbox:
Average text/plain size: 1511 bytes.
Average text/html size: 11895 bytes (but 42395 messages have no html at all).
Size of the mail table without the bodies: 14MB (no TOAST).
With the bodies added as two more TEXT columns, as in your schema: 59MB in main storage plus 61MB in TOAST.
Despite TOAST, the main storage appears to be about 4 times larger. So when scanning the table without needing the TEXT columns, roughly 80% of the I/O is wasted. Future row updates are likely to make this worse through fragmentation.
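A quick way to compare the main heap against the total including TOAST and indexes on your own table:
SELECT pg_size_pretty(pg_relation_size('emails')) AS main_heap,
       pg_size_pretty(pg_total_relation_size('emails')) AS total_incl_toast_and_indexes;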
The effect in terms of block reads can be spotted through the pg_statio_all_tables view (compare heap_blks_read + heap_blks_hit before and after the query).
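For example:
SELECT heap_blks_read, heap_blks_hit
FROM pg_statio_all_tables
WHERE relname = 'emails';
Run it once before and once after the query and diff the numbers.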
Tuning
This part of the EXPLAIN output:
Sort Method: external sort Disk: 1760kB
suggests that your work_mem is too small. You don't want to hit the disk for such small sorts. Make it at least 10MB unless you're low on free memory. While you're at it, set shared_buffers to a reasonable value if it's still at the default. See http://wiki.postgresql.org/wiki/Performance_Optimization for more.
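For instance, to try a larger value for the current session only:
SET work_mem = '10MB';
-- shared_buffers, by contrast, can only be changed in postgresql.conf
-- and requires a server restart to take effect.
Once you've confirmed the external sort goes away in EXPLAIN ANALYZE, persist the setting in postgresql.conf.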
Upvotes: 1
Reputation: 78561
"I am definitely not looking for fixes :) or a refactored query. Any sort of high-level advice would be most welcome."
Per my comment, the gist of the issue lies in aggregates that get joined with one another. This prevents the use of indexes, and yields a bunch of merge joins (and a materialize) in your query plan…
Put another way, think of it as a plan so convoluted that Postgres proceeds by materializing temporary tables in memory and then sorting them repeatedly until they're all merge-joined as appropriate. From where I'm standing, the whole hogwash seems to amount to selecting all rows from all tables, and all of their possible relationships. Once that has been figured out and conjured up into existence, Postgres proceeds to sort the mess in order to extract the top-n rows.
Anyway, you want to rewrite the query so it can actually use indexes to begin with.
Part of this is simple. This, for instance, is a big no-no:
select …,
ts.ids AS todo_ids, cs.ids AS call_ids,
ds.ids AS deal_ids, ms.ids AS meeting_ids, c.comments, p.people,
atts.ids AS attachment_ids
Fetch the emails in one query, and fetch the related objects in separate queries with a bite-sized email_id in (…) clause, as sketched below. Merely doing that should speed things up quite a bit.
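A sketch of that two-step pattern, using your tables (the literal ids in step 2 are placeholders for whatever step 1 returns):
-- 1) Page through the emails themselves, using the plain indexes.
SELECT e.*
FROM emails e
INNER JOIN email_participants p ON p.email_id = e.id
WHERE p.user_id = 75
  AND e.folder = 'INBOX'
ORDER BY e.sent_at DESC
LIMIT 5 OFFSET 0;
-- 2) Fetch related objects only for those few ids, one query per related table.
SELECT reference_email_id, array_to_json(array_agg(id)) AS ids
FROM todos
WHERE reference_email_id IN (101, 102, 103, 104, 105)  -- ids from step 1
GROUP BY reference_email_id;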
For the rest, it may or may not be simple, and it might involve some re-engineering of your schema. I only scanned through the incomprehensible monster and its gruesome query plan, so I cannot comment for sure.
Upvotes: 1