Pajak Leon

Reputation: 341

How to handle joins between huge tables in PostgreSQL?

I have two tables: urls (indexed pages; host is an indexed column; 30 million rows) and hosts (information about hosts; host is an indexed column; 1 million rows).

One of the most frequent SELECTs in my application is:

SELECT urls.* FROM urls
JOIN hosts ON urls.host = hosts.host
WHERE urls.projects_id = ?
    AND hosts.is_spam IS NULL
ORDER BY urls.id DESC LIMIT ?

In projects which have more than 100,000 rows in the urls table the query executes very slowly.

As the tables have grown, the query has been executing slower and slower. I've read a lot about NoSQL databases (like MongoDB) which are designed to handle tables this big, but switching my database from PostgreSQL to MongoDB would be a big undertaking for me. For now I would like to try optimizing the PostgreSQL solution. Do you have any advice? What should I do?

Upvotes: 2

Views: 9456

Answers (2)

Erwin Brandstetter

Reputation: 657932

This query should be fast in combination with the provided indexes:

CREATE INDEX hosts_host_idx ON hosts (host)
WHERE is_spam IS NULL;

CREATE INDEX urls_projects_id_idx ON urls (projects_id, id DESC);

SELECT *
FROM   urls u
WHERE  u.projects_id = ?
AND    EXISTS (
    SELECT 1
    FROM   hosts h
    WHERE  h.host = u.host
    AND    h.is_spam IS NULL
    )
ORDER  BY u.id DESC
LIMIT  ?;

The indexes are the more important ingredient. The JOIN syntax as you have it may be just as fast. Note that the first index is a partial index and the second is a multicolumn index with DESC order on the second column.

Much depends on the specifics of your data distribution; you will have to test (as always) with EXPLAIN ANALYZE to find out about performance and whether the indexes are used.
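To check that, you would run the query prefixed with EXPLAIN ANALYZE; the literal values 123 and 50 below merely stand in for your actual parameters:

```sql
EXPLAIN ANALYZE
SELECT *
FROM   urls u
WHERE  u.projects_id = 123       -- placeholder for your project id
AND    EXISTS (
    SELECT 1
    FROM   hosts h
    WHERE  h.host = u.host
    AND    h.is_spam IS NULL
    )
ORDER  BY u.id DESC
LIMIT  50;                       -- placeholder for your limit
```

In the output, look for an index scan on urls using the (projects_id, id DESC) index (rather than a sequential scan plus sort), and compare the reported actual time against your current plan.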

General advice about performance optimization applies, too. You know the drill.

Upvotes: 2

Jirka Hanika

Reputation: 13529

Add an index on the hosts.host column (in the hosts table specifically; this matters) and a composite index on (urls.projects_id, urls.id), then run the ANALYZE statement to update all statistics, and you should observe subsecond performance regardless of the spam percentage.
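The steps above could look like the following; the index names are illustrative, not prescribed by the answer:

```sql
-- Index for the host lookups in the hosts table
CREATE INDEX hosts_host_idx ON hosts (host);

-- Composite index so the planner can filter by project and
-- read rows already ordered by id
CREATE INDEX urls_projects_id_id_idx ON urls (projects_id, id);

-- Refresh planner statistics for both tables
ANALYZE hosts;
ANALYZE urls;
```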

Slightly different advice would apply if almost everything is always spam, and if the "projects", whatever they are, are few in number and each very big.

Explanation: updating the statistics makes it possible for the optimizer to recognize that the urls and hosts tables are both quite big (well, you didn't show us the schema, so we don't know your row sizes). The composite index starting with projects_id will hopefully (1) rule out most of the urls content, and its second component will immediately feed the rest of urls in the desired order, so it is quite likely that an index scan of urls will be the basis for the query plan chosen by the planner. It is then essential to have an index on hosts.host to make the hosts lookups efficient; the majority of this big table will never be accessed at all.


(1) Here is where we assume that projects_id is reasonably selective (that it is not the same value throughout the whole table).

Upvotes: 0
