Pajak Leon

Reputation: 341

How to handle joins between huge tables in PostgreSQL?

I have two tables: urls (indexed pages; host is an indexed column; 30 million rows) and hosts (information about hosts; host is an indexed column; 1 million rows).

One of the most frequent SELECTs in my application is:

SELECT urls.* FROM urls
JOIN hosts ON urls.host = hosts.host
WHERE urls.projects_id = ?
    AND hosts.is_spam IS NULL
ORDER BY urls.id DESC LIMIT ?

In projects which have more than 100,000 rows in the urls table the query executes very slowly.

As the tables have grown, the query has been executing slower and slower. I've read a lot about NoSQL databases (like MongoDB) which are designed to handle tables this big, but switching my database from PostgreSQL to MongoDB would be a big undertaking for me. For now I would like to try optimizing the PostgreSQL solution. Do you have any advice? What should I do?

Upvotes: 2

Views: 9456

Answers (2)

Erwin Brandstetter

Reputation: 657932

This query should be fast in combination with the provided indexes:

CREATE INDEX hosts_host_idx ON hosts (host)
WHERE is_spam IS NULL;

CREATE INDEX urls_projects_id_idx ON urls (projects_id, id DESC);

SELECT *
FROM   urls u
WHERE  u.projects_id = ?
AND    EXISTS (
    SELECT 1
    FROM   hosts h
    WHERE  h.host = u.host
    AND    h.is_spam IS NULL
    )
ORDER  BY u.id DESC
LIMIT  ?;

The indexes are the more important ingredient. The JOIN syntax as you have it may be just as fast. Note that the first index is a partial index and the second is a multicolumn index with DESC order on the second column.

Much depends on the specifics of your data distribution; you will have to test (as always) with EXPLAIN ANALYZE to find out about performance and whether the indexes are used.
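To check that, you would run the query prefixed with EXPLAIN ANALYZE; the literal values 123 and 50 below merely stand in for your actual parameters:

```sql
EXPLAIN ANALYZE
SELECT *
FROM   urls u
WHERE  u.projects_id = 123       -- placeholder for your project id
AND    EXISTS (
    SELECT 1
    FROM   hosts h
    WHERE  h.host = u.host
    AND    h.is_spam IS NULL
    )
ORDER  BY u.id DESC
LIMIT  50;                       -- placeholder for your limit
```

In the output, look for an index scan on urls using the (projects_id, id DESC) index (rather than a sequential scan plus sort), and compare the reported actual time against your current plan.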

General advice about performance optimization applies, too. You know the drill.

Upvotes: 2

Jirka Hanika

Reputation: 13529

Add an index on the hosts.host column (in the hosts table specifically; this matters) and a composite index on (urls.projects_id, urls.id), then run the ANALYZE statement to update all statistics, and you should observe subsecond performance regardless of the spam percentage.
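The steps above could look like the following; the index names are illustrative, not prescribed by the answer:

```sql
-- Index for the host lookups in the hosts table
CREATE INDEX hosts_host_idx ON hosts (host);

-- Composite index so the planner can filter by project and
-- read rows already ordered by id
CREATE INDEX urls_projects_id_id_idx ON urls (projects_id, id);

-- Refresh planner statistics for both tables
ANALYZE hosts;
ANALYZE urls;
```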

Slightly different advice would apply if almost everything is always spam, and if the "projects", whatever they are, are few in number and each very big.

Explanation: updating the statistics makes it possible for the optimizer to recognize that the urls and hosts tables are both quite big (well, you didn't show us the schema, so we don't know your row sizes). The composite index starting with projects_id will hopefully (1) rule out most of the urls content, and its second component will immediately feed the rest of urls in the desired order, so it is quite likely that an index scan of urls will be the basis for the query plan chosen by the planner. It is then essential to have an index on hosts.host to make the hosts lookups efficient; the majority of this big table will never be accessed at all.


(1) Here is where we assume that projects_id is reasonably selective (that it is not the same value throughout the whole table).

Upvotes: 0
