Reputation: 341
I have two tables:
urls (indexed pages; host is an indexed column; 30 million rows)
hosts (information about hosts; host is an indexed column; 1 million rows)
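Roughly, a simplified sketch of the schema (column types and constraints here are illustrative, not the exact definitions):
CREATE TABLE hosts (
    host    text PRIMARY KEY,  -- "host is an indexed column"
    is_spam boolean            -- NULL means the host is not known to be spam
);

CREATE TABLE urls (
    id          bigserial PRIMARY KEY,  -- assumed; the query orders by urls.id
    projects_id int NOT NULL,           -- assumed type
    host        text NOT NULL REFERENCES hosts (host),
    url         text                    -- assumed
);

CREATE INDEX urls_host_idx ON urls (host);  -- "host is an indexed column"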
One of the most frequent SELECTs in my application is:
SELECT urls.* FROM urls
JOIN hosts ON urls.host = hosts.host
WHERE urls.projects_id = ?
AND hosts.is_spam IS NULL
ORDER BY urls.id DESC LIMIT ?
In projects which have more than 100,000 rows in the urls table, the query executes very slowly.
As the tables have grown, the query has been executing slower and slower. I've read a lot about NoSQL databases (like MongoDB) which are designed to handle tables this big, but switching my database from PgSQL to MongoDB would be a big undertaking for me. For now I would like to try to optimize the PgSQL solution. Do you have any advice? What should I do?
Upvotes: 2
Views: 9456
Reputation: 657932
This query should be fast in combination with the provided indexes:
CREATE INDEX hosts_host_idx ON hosts (host)
WHERE is_spam IS NULL;
CREATE INDEX urls_projects_id_idx ON urls (projects_id, id DESC);
SELECT *
FROM urls u
WHERE u.projects_id = ?
AND EXISTS (
SELECT 1
FROM hosts h
WHERE h.host = u.host
AND h.is_spam IS NULL
)
ORDER BY u.id DESC
LIMIT ?;
The indexes are the more important ingredient. The JOIN syntax as you have it may be just as fast. Note that the first index is a partial index and the second is a multicolumn index with DESC order on the second column.
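For comparison, the original query with its syntax fixed; the planner can often produce an equally good plan for this form (a sketch using the table and column names from the question):
SELECT u.*
FROM urls u
JOIN hosts h ON h.host = u.host
WHERE u.projects_id = ?
AND h.is_spam IS NULL
ORDER BY u.id DESC
LIMIT ?;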
Much depends on the specifics of your data distribution; you will have to test (as always) with EXPLAIN ANALYZE to find out about performance and whether the indexes are used.
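Parameters have to be replaced with literal values for a plain EXPLAIN ANALYZE run; for example (123 and 50 are placeholder values):
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM urls u
WHERE u.projects_id = 123
AND EXISTS (
   SELECT 1
   FROM hosts h
   WHERE h.host = u.host
   AND h.is_spam IS NULL
   )
ORDER BY u.id DESC
LIMIT 50;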
General advice about performance optimization applies, too. You know the drill.
Upvotes: 2
Reputation: 13529
Add an index on the hosts.host column (in the hosts table, specifically; this matters) and a composite index on (urls.projects_id, urls.id), then run the ANALYZE statement to update all statistics, and observe subsecond performance regardless of the spam percentage.
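A sketch of those statements (the index names are mine):
CREATE INDEX hosts_host_idx ON hosts (host);
CREATE INDEX urls_projects_id_id_idx ON urls (projects_id, id);
ANALYZE hosts;
ANALYZE urls;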
Slightly different advice would apply if almost everything is always spam, or if the "projects", whatever they are, are few in number and each very big.
Explanation: updating the statistics makes it possible for the optimizer to recognize that the urls and hosts tables are both quite big (well, you didn't show us the schema, so we don't know your row sizes). The composite index starting with projects_id will hopefully¹ rule out most of the urls content, and its second component will immediately feed the rest of urls in the desired order, so it is quite likely that an index scan of urls will be the basis for the query plan chosen by the planner. It is then essential to have an index on hosts.host to make the hosts lookups efficient; the majority of this big table will never be accessed at all.
¹ Here we assume that projects_id is reasonably selective (that it is not the same value throughout the whole table).
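One way to check that assumption (a sketch; run it against your own data):
SELECT projects_id, count(*) AS url_count
FROM urls
GROUP BY projects_id
ORDER BY url_count DESC
LIMIT 10;
If the largest projects account for most of the 30 million rows, the composite index will rule out much less and the chosen plan may differ.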
Upvotes: 0