KarlsFriend

Reputation: 745

Efficient select with restrictions on two big joined tables

I have two huge tables in a DB2 database. I have adapted the real names to a different domain to highlight the issue and obfuscate the real use case.

CREATE TABLE Author (
    id INTEGER PRIMARY KEY NOT NULL,
    name VARGRAPHIC(32) NOT NULL,
    country VARGRAPHIC(32) NOT NULL);

CREATE TABLE Book (
    id INTEGER PRIMARY KEY NOT NULL,
    title VARGRAPHIC(32) NOT NULL,
    authorId INTEGER NOT NULL REFERENCES Author(id),
    releaseYear INTEGER NOT NULL);

Both tables get so big that doing a full-table scan is not feasible on either of them.

Here I want to find the first 10 books released in 2017, sorted by their title:

SELECT b.title, b.authorId
FROM Book b 
WHERE b.RELEASEYEAR = 2017
ORDER BY b.TITLE
FETCH FIRST 10 ROWS ONLY

To speed it up I create an index:

CREATE INDEX itest1 ON BOOK (releaseYear, title, id, authorId);

With this index an index-only scan is performed and the query is very fast.

Now I want to restrict the result to only those books written by American authors:

SELECT b.title, b.authorId
FROM Book b 
JOIN Author a ON (a.id = b.authorId) 
WHERE b.RELEASEYEAR = 2017
AND a.COUNTRY = 'USA'
ORDER BY b.TITLE
FETCH FIRST 10 ROWS ONLY

I try to speed this query up as well by adding more indexes:

CREATE INDEX itest3 ON AUTHOR (id, country);
CREATE INDEX itest4 ON AUTHOR (country, id);

Even though index-only scans are used, the performance is nowhere near that of the previous statement.

EXPLAIN PLAN tells me that after an index scan on both tables, an expensive nested-loop join is performed. It seems that the first index is used to find books from 2017, ordered by title, and the second is used to find authors by country. But if each of those subsets is still very big, building the intersection takes a long time.
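
For reference, a plan like this can be captured with Db2's explain facility (assuming the explain tables have already been created) and then formatted with db2exfmt:

EXPLAIN PLAN FOR
SELECT b.title, b.authorId
FROM Book b
JOIN Author a ON (a.id = b.authorId)
WHERE b.RELEASEYEAR = 2017
AND a.COUNTRY = 'USA'
ORDER BY b.TITLE
FETCH FIRST 10 ROWS ONLY;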

What I would really like to do is avoid the nested loop by having a single index that spans both tables:

CREATE INDEX two_table_index ON Book b JOIN Author a (a.country, b.releaseYear, b.title)

But that does not seem to be possible. Is there another way to run the second query efficiently on very big tables?

Upvotes: 0

Views: 146

Answers (2)

Clockwork-Muse

Reputation: 13096

EXPLAIN PLAN tells me that after an index scan on both tables, an expensive nested-loop join is performed. It seems that the first index is used to find books from 2017, ordered by title, and the second is used to find authors by country. But if each of those subsets is still very big, building the intersection takes a long time.

Of course that's what it's doing; that's all the indices you have allow it to do.

The problem, essentially, is that the only index you have on Book doesn't help narrow down authors until during or after the sorting step; title is a dead column in that index. Filtering conditions should usually come first in an index, and a join condition counts as one.

Try one (or more) of the following indices:

-- (index names are placeholders)
-- More generally applicable
CREATE INDEX book_ix1 ON BOOK (authorId, releaseYear, title);
-- This one might help more for this query
CREATE INDEX book_ix2 ON BOOK (releaseYear, authorId, title);
-- Won't help at all for this query, but probably useful for others
CREATE INDEX book_ix3 ON BOOK (authorId, title);

As a side note: most of the time the optimizer can spit out recommended indices for a query; does it list any?
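
If this is Db2 for LUW, one way to get such recommendations is the Design Advisor (db2advis); a rough sketch, where MYDB is a placeholder database name:

db2advis -d MYDB -s "SELECT b.title, b.authorId FROM Book b JOIN Author a ON (a.id = b.authorId) WHERE b.releaseYear = 2017 AND a.country = 'USA' ORDER BY b.title FETCH FIRST 10 ROWS ONLY"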

Upvotes: 1

Gar

Reputation: 862

I doubt you'd get much better using only indexes. If you really want this query to be fast, you'd have to replicate the country into the Book table and then use two indexes:

Index one: AUTHOR (id) (the primary key, I guess?)

Index two: BOOK (releaseYear, authorCountry, title)
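
A rough sketch of what that denormalization could look like (column and index names are placeholders, and the copied country has to be kept in sync with Author, e.g. by a trigger or by the application):

-- Copy the author's country onto Book
ALTER TABLE Book ADD COLUMN authorCountry VARGRAPHIC(32);
UPDATE Book b
   SET authorCountry = (SELECT a.country FROM Author a WHERE a.id = b.authorId);

-- Index matching the filter and the sort order
CREATE INDEX book_year_country_title ON Book (releaseYear, authorCountry, title);

-- The join now only has to look up the few rows that already qualify on Book,
-- via the primary-key index on Author(id)
SELECT b.title, b.authorId
FROM Book b
JOIN Author a ON (a.id = b.authorId)
WHERE b.releaseYear = 2017
  AND b.authorCountry = 'USA'
ORDER BY b.title
FETCH FIRST 10 ROWS ONLY;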

Upvotes: 1
