Reputation: 43
I'm using Postgres to store a large number of transactions and am trying to keep the read time for a specific SELECT statement within tens of milliseconds.
Schema of TableA (> 100M rows): (userID int, itemID int), indexed on userID.
Schema of TableB (1M rows): (categoryID int, itemID int), indexed on categoryID. There are 500 categories and each itemID belongs to exactly one category.
The query I want to optimize, which currently takes ~100 ms to execute, is:
select * from TableA
where userID = x and itemID in
    (select itemID from TableB
     where categoryID = y)
A simple way to solve this would be to create a denormalised table with userID, itemID and categoryID as columns, indexed on (userID, categoryID). However, the categoryID -> itemID mapping can change, and I wanted to avoid doing a full scan of the table and updating the affected rows each time this happens.
Are there any other techniques or indexing methods to speed up this JOIN operation? Any alternative ways to arrange the data would also be appreciated. Thanks!
Edit: Adding a sample query plan.
Hash Semi Join  (cost=159.50..382.67 rows=164 width=50)
  Hash Cond: (tableA.itemId = tableB.itemId)
  ->  Index Scan using userId on tableA  (cost=0.57..208.31 rows=5185 width=50)
        Index Cond: (userId = 4000)
  ->  Hash  (cost=117.05..117.05 rows=3350 width=4)
        Buckets: 4096  Batches: 1  Memory Usage: 161kB
        ->  Index Scan using categoryId on tableB  (cost=0.42..117.05 rows=3350 width=4)
              Index Cond: (categoryId = 1002)
Planning time: 0.149 ms
Upvotes: 1
Views: 309
Reputation: 43
I found a neat way to solve this by denormalising tableA and using a Postgres foreign key with cascading updates.
Schema of TableA (> 100M rows): (userID int, itemID int, categoryID int)
Index: (userID, categoryID)
FK: (itemID, categoryID) references TableB (itemID, categoryID) ON UPDATE CASCADE ON DELETE CASCADE
Schema of TableB (1M rows): (categoryID int, itemID int)
PK: (itemID, categoryID)
All user-item pairs for a category can now be fetched with a single select on tableA. The ON UPDATE CASCADE clause of the foreign key constraint makes sure that rows in tableA are updated automatically whenever the categoryID for an item changes in tableB.
select userid, itemid from tableA where userid = x and categoryid = y
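The schema described above can be sketched as DDL (a minimal sketch using the table and column names from the question; constraint names and exact types are assumptions):

```sql
-- Category mapping; the composite PK lets the FK below reference both columns.
CREATE TABLE TableB (
    itemID     int NOT NULL,
    categoryID int NOT NULL,
    PRIMARY KEY (itemID, categoryID)
);

-- Denormalised transaction table carrying a copy of categoryID.
CREATE TABLE TableA (
    userID     int NOT NULL,
    itemID     int NOT NULL,
    categoryID int NOT NULL,
    FOREIGN KEY (itemID, categoryID)
        REFERENCES TableB (itemID, categoryID)
        ON UPDATE CASCADE   -- changing an item's category in TableB rewrites matching TableA rows
        ON DELETE CASCADE
);

-- Supports the fast lookup: where userID = x and categoryID = y
CREATE INDEX ON TableA (userID, categoryID);
```

Note that the cascade means a category change in TableB still rewrites the matching rows in TableA, but Postgres does it for you via the FK machinery rather than requiring a manual full-table update.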
Thanks for your suggestions!
Upvotes: 1
Reputation: 5940
Another approach would be to build an array of the valid itemIDs and filter by it, which avoids the JOIN operation. Depending on your data, however, it can be slower.
select * from TableA
where userID = x
and itemID = any((select array_agg(/*DISTINCT */itemID)
from TableB
where categoryID = y)::int4[])
Upvotes: 0
Reputation: 106
Maybe EXISTS will help here: Difference between EXISTS and IN
For your query:
select * from TableA a
where userID = x
and exists (select itemID from TableB b where b.categoryID = y and a.itemID = b.itemID)
Upvotes: 1