Gang Liang

Reputation: 883

join of large tables with primary key

The table mydat has around 48.3M records, with this definition:

┌────────────────┬──────────────┬───────────┐
│     Column     │     Type     │ Modifiers │
├────────────────┼──────────────┼───────────┤
│ id             │ bigint       │ not null  │
│ dt             │ integer      │ not null  │
│ data           │ real         │           │
└────────────────┴──────────────┴───────────┘
Indexes:
    "mydat_pkey" PRIMARY KEY, btree (id, dt)

For each object, identified by id, there are about 40 records, denoted by the dt time field. The goal is to check the variation pattern between successive records, so the implementation joins each record with its next one (by dt) for each id. The query is as follows:

SELECT *
  FROM mydat AS dat1
  JOIN mydat AS dat2
    ON dat1.id = dat2.id
   AND dat1.dt = dat2.dt - 1;
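
For illustration only, a minimal sketch assuming the variation of interest is simply the change in data between successive records; the delta could be selected directly from the same join:

SELECT dat1.id,
       dat1.dt,
       dat2.data - dat1.data AS delta   -- change from each record to its next one
  FROM mydat AS dat1
  JOIN mydat AS dat2
    ON dat1.id = dat2.id
   AND dat1.dt = dat2.dt - 1;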

The query plan is given below. A merge join is used, and the query took forever to run. Also, the estimated row count of 247144681 is a serious over-estimate; PostgreSQL does not seem to take the uniqueness of (id, dt) into consideration.

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│                                          QUERY PLAN                                           │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Merge Join  (cost=19919125.46..25466155.03 rows=247144681 width=222)                          │
│   Merge Cond: ((dat1.id = dat2.id) AND (dat1.dt = ((dat2.dt - 1))))                           │
│   ->  Sort  (cost=9959562.73..10080389.92 rows=48330876 width=111)                            │
│         Sort Key: dat1.id, dat1.dt                                                            │
│         ->  Seq Scan on mydat dat1  (cost=0.00..982694.76 rows=48330876 width=111)            │
│   ->  Materialize  (cost=9959562.73..10201217.11 rows=48330876 width=111)                     │
│         ->  Sort  (cost=9959562.73..10080389.92 rows=48330876 width=111)                      │
│               Sort Key: dat2.id, ((dat2.dt - 1))                                              │
│               ->  Seq Scan on mydat dat2  (cost=0.00..982694.76 rows=48330876 width=111)      │
└───────────────────────────────────────────────────────────────────────────────────────────────┘

Out of curiosity, here is a naive join of mydat with itself:

SELECT *
  FROM mydat AS dat1
  JOIN mydat AS dat2
    ON dat1.id = dat2.id
   AND dat1.dt = dat2.dt;

The query plan is similar:

┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│                                          QUERY PLAN                                           │
├───────────────────────────────────────────────────────────────────────────────────────────────┤
│ Merge Join  (cost=19919125.46..27878413.41 rows=427198811 width=222)                          │
│   Merge Cond: ((dat1.id = dat2.id) AND (dat1.dt = dat2.dt))                                   │
│   ->  Sort  (cost=9959562.73..10080389.92 rows=48330876 width=111)                            │
│         Sort Key: dat1.id, dat1.dt                                                            │
│         ->  Seq Scan on act_2003q1 dat1  (cost=0.00..982694.76 rows=48330876 width=111)       │
│   ->  Materialize  (cost=9959562.73..10201217.11 rows=48330876 width=111)                     │
│         ->  Sort  (cost=9959562.73..10080389.92 rows=48330876 width=111)                      │
│               Sort Key: dat2.id, dat2.dt                                                      │
│               ->  Seq Scan on act_2003q1 dat2  (cost=0.00..982694.76 rows=48330876 width=111) │
└───────────────────────────────────────────────────────────────────────────────────────────────┘

Such a query plan puzzles me. My question is: what is the best practice for these use cases? Thanks.

The above queries were tested on both PostgreSQL 9.5.3 on Windows and PostgreSQL 9.4.6 on Linux; the results are similar.


Following @Erwin's suggestion, lag over a window function was tested, and the result is much better than the initial merge-join approach: 511524 ms to finish the query. As Erwin pointed out, the query is not exactly the same as the original one; in particular, if there are gaps in the dt field, some unwanted records would be included.
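
To keep the result set equivalent to the original join, the window-function version could be restricted to strictly successive records along these lines (a sketch; it assumes "successive" means dt values exactly one apart):

SELECT *
FROM (
   SELECT *,
          lag(data) OVER (PARTITION BY id ORDER BY dt) AS last_data,
          lag(dt)   OVER (PARTITION BY id ORDER BY dt) AS last_dt
   FROM   mydat
   ) t
WHERE  last_dt = dt - 1;   -- drops rows with no immediate predecessor (including across gaps)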

This is an example where table partitioning would be beneficial, since the dataset I work with is even larger than the example given above. The bottom line is that PostgreSQL sorts all records on disk and does not use the index at all for either query.
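
Whether the sort spills to disk can be checked (and, memory permitting, mitigated) per session; a sketch, with the work_mem value chosen only as an example:

SET work_mem = '1GB';            -- session-level sort memory; pick a value the server can afford
EXPLAIN (ANALYZE, BUFFERS)
SELECT *, lag(data) OVER (PARTITION BY id ORDER BY dt) AS last_data
FROM   mydat;
-- "Sort Method: external merge  Disk: ..." means an on-disk sort;
-- "Sort Method: quicksort  Memory: ..." means the sort fit in work_mem.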

Upvotes: 1

Views: 74

Answers (2)

Erwin Brandstetter

Reputation: 657932

If your actual goal is to have the previous value of data in each resulting row, then use a window function:

SELECT *, lag(data) OVER (PARTITION BY id ORDER BY dt) AS last_data
FROM   mydat;

The difference: rows for which there is no predecessor are still included. You may or may not want that. To exclude them, use a subquery:

SELECT *
FROM (
   SELECT *, lag(data) OVER (PARTITION BY id ORDER BY dt) AS last_data
   FROM   mydat
   ) t
WHERE  last_data IS NOT NULL;

The remaining corner case: if data can be NULL, we cannot tell the difference between a genuine NULL value and "not found". In that case, use an impossible default value (one that cannot occur in the data) for the "not found" case, as demonstrated here:

SELECT *
FROM (
   SELECT *, lag(data, 1, '-infinity') OVER (PARTITION BY id ORDER BY dt) AS last_data
   FROM   mydat
   ) t
WHERE  last_data IS DISTINCT FROM '-infinity';

Each of these queries only needs a single sequential scan.

Upvotes: 2

Hogan

Reputation: 70526

I don't know PostgreSQL that well, but this would be magic on DB2 and MS SQL:

WITH get_em AS
( SELECT dat1.id, dat1.dt
  FROM mydat AS dat1
  JOIN mydat AS dat2
    ON dat1.id = dat2.id
   AND dat1.dt = dat2.dt - 1
)
SELECT mydat.*
FROM mydat
JOIN get_em ON mydat.id = get_em.id AND mydat.dt = get_em.dt

Upvotes: 0
