Ms workaholic
Ms workaholic

Reputation: 393

Performance Issue with finding recent date of each group and joining to all records

I have following tables:

CREATE TABLE person (
  id INTEGER NOT NULL,
  name TEXT,
  CONSTRAINT person_pkey PRIMARY KEY(id)
);

INSERT INTO person ("id", "name")
VALUES 
  (1, E'Person1'),
  (2, E'Person2'),
  (3, E'Person3'),
  (4, E'Person4'),
  (5, E'Person5'),
  (6, E'Person6');

CREATE TABLE person_book (
  id INTEGER NOT NULL,
  person_id INTEGER,
  book_id INTEGER,
  receive_date DATE,
  expire_date DATE,
  CONSTRAINT person_book_pkey PRIMARY KEY(id)
);

/* Data for the 'person_book' table  (Records 1 - 9) */

INSERT INTO person_book ("id", "person_id", "book_id", "receive_date", "expire_date")
VALUES 
  (1, 1,  1, E'2016-01-18', NULL),
  (2, 1,  2, E'2016-02-18', E'2016-10-18'),
  (3, 1,  4, E'2016-03-18', E'2016-12-18'),
  (4, 2,  3, E'2017-02-18', NULL),
  (5, 3,  5, E'2015-02-18', E'2016-02-23'),
  (6, 4, 34, E'2016-12-18', E'2018-02-18'),
  (7, 5, 56, E'2016-12-28', NULL),
  (8, 5, 34, E'2018-01-19', E'2018-10-09'),
  (9, 5, 57, E'2018-06-09', E'2018-10-09');

CREATE TABLE book (
  id INTEGER NOT NULL,
  type TEXT,
  CONSTRAINT book_pkey PRIMARY KEY(id)
) ;

/* Data for the 'book' table  (Records 1 - 8) */

INSERT INTO book ("id", "type")
VALUES 
  ( 1, E'Btype1'),
  ( 2, E'Btype2'),
  ( 3, E'Btype3'),
  ( 4, E'Btype4'),
  ( 5, E'Btype5'),
  (34, E'Btype34'),
  (56, E'Btype56'),
  (67, E'Btype67');

My query should list name of all persons and for persons with recently received book types of (book_id IN (2, 4, 34, 56, 67)), it should display the book type and expire date; if a person hasn’t received such book type it should display blank as book type and expire date.

My query looks like this:

SELECT p.name,
   pb.expire_date,
   b.type
   FROM 
   (SELECT p.id AS person_id, MAX(pb.receive_date) recent_date
    FROM 
        Person p
        JOIN person_book pb ON pb.person_id = p.id
    WHERE pb.book_id IN (2, 4, 34, 56, 67)
    GROUP BY p.id
   )tmp 
   JOIN person_book pb ON pb.person_id = tmp.person_id
   AND tmp.recent_date = pb.receive_date AND pb.book_id IN 
   (2, 4, 34, 56, 67)
   JOIN book b ON b.id = pb.book_id           
   RIGHT JOIN Person p ON p.id = pb.person_id 

The (correct) result:

  name   | expire_date |  type
---------+-------------+---------
 Person1 | 2016-12-18  | Btype4
 Person2 |             |
 Person3 |             |
 Person4 | 2018-02-18  | Btype34
 Person5 | 2018-10-09  | Btype34
 Person6 |             |

The query works fine but since I'm right joining a small table with a huge one, it's slow. Is there any efficient way of rewriting this query?

My local PostgreSQL version is 9.3.18; but the query should work on version 8.4 as well since that's our productions version.

Upvotes: 1

Views: 45

Answers (1)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656586

Problems with your setup

My local PostgreSQL version is 9.3.18; but the query should work on version 8.4 as well since that's our productions version.

That makes two major problems before even looking at the query:

  1. Postgres 8.4 is just too old. Especially for "production". It has reached EOL in July 2014. No more security upgrades, hopelessly outdated. Urgently consider upgrading to a current version.

  2. It's a loaded footgun to use very different versions for development and production. Confusion and errors that go undetected. We have seen more than one desperate request here on SO stemming from this folly.

Better query

This equivalent should be substantially simpler and faster (works in pg 8.4, too):

SELECT p.name, pb.expire_date, b.type
FROM  (
   SELECT DISTINCT ON (person_id)
          person_id, book_id, expire_date
   FROM   person_book
   WHERE  book_id IN (2, 4, 34, 56, 67)
   ORDER  BY person_id, receive_date DESC NULLS LAST
   ) pb
JOIN   book        b ON b.id = pb.book_id
RIGHT  JOIN person p ON p.id = pb.person_id;

To optimize read performance, this partial multicolumn index with matching sort order would be perfect:

CREATE INDEX ON person_book (person_id, receive_date DESC NULLS LAST)
WHERE  book_id IN (2, 4, 34, 56, 67);

In modern Postgres versions (9.2 or later) you might append book_id, expire_date to the index columns to get index-only scans. See:

About DISTINCT ON:

About DESC NULLS LAST:

Upvotes: 1

Related Questions