Jens
Jens

Reputation: 1397

Postgres: SELECT where a list of JOINed subtables contains specific data

In our project we have a table events each of which can have multiple dates. This means each row in dates has a event_id and a position where position (starting at 0) is unique within the same event_id. dates also (obviously) has a date column.

Now I want to select all events which have a specific list of date child values (ordered by position). Is this possible within a single SQL query?

Example Tables

events:  (id, name, event_type, description, comment, user_id)
dates:   (id, event_id, position, date_at, latitude, longitude, location)

id, position and lat/long columns are integer, date_at is a DATE, the rest are strings (character varying) which don't really matter here. For example:

INSERT INTO events (id, name) VALUES (1, 'Birthday Party');
INSERT INTO dates (id, event_id, position, date_at) VALUES ( 1, 1, 0, '2014-06-30');
INSERT INTO dates (id, event_id, position, date_at) VALUES ( 2, 1, 1, '2015-06-30');
INSERT INTO dates (id, event_id, position, date_at) VALUES ( 3, 1, 2, '2016-06-30');
INSERT INTO dates (id, event_id, position, date_at) VALUES ( 4, 1, 3, '2017-06-30');
INSERT INTO dates (id, event_id, position, date_at) VALUES ( 5, 1, 4, '2018-06-30');
INSERT INTO events (id, name) VALUES (2, 'S.O. Birthday Party');
INSERT INTO dates (id, event_id, position, date_at) VALUES ( 6, 2, 0, '2014-02-11');
INSERT INTO dates (id, event_id, position, date_at) VALUES ( 7, 2, 1, '2015-02-11');
INSERT INTO dates (id, event_id, position, date_at) VALUES ( 8, 2, 2, '2016-02-11');
INSERT INTO dates (id, event_id, position, date_at) VALUES ( 9, 2, 3, '2017-02-11');
INSERT INTO dates (id, event_id, position, date_at) VALUES (10, 2, 4, '2018-02-11');

I need a query that returns Event #1 when I feed it this array of dates: [2014-06-30, 2015-06-30, 2016-06-30, 2017-06-30, 2018-06-30]. The whole query string array will be automatically generated, so it can have any format and can be hardcoded in this example (we use Ruby on Rails). The list of possible events will have other restrictions (user_id, event_type, etc) so it won't be a long list (perhaps 10..100 events), so it need not be an indexable query.

It should not return Event #1 when fed these dates in another order or other dates or a subset of these dates. It should accept any number of dates as input, including an empty set.

Upvotes: 0

Views: 1241

Answers (1)

MatBailie
MatBailie

Reputation: 86765

I'm going to assume for the moment that you have created an additional table in which to store you list of specific dates

SELECT
  event.*
FROM
  events
INNER JOIN
(
  SELECT
    d.event_id
  FROM
    dates           AS d
  INNER JOIN
    list_of_dates   AS l
      ON l.date = d.date
  GROUP BY
    d.event_id
  HAVING
        COUNT(*)        = (SELECT COUNT(*) FROM list_of_dates)
    AND MAX(d.position) = MIN(d.position) + COUNT(*) - 1
)
  AS list
    ON list.event_id = events.id

This joins the dates table to the list of dates that you have.

It then groups the results by event_id.

The HAVING clause then only allows through an event_id if the returned number of dates matches the number in the list. (If I join 5 dates, I want only the event_ids that include all 5 of those dates).

The HAVING clause then ensure that all of those dates are sequential. If I have a list of 5 dates, the MAX(position) should be 4 more than the MIN(position)


I'm not actually sure if that last criteria is what you wanted, but hopefully it will give you a structure to work with for the criteria that you do need.

Also, note that this will require a full table scan. This type of multi-row search is always slow in SQL. (Academically, it's the same as searching an Entity-Attribute-Value table.)


EDIT: In response to question edit.

If hard-coding the dates into the query, you could do the following, but you must be crucially aware of the possibility of SQL Injection Attacks if any part of the query is controlled by the user.

SELECT
  event.*
FROM
  events
INNER JOIN
(
  SELECT
    event_id
  FROM
    dates           AS d
  WHERE
    date IN ('2014-06-30', '2015-06-30', '2016-06-30', '2017-06-30', '2018-06-30')
  GROUP BY
    event_id
  HAVING
        COUNT(*)        = 5
    AND MAX(d.position) = MIN(d.position) + 5 - 1
)
  AS list
    ON list.event_id = events.id

You would hard code both the list of dates and the two occurrences of the 5.


With regards to you requirement of so it need not be an indexable query, you will find that this is in fact impossible, to a degree. I'll try to explain why...

There is no mechanism to interrogate multiple rows together and at the same time. You do get simultaneously, but also individually. What I mean by that is that you can check is the date on this row in my list of dates? What you can't do is to say are all of these dates in my list.

The query above checks each row one at a time, is this date in my list, then calculates how many matched and were they all in order in the HAVING clause. If any event has any one of those dates you still need to process that date, then process the HAVING clause and only then realise that not all of the dates were present.

Each individual is this date in my list can be indexable, but the final HAVING clause will be a full scan of those matches. The result is that this does not scale particularly well at all. And there is very little that can be done about that.

Depending on the statistical structure of your data, you may be able to make minor, and quite esoteric, optimisations. For example...

INSERT INTO
  dates (id, event_id, first_position, date0, date1, date2, date3, date4, date5)
VALUES
  ( 1, 1, 0, '2014-06-30', '2015-06-30', '2016-06-30', '2017-06-30', '2018-06-30', NULL ),
  ( 1, 1, 1, '2015-06-30', '2016-06-30', '2017-06-30', '2018-06-30', NULL, NULL ),
  ( 1, 1, 2, '2016-06-30', '2017-06-30', '2018-06-30', NULL, NULL, NULL ),
  ( 1, 1, 3, '2017-06-30', '2018-06-30', NULL, NULL, NULL, NULL ),
  ( 1, 1, 4, '2018-06-30', NULL, NULL, NULL, NULL, NULL )
;

SELECT
  events.*
FROM
  events
INNER JOIN
  dates
    ON dates.event_id = events.id
WHERE
      dates.date0 = '2015-06-30'
  AND dates.date1 = '2016-06-30'
  AND dates.date2 = '2017-06-30'
;

Now, you could have an index on (date0, event_id) and quickly find all the events that have '2015-06-30' as one of the dates. You would also only get back one row from the dates table, and can quickly check if all of the other dates are also present.

It's dirty though. If an event could have 100 associated dates, you'd need 100 date fields. That also makes in inflexible (you might provision 128 dates, then in the future find you need 129). It also makes writing queries against it a bit messy (you're not passing parameters, you're writing a WHERE clause).

The index seek on date0 is still going to return rows that are later discarded by the remained of the where clause, but it's very likely that you'll get fewer of these cases than by using the HAVING clause method; so it should be faster. Unless you have so many columns that the extra i/o of reading so many redundant columns counters your savings. Over-all I'd still expect this to be measurably faster, and yet I'd still be very apprehensive of trying it.

Unfortunately the concept of searching multiple rows within grouped rows is always going to be messy and/or costly to execute.

Upvotes: 1

Related Questions