How to optimize SQL query with WHERE IN subquery

Question

I have two tables in MySQL 5.6 for collecting event data. When an event occurs, it generates data over a certain time period. The parent table, named 'event', stores the last state of each event. The child table, named 'event_version', stores all data versions generated by any event. The schemas for these tables look like this:

CREATE TABLE `event` (
`id` BIGINT(20) NOT NULL,
`version_id` BIGINT(20), -- refers to last event_version
`version_number` BIGINT(20), -- consecutive number, increased when a new version appears
`first_event_time` TIMESTAMP(6), -- time when a set of event data was first generated;
-- it is immutable after creation
`event_time` TIMESTAMP(6), -- time when a set of event data last changed
`other_event_data` VARCHAR(30), -- more other columns
PRIMARY KEY (`id`),
INDEX `event_time` (`event_time`),
INDEX `version_id` (`version_id`),
CONSTRAINT `FK_version_id` FOREIGN KEY (`version_id`) REFERENCES `event_version` (`id`)
);




CREATE TABLE `event_version` (
`id` BIGINT(20) NOT NULL,
`event_id` BIGINT(20), -- refers to event
`version_number` BIGINT(20), -- consecutive number, increased when a new version appears
`event_time` TIMESTAMP(6) NULL DEFAULT NULL, -- time when a set of event data was generated
`other_event_data` VARCHAR(30), -- more other columns
PRIMARY KEY (`id`),
INDEX `event_time` (`event_time`), -- time when a set of event data changed
INDEX `event_id` (`event_id`),
CONSTRAINT `FK_event_id` FOREIGN KEY (`event_id`) REFERENCES `event` (`id`)
);

I want to get all event_version rows for every event that had new versions added in a selected time period. For example, there is an event with event.id=21 that appeared on 2019-04-28 and produced versions on:

2019-04-28 version_number: 1, event_version.event_id=21  
2019-04-30 version_number: 2, event_version.event_id=21  
2019-05-02 version_number: 3, event_version.event_id=21  
2019-05-04 version_number: 4, event_version.event_id=21

I want all of these records to be found when I search for the period from 2019-05-01 to 2019-06-01.

The idea is to find every event_version.event_id created in the selected period, and then all rows from event_version whose event_id is in this list. To build the list of event ids I have two candidate inner SELECT queries. The first query:

SELECT DISTINCT event_id FROM event_version WHERE event_time>='2019-05-01' AND event_time<'2019-06-01';

It takes about 10s and returns about 500 000 records.

I also have a second query, which uses the parent table and looks like this:

SELECT id FROM event WHERE (first_event_time>='2019-05-01' AND first_event_time<'2019-06-01') OR (first_event_time<'2019-05-01' AND event_time>'2019-05-01');

It takes about 7s and returns the same set of ids.

Then I use one of these subqueries in my final query:

SELECT * FROM event_version WHERE event_id IN (<subquery>);

The problem is that with the second subquery the final query takes about 8s to produce its result (about 5 million records), while producing the same result with the first subquery takes 3 minutes and 15s.

I can't understand why there is such a big difference in execution time when both subqueries produce the same list of ids. I want to use a subquery like the first one because it depends only on event_time, not on additional data from the parent table. I have more tables like these, and there I can rely only on event_time.

My question: is it possible to optimize the query so that it produces the expected result using only event_time?

Paul Spiegel · Accepted Answer

As I understand it, you want the following query to be optimized:

SELECT * 
FROM event_version
WHERE event_id IN (
  SELECT DISTINCT event_id
  FROM event_version
  WHERE event_time >= '2019-05-01'
    AND event_time <  '2019-06-01'
)

Things I would try:

Create an index on event_version(event_time, event_id). This should improve the performance of the subquery, because the index covers it: event_id can be read straight from the index, avoiding a second lookup into the table rows. The overall performance will probably be similar, though. The reason is that WHERE IN (...) tends to be slow (at least in older versions) when the subquery returns a lot of rows.
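
A minimal sketch of that index. The name idx_event_time_event_id is an assumption chosen for illustration, not something from the question. The column order matters: event_time comes first so the range condition can use the index, and event_id second so the subquery can be answered from the index alone.

```sql
-- Hypothetical index name. event_time leads so the range filter can use it;
-- event_id follows so the subquery is covered by the index.
CREATE INDEX idx_event_time_event_id
  ON event_version (event_time, event_id);
```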

Try a JOIN with your subquery as derived table:

SELECT * 
FROM (
  SELECT DISTINCT event_id
  FROM event_version
  WHERE event_time >= '2019-05-01'
    AND event_time <  '2019-06-01'
) s
JOIN event_version USING(event_id)

Check whether the index mentioned above helps here.
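
One way to check is to run EXPLAIN on the join before and after adding the index and compare the plans. This is only a sketch; the actual plan depends on the data distribution and the server version.

```sql
-- Prefix the derived-table join with EXPLAIN to see the chosen plan.
EXPLAIN
SELECT *
FROM (
  SELECT DISTINCT event_id
  FROM event_version
  WHERE event_time >= '2019-05-01'
    AND event_time <  '2019-06-01'
) s
JOIN event_version USING(event_id);
```

If the composite index is used as a covering index, the plan row for the inner SELECT on event_version typically shows "Using index" in the Extra column.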

Try an EXISTS subquery:

SELECT v.*
FROM event e
JOIN event_version v ON v.event_id = e.id
WHERE EXISTS (
  SELECT *
  FROM event_version v1
  WHERE v1.event_id = e.id
    AND v1.event_time >= '2019-05-01'
    AND v1.event_time <  '2019-06-01'
)

Here you would need an index on event_version(event_id, event_time), though the performance might be even worse. I would bet on the derived table join solution.
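
A sketch of the index for the EXISTS variant, again under an assumed name (idx_event_id_event_time is hypothetical). Here event_id leads because the correlated subquery first matches v1.event_id = e.id and only then checks the event_time range.

```sql
-- Hypothetical index name. event_id leads to serve the correlation
-- v1.event_id = e.id; event_time follows for the range check.
CREATE INDEX idx_event_id_event_time
  ON event_version (event_id, event_time);
```

Because event_id is the leftmost column, this index can also serve lookups that currently use the single-column event_id index.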

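My guess as to why your second query runs faster is that the optimizer is able to convert the IN condition to a JOIN, because the returned column is the primary key of the event table. Since a primary key is unique, each event_version row can match at most one row of the subquery, so no deduplication step is needed.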
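
To illustrate that guess, this is the plain JOIN that the IN query with the second subquery is logically equivalent to. It is a sketch of the rewritten form, not a claim about what the optimizer produces internally. The equivalence holds only because event.id is unique; the same rewrite applied to the first subquery would duplicate rows, since event_id is not unique in event_version.

```sql
-- Equivalent to: SELECT * FROM event_version WHERE event_id IN (<second subquery>)
-- Valid as a plain JOIN only because event.id is the primary key (unique).
SELECT v.*
FROM event e
JOIN event_version v ON v.event_id = e.id
WHERE (e.first_event_time >= '2019-05-01' AND e.first_event_time < '2019-06-01')
   OR (e.first_event_time <  '2019-05-01' AND e.event_time > '2019-05-01');
```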
