Reputation: 1330
This is driving me nuts. I have dumped the imdb db using imdbpy. I'm trying to find US movies that have the actor data available by the first letter of the movie.
Below is an example of a query that fetches the movies without acto data. This runs pretty quick:
SELECT DISTINCT title.id,title.title,title.production_year
FROM title
INNER JOIN movie_info ON
(movie_info.movie_id = title.id
AND
movie_info.info_type_id = 8
AND
movie_info.info = 'USA')
WHERE title LIKE 'a%'
AND title.kind_id = 1
LIMIT 75
The cast data is stored in a separate table called cast_info
and contains about 22 million records. The nr_order
column contains the order of credits for actors in a movie. For example, Tom Hank would be 1 in Forrest Gump. There are typically dozens of rows for each movie_id
.
So to check to see if the actor data is available, there should be at least one row that isn't null for that particular movie_id
. If all the values in nr_order
for a movie_id
are null, it does NOT contain the data I need.
To attempt to grab this information is used the query below:
SELECT DISTINCT title.id,title.title,title.production_year
FROM title
INNER JOIN movie_info ON
(movie_info.movie_id = title.id
AND
movie_info.info_type_id = 8
AND
movie_info.info = 'USA')
INNER JOIN cast_info ON
(cast_info.movie_id = title.id
AND
cast_info.nr_order = 1)
WHERE title LIKE 'a%'
AND title.kind_id = 1
LIMIT 75
For some reason the query becomes very slow. It takes .3-.7 for the first query and about 6-10 seconds for the second. I added an index on cast_info
.nr_order
but it didn't help.
The EXPLAIN output:
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
| 1 | SIMPLE | title | range | PRIMARY,title_idx_title,fk_kind_type_id_4 | title_idx_title | 257 | NULL | 132801| Using where; Using temporary|
| 1 | SIMPLE | movie_info| ref | ovie_info_idx_mid,info_type_id movie_info_idx_mid| movie_info_idx_mid| 4 | imdb.title.id| 4 | Using where; Distinct |
| 1 | SIMPLE | table1 | ref | cast_info_idx_mid,nr_order | cast_info_idx_mid | 4 | imdb.title.id| 12 | Using where; Distinct |
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
Any ideas would be very helpful!
EDIT: EXPLAIN from 1st query
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
| 1 | SIMPLE | title | range | PRIMARY,title_idx_title,fk_kind_type_id_4 | title_idx_title | 257 | NULL | 132801| Using where; Using temporary|
| 1 | SIMPLE | movie_info| ref | ovie_info_idx_mid,info_type_id movie_info_idx_mid| movie_info_idx_mid| 4 | imdb.title.id| 4 | Using where; Distinct |
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
Upvotes: 0
Views: 271
Reputation: 116538
Since you're only concerned with whether there is or is not cast information available, you could try using EXISTS
instead:
SELECT DISTINCT title.id,title.title,title.production_year
FROM title
INNER JOIN movie_info ON
(movie_info.movie_id = title.id
AND
movie_info.info_type_id = 8
AND
movie_info.info = 'USA')
WHERE title LIKE 'a%'
AND title.kind_id = 1
AND EXISTS(SELECT 1 FROM cast_info WHERE cast_info.movie_id = title.id AND cast_info.nr_order IS NOT NULL)
LIMIT 75
I'm not sure exactly the explanation for your behavior, but the DISTINCT
could be doing something funny with lots of rows on the join - or at least lots of rows on the joined product - (note the Distinct being applied to the cast_info table in the explain).
Upvotes: 1