john
john

Reputation: 1330

How to check another row if value exists?

This is driving me nuts. I have dumped the imdb db using imdbpy. I'm trying to find US movies that have the actor data available by the first letter of the movie.

Below is an example of a query that fetches the movies without acto data. This runs pretty quick:

SELECT DISTINCT title.id,title.title,title.production_year
FROM  title

INNER JOIN movie_info ON
(movie_info.movie_id =  title.id
AND
movie_info.info_type_id = 8
AND
movie_info.info =  'USA') 

WHERE  title LIKE  'a%'
AND  title.kind_id =  1
LIMIT 75

The cast data is stored in a separate table called cast_info and contains about 22 million records. The nr_order column contains the order of credits for actors in a movie. For example, Tom Hank would be 1 in Forrest Gump. There are typically dozens of rows for each movie_id.

So to check to see if the actor data is available, there should be at least one row that isn't null for that particular movie_id. If all the values in nr_order for a movie_id are null, it does NOT contain the data I need.

To attempt to grab this information is used the query below:

SELECT DISTINCT title.id,title.title,title.production_year
FROM  title

INNER JOIN movie_info ON
(movie_info.movie_id =  title.id
AND
movie_info.info_type_id = 8
AND
movie_info.info =  'USA') 

INNER JOIN cast_info ON 
(cast_info.movie_id = title.id
AND
cast_info.nr_order = 1)

WHERE  title LIKE  'a%'
AND  title.kind_id =  1
LIMIT 75

For some reason the query becomes very slow. It takes .3-.7 for the first query and about 6-10 seconds for the second. I added an index on cast_info.nr_order but it didn't help.

The EXPLAIN output:

+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
| id | select_type | table     | type  | possible_keys                                    | key               | key_len | ref          | rows  | Extra                       |
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
|  1 | SIMPLE      | title     | range | PRIMARY,title_idx_title,fk_kind_type_id_4        |  title_idx_title  | 257     | NULL         | 132801| Using where; Using temporary|
|  1 | SIMPLE      | movie_info| ref   | ovie_info_idx_mid,info_type_id movie_info_idx_mid| movie_info_idx_mid| 4       | imdb.title.id| 4     | Using where; Distinct       |
|  1 | SIMPLE      | table1    | ref   | cast_info_idx_mid,nr_order                       | cast_info_idx_mid | 4       | imdb.title.id| 12    | Using where; Distinct       |
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+

Any ideas would be very helpful!

EDIT: EXPLAIN from 1st query

+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
| id | select_type | table     | type  | possible_keys                                    | key               | key_len | ref          | rows  | Extra                       |
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
|  1 | SIMPLE      | title     | range | PRIMARY,title_idx_title,fk_kind_type_id_4        |  title_idx_title  | 257     | NULL         | 132801| Using where; Using temporary|
|  1 | SIMPLE      | movie_info| ref   | ovie_info_idx_mid,info_type_id movie_info_idx_mid| movie_info_idx_mid| 4       | imdb.title.id| 4     | Using where; Distinct       |
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+

Upvotes: 0

Views: 271

Answers (1)

lc.
lc.

Reputation: 116538

Since you're only concerned with whether there is or is not cast information available, you could try using EXISTS instead:

SELECT DISTINCT title.id,title.title,title.production_year
FROM  title

INNER JOIN movie_info ON
(movie_info.movie_id =  title.id
AND
movie_info.info_type_id = 8
AND
movie_info.info =  'USA') 

WHERE  title LIKE  'a%'
AND  title.kind_id =  1
AND EXISTS(SELECT 1 FROM cast_info WHERE cast_info.movie_id = title.id AND cast_info.nr_order IS NOT NULL)
LIMIT 75

I'm not sure exactly the explanation for your behavior, but the DISTINCT could be doing something funny with lots of rows on the join - or at least lots of rows on the joined product - (note the Distinct being applied to the cast_info table in the explain).

Upvotes: 1

Related Questions