Doki
Doki

Reputation: 3

Querying from IMDB Database using MySQL

I wrote a SQL query to answer the following question:

Find all the actors that made more movies with Yash Chopra than any other director in the IMBD database.

Sample schema:

person
(pid *
,name
);

m_cast
(mid *
,pid *
);

m_director
(mid*
,pid*
);

* = (component of) PRIMARY KEY

Following is my query:

WITH common_actors AS 
        (SELECT A.actor_id as actors, B.director_id as director_id, B.movies as movies_with_director,
        B.director_id as yash_chops_id, B.movies as movies_with_yash_chops FROM
        (SELECT M_Cast.PID as actor_id, M_Director.PID as director_id, COUNT(*) as movies from M_Cast
        left join M_Director  
        ON M_Cast.MID = M_Director.MID
        GROUP BY actor_id, director_id) A
        JOIN
        (SELECT M_Cast.PID as actor_id, M_Director.PID as director_id, COUNT(*) as movies from M_Cast
        left join M_Director  
        ON M_Cast.MID = M_Director.MID
        GROUP BY actor_id, director_id
        )B
        ON A.actor_id = B.actor_id
        WHERE B.director_id in (SELECT PID FROM Person WHERE Name LIKE 
        '%Yash%Chopra%'))

SELECT distinct actors as actor_id, movies_with_yash_chops as total_movies FROM common_actors
    WHERE actors NOT IN (SELECT actors FROM common_actors WHERE movies_with_director > movies_with_yash_chops)

And the result obtained from this is of length: 430 rows. However the result obtained should be of length 243 rows. Could anyone please suggest where I went wrong in my query? Is my approach right?

Sample result:

    Actor name
  0 Sharib Hashmi
  1 Kulbir Badesron
  2 Gurdas Maan
  3 Parikshat Sahni
...
242 Ramlal Shyamlal

Thanks in advance!

Upvotes: 0

Views: 1098

Answers (1)

Strawberry
Strawberry

Reputation: 33945

Consider the following:

DROP TABLE IF EXISTS person;

CREATE TABLE person
(person_id SERIAL PRIMARY KEY
,name VARCHAR(20) NOT NULL UNIQUE
);

DROP TABLE IF EXISTS movie;

CREATE TABLE movie
(movie_id SERIAL PRIMARY KEY
,title VARCHAR(50) NOT NULL UNIQUE
);

DROP TABLE IF EXISTS m_cast;

CREATE TABLE m_cast
(movie_id INT NOT NULL
,person_id INT NOT NULL
,PRIMARY KEY(movie_id,person_id)
);

DROP TABLE IF EXISTS m_director;

CREATE TABLE m_director
(movie_id INT NOT NULL
,person_id INT NOT NULL
,PRIMARY KEY(movie_id,person_id)
);

INSERT INTO person (name) VALUES
('Steven Feelberg'),
('Manly Kubrick'),
('Alfred Spatchcock'),
('Fred Pitt'),
('Raphael DiMaggio'),
('Bill Smith');

INSERT INTO movie VALUES
(1,'Feelberg\'s Movie with Fred & Raph'),
(2,'Feelberg and Fred Ride Again'),
(3,'Kubrick shoots DiMaggio'),
(4,'Kubrick\'s Movie with Bill Smith'),
(5,'Spatchcock Presents Bill Smith');

INSERT INTO m_director VALUES
(1,1),
(2,1),
(3,2),
(4,2),
(5,3);

INSERT INTO m_cast VALUES
(1,4),
(1,5),
(2,4),
(3,5),
(4,6),
(5,6);

I've included the movie table only for ease of reference. It's not relevant to the actual problem. Also, note that this model assumes that cast members are only listed once, regardless of whether or not they have multiple roles in a given film.

The following query asks 'how often have each actor and director worked together'...

An actor is any person who has been a cast member of any movie. A director is any person who has been a director of any movie.

SELECT a.name actor
     , d.name director
     , COUNT(DISTINCT ma.movie_id) total
  FROM person d
  JOIN m_director md 
    ON md.person_id = d.person_id
  JOIN person a
  LEFT
  JOIN m_cast ma 
    ON ma.person_id = a.person_id
   AND ma.movie_id = md.movie_id
  JOIN m_cast x
    ON x.person_id = a.person_id
 GROUP
    BY actor
     , director;
     
+-------------------+-------------------+-------+
| actor             | director          | total |
+-------------------+-------------------+-------+
| Fred Pitt         | Alfred Spatchcock |     0 |
| Fred Pitt         | Manly Kubrick     |     0 |
| Fred Pitt         | Steven Feelberg   |     2 |
| Raphael DiMaggio  | Alfred Spatchcock |     0 |
| Raphael DiMaggio  | Manly Kubrick     |     1 |
| Raphael DiMaggio  | Steven Feelberg   |     1 |
| Bill Smith        | Alfred Spatchcock |     1 |
| Bill Smith        | Manly Kubrick     |     1 |
| Bill Smith        | Steven Feelberg   |     0 |
+-------------------+-------------------+-------+

By observation, we can see that:

  • the only actor to work more often with Feelberg than any other director is Fred Pritt
  • Raphael DiCaprio and Bill Smith have both worked equally often with two directors (albeit different directors)

EDIT: While I'm not seriously advocating this as a solution, the following is simply to demonstrate that the kernel provided above is really all you need to solve the problem...

SELECT x.*
  FROM 
     ( SELECT a.* 
 FROM 
    ( SELECT a.name actor
           , d.name director
           , COUNT(DISTINCT ma.movie_id) total
        FROM person d
        JOIN m_director md 
          ON md.person_id = d.person_id
        JOIN person a
        LEFT
        JOIN m_cast ma 
          ON ma.person_id = a.person_id
         AND ma.movie_id = md.movie_id
        JOIN m_cast x
          ON x.person_id = a.person_id
       GROUP
          BY actor
           , director
    ) a
 LEFT
 JOIN
    ( SELECT a.name actor
           , d.name director
           , COUNT(DISTINCT ma.movie_id) total
        FROM person d
        JOIN m_director md 
          ON md.person_id = d.person_id
        JOIN person a
        LEFT
        JOIN m_cast ma 
          ON ma.person_id = a.person_id
         AND ma.movie_id = md.movie_id
        JOIN m_cast x
          ON x.person_id = a.person_id
       GROUP
          BY actor
           , director
    ) b
   ON b.actor = a.actor
  AND b.director <> a.director 
  AND b.total > a.total
WHERE b.actor IS NULL
) x
LEFT JOIN
     ( SELECT a.* 
 FROM 
    ( SELECT a.name actor
           , d.name director
           , COUNT(DISTINCT ma.movie_id) total
        FROM person d
        JOIN m_director md 
          ON md.person_id = d.person_id
        JOIN person a
        LEFT
        JOIN m_cast ma 
          ON ma.person_id = a.person_id
         AND ma.movie_id = md.movie_id
        JOIN m_cast x
          ON x.person_id = a.person_id
       GROUP
          BY actor
           , director
    ) a
 LEFT
 JOIN
    ( SELECT a.name actor
           , d.name director
           , COUNT(DISTINCT ma.movie_id) total
        FROM person d
        JOIN m_director md 
          ON md.person_id = d.person_id
        JOIN person a
        LEFT
        JOIN m_cast ma 
          ON ma.person_id = a.person_id
         AND ma.movie_id = md.movie_id
        JOIN m_cast x
          ON x.person_id = a.person_id
       GROUP
          BY actor
           , director
    ) b
   ON b.actor = a.actor
  AND b.director <> a.director 
  AND b.total > a.total
WHERE b.actor IS NULL
) y
ON y.actor = x.actor AND y.director <> x.director
WHERE y.actor IS NULL;

+-----------+-----------------+-------+
| actor     | director        | total |
+-----------+-----------------+-------+
| Fred Pitt | Steven Feelberg |     2 |
+-----------+-----------------+-------+

This returns a list of every actor, and the director with whom they've worked most often. In this case, because Bill Smith and Raphael DiMaggio have worked most often equally with two directors, they are excluded from the result.

The answer to your problem is simply to select from this list all rows with Yash Chopra listed as the director.

Upvotes: -1

Related Questions