Emmanuel Ogebe

Reputation: 1

Why is my R web scraping code for collecting all cast members and directors from the IMDb website not working?

I want to scrape multiple pages of the IMDb website to get information on the top Nigerian movies by popularity. I have successfully scraped the title, year, synopsis, genre, and certificate, but I am having trouble doing the same for the cast members and directors.

This is the main IMDb link: https://www.imdb.com/search/title/?country_of_origin=NG&start=1&ref_=adv_prv

From there, I want to go into the page of each individual movie and pull out the full list of cast members and the main directors.

For example, the first movie on the list is "The Trade". I want to go into this page: https://www.imdb.com/title/tt8803398/fullcredits/?ref_=tt_cl_sm and extract the full names of all the cast members and directors.

This is what I did to get the title, year, synopsis, genre, and certificate:

library(rvest)
library(tidyverse)

movies6 = data.frame()

for(page_result in seq(from = 1, to = 201, by = 50)){
  
  link = paste0("https://www.imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  
  page <- read_html(link)

  df <- page %>% 
  html_nodes(".mode-advanced") %>% 
  map_df(~list(title = html_nodes(.x, '.lister-item-header a') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               year = html_nodes(.x, '.text-muted.unbold') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               genre = html_nodes(.x, '.genre') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               certificate = html_nodes(.x, '.certificate') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               rating = html_nodes(.x, '.ratings-imdb-rating strong') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               synopsis = html_nodes(.x, '.ratings-bar+ .text-muted') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .}))

  movies6 = rbind(movies6, df)
  print(paste("Page:", page_result))
}

It worked well, and this was the result:

(https://i.sstatic.net/xEUpo.jpg)

This is what I then tried in order to get the complete cast list for each movie:

library(rvest)
library(tidyverse)
library(stringr)


get_cast = function(movie_link) {
  movie_page = read_html(movie_link)
  movie_cast = movie_page %>% html_nodes(".primary_photo+ td a") %>%
    html_text() %>% paste(collapse = ",")
  return(movie_cast)
}

movies5 = data.frame()

for(page_result in seq(from = 1, to = 151, by = 50)){
  
  link = paste0("https://www.imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  
  page <- read_html(link)
  
  movie_links = page %>% html_nodes(".lister-item-header a") %>%
    html_attr("href") %>%
    str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = fixed("fullcredits/?ref_=tt_cl_sm")) %>%
    paste("http://www.imdb.com", ., sep="")
  

  cast = sapply(movie_links, FUN = get_cast, USE.NAMES = FALSE)

  movies5 = rbind(movies5, data.frame(cast = ifelse(length(cast)==0,NA,cast)))


print(paste("Page:", page_result))

}

But this is the result I am getting: only the cast of the first movie on each page populates the list, and the cast of the remaining 49 movies on each page is missing. I also modified the code to get the complete list of directors, but, oddly, it returns the cast instead, with the same issue as before.

(https://i.sstatic.net/XCMaJ.jpg)

I would really appreciate it if someone could show me how to scrape the cast and directors correctly. I have tried many things and none of them worked.

Upvotes: 0

Views: 128

Answers (2)

Emmanuel Ogebe

Reputation: 1

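I was able to get this working. get_cast now returns a one-row data frame (cast and directors) for each movie, and the rows are combined with bind_rows instead of going through ifelse: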

library(rvest)
library(tidyverse)

# Return a one-row data frame with the cast and directors of one movie
get_cast = function(movie_link) {
  movie_page = read_html(movie_link)
  # Links in the second column of the cast table, skipping the first row
  cast = movie_page %>%
    html_nodes(".cast_list tr:not(:first-child) td:nth-child(2) a") %>%
    html_text() %>%
    paste(collapse = ",")
  # Links in the table that directly follows the "Directed by" heading
  directors = movie_page %>%
    html_nodes("h4:contains('Directed by') + table a") %>%
    html_text() %>%
    paste(collapse = ",")
  return(data.frame(cast = cast, directors = directors))
}

movies2 = data.frame()

for(page_result in seq(from = 1, to = 951, by = 50)){
  link = paste0("https://imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  page <- read_html(link)
  # Turn each title link into the URL of its full-credits page
  movie_links = page %>%
    html_nodes(".lister-item-header a") %>%
    html_attr("href") %>%
    str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = "fullcredits/?ref_=tt_cl_sm") %>%
    paste("http://www.imdb.com", ., sep="")
  # One data frame per movie, stacked into one data frame per page
  movie_data = lapply(movie_links, get_cast)
  df = bind_rows(movie_data)
  movies2 = rbind(movies2, df)

  print(paste("Page:", page_result))
}
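
For anyone hitting the same symptom, here is a minimal sketch of why the ifelse(length(cast)==0,NA,cast) line in the question kept only one movie per page. ifelse() returns a result with the same length as its test, and length(cast)==0 is a single TRUE/FALSE, so only the first element of cast comes back. The cast strings below are made-up placeholders, not real IMDb data:

```r
# Placeholder for the per-movie cast strings that sapply() returns for one page
cast <- c("Actor A,Actor B", "Actor C,Actor D", "Actor E")

# The test has length 1, so ifelse() returns a length-1 result:
# only the first element of `cast` survives
first_only <- ifelse(length(cast) == 0, NA, cast)

# A plain if/else is not vectorised over the test and returns the whole vector
all_cast <- if (length(cast) == 0) NA_character_ else cast

print(first_only)
print(length(all_cast))
```

Building the data frame from all_cast rather than first_only would give one row per movie, which is what the bind_rows version above achieves in a different way.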

Upvotes: 0

Russ

Reputation: 1431

What about this inside the second loop?

cast <- page %>% 
  html_nodes(".lister-item-content") %>% 
  html_nodes("p:nth-child(5)") %>% 
  html_text() %>% 
  stringr::str_remove_all("\n") %>% 
  stringr::str_extract("(?<=Stars:).*") %>%
  stringr::str_squish()

Here's what it looks like:

> head(cast, 15)
 [1] "Alexander Abolore, Anthony Abraham, Toyin Abraham, Lateef Adedimeji"      
 [2] "Rosie Afuwape, Onyii Alex, Nancy Chibuike, Ray Emodi"                     
 [3] "Etochi Asiegbu, Oge Asiegbu, Femi Branch, Monalisa Chinda"                
 [4] "Ayo Adesanya Hassan, Chinonso Arubayi, Alex Ayalogu, Rita Edward"         
 [5] "Melat Abera, Toba Aboyeji, Adunni Ade, Adebowale Adedayo"                 
 [6] "Gbenga Titiloye, Elvina Ibru, Osas Ighodaro, Sharon Ooja"                 
 [7] "Victor Agbu, Cynthia Ifeoma Amadiude, Norbert Asikhia, Roseanne Chikwendu"
 [8] "Winston Ajaelo, Kerry Amadi, Tchidi Chikere, Ikenna Ezeh"                 
 [9] "Ayenuro Ademola, Margaret Adewunmi, Rahila Ahmed, Abiola Atanda"          
[10] "Osas Ighodaro, Bolanle Ninalowo, Paul Utomi, Adunni Ade"                  
[11] "Kelvin Boateng, Nadia Buari, Pascaline Edwards, Jason El-Agha"            
[12] "Uzor Arukwe, Shalewa Ashafa, Tobi Bakre, Demi Banwo"                      
[13] "Mary Ann Apollo, Anita Enoyi, Joy Igbanugo, Desmond A. Ken"               
[14] "Nana Abdulmalik, Bimbo Ademoye, Ajakaya Aliyah, Monsuru Amodu"            
[15] "Regina Askia, Sola Fosudo, Pete Edochie, Dolly Unachukwu"       
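
To see what the string steps do without hitting the site, here is a small sketch run on a made-up paragraph. The sample text and the names in it are assumptions that only imitate the layout of a search-result entry; the three stringr calls are the same ones used above:

```r
library(stringr)

# Hypothetical text of one "p:nth-child(5)" node; not real IMDb output
sample_text <- "\n    Director:\nJane Doe\n| \n    Stars:\nActor A, \nActor B\n"

stars <- sample_text |>
  str_remove_all("\n") |>            # drop the line breaks
  str_extract("(?<=Stars:).*") |>    # keep everything after "Stars:"
  str_squish()                       # trim and collapse repeated spaces

print(stars)
```

Judging by the output above, the results page seems to list only a few names per movie, so this is a quick way to get the headline stars rather than the full cast.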

Upvotes: 0
