Reputation: 1
I want to scrape multiple pages of the IMDb website to collect information on the top Nigerian movies by popularity. I have successfully extracted the title, year, synopsis, genre, and certificate, but I am having trouble doing the same for the cast members and directors.
This is the main IMDb link: https://www.imdb.com/search/title/?country_of_origin=NG&start=1&ref_=adv_prv
From there, I want to go into the page of each individual movie and pull out the full list of cast members and the main directors.
For example, the first movie on the list is "The Trade". I want to go to this page: https://www.imdb.com/title/tt8803398/fullcredits/?ref_=tt_cl_sm and extract the full names of all the cast members and directors.
This is what I did to get the title, year, synopsis, genre, and certificate:
library(rvest)
library(tidyverse)
movies6 = data.frame()
for(page_result in seq(from = 1, to = 201, by = 50)){
  link = paste0("https://www.imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  page <- read_html(link)
  df <- page %>%
    html_nodes(".mode-advanced") %>%
    map_df(~list(title = html_nodes(.x, '.lister-item-header a') %>%
                   html_text() %>%
                   {if(length(.) == 0) NA else .},
                 year = html_nodes(.x, '.text-muted.unbold') %>%
                   html_text() %>%
                   {if(length(.) == 0) NA else .},
                 genre = html_nodes(.x, '.genre') %>%
                   html_text() %>%
                   {if(length(.) == 0) NA else .},
                 certificate = html_nodes(.x, '.certificate') %>%
                   html_text() %>%
                   {if(length(.) == 0) NA else .},
                 rating = html_nodes(.x, '.ratings-imdb-rating strong') %>%
                   html_text() %>%
                   {if(length(.) == 0) NA else .},
                 synopsis = html_nodes(.x, '.ratings-bar+ .text-muted') %>%
                   html_text() %>%
                   {if(length(.) == 0) NA else .}))
  movies6 = rbind(movies6, df)
  print(paste("Page:", page_result))
}
It worked well, and this was the result:
(https://i.sstatic.net/xEUpo.jpg)
Then this is what I attempted in order to get the complete cast list for each movie:
library(rvest)
library(tidyverse)
library(stringr)
get_cast = function(movie_link) {
  movie_page = read_html(movie_link)
  movie_cast = movie_page %>% html_nodes(".primary_photo+ td a") %>%
    html_text() %>% paste(collapse = ",")
  return(movie_cast)
}
movies5 = data.frame()
for(page_result in seq(from = 1, to = 151, by = 50)){
  link = paste0("https://www.imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  page <- read_html(link)
  movie_links = page %>% html_nodes(".lister-item-header a") %>%
    html_attr("href") %>%
    str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = fixed("fullcredits/?ref_=tt_cl_sm")) %>%
    paste("http://www.imdb.com", ., sep="")
  cast = sapply(movie_links, FUN = get_cast, USE.NAMES = FALSE)
  movies5 = rbind(movies5, data.frame(cast = ifelse(length(cast)==0,NA,cast)))
  print(paste("Page:", page_result))
}
But this is the result I am getting. Only the cast of the first movie on each page populates the list; the cast of the remaining 49 movies on each page is missing. I also modified the code to get the complete list of directors, but oddly it returns the cast instead, with the same issue as before.
(https://i.sstatic.net/XCMaJ.jpg)
I would really appreciate it if someone could show me how to scrape the cast and directors. I have tried many things that didn't work.
rvest web-scraping r imdb stringr web-scraping-multiple-pages
Upvotes: 0
Views: 128
Reputation: 1
I was able to solve it with the following, which worked:
library(rvest)
library(tidyverse)  # attaches stringr (str_replace) and dplyr (bind_rows)
# Scrape one full-credits page and return a one-row data frame
get_cast = function(movie_link) {
  movie_page = read_html(movie_link)
  # Cast names: links in the second column of the cast table, skipping its first row
  cast = movie_page %>%
    html_nodes(".cast_list tr:not(:first-child) td:nth-child(2) a") %>%
    html_text() %>%
    paste(collapse = ",")
  # Directors: links in the table that immediately follows the "Directed by" heading
  directors = movie_page %>%
    html_nodes("h4:contains('Directed by') + table a") %>%
    html_text() %>%
    paste(collapse = ",")
  return(data.frame(cast = cast, directors = directors))
}
movies2 = data.frame()
for(page_result in seq(from = 1, to = 951, by = 50)){
  link = paste0("https://imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  page <- read_html(link)
  # Turn each title link into the link of its full-credits page
  movie_links = page %>%
    html_nodes(".lister-item-header a") %>%
    html_attr("href") %>%
    str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = fixed("fullcredits/?ref_=tt_cl_sm")) %>%
    paste("http://www.imdb.com", ., sep="")
  # One row per movie, stacked for the whole page
  movie_data = lapply(movie_links, get_cast)
  df = bind_rows(movie_data)
  movies2 = rbind(movies2, df)
  print(paste("Page:", page_result))
}
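A likely reason the loop in the question kept only one movie per page, offered as a sketch rather than a confirmed diagnosis: `ifelse()` returns a result shaped like its test, and `length(cast)==0` is a single value, so only the first element of `cast` survives. The vector below is made-up stand-in data, not scraped output.

```r
# Stand-in for the character vector that sapply() returns for one page
# (hypothetical values; the real vector would hold 50 cast strings)
cast <- c("Cast of movie 1", "Cast of movie 2", "Cast of movie 3")

# ifelse() is vectorised over its test. The test here has length 1,
# so the result has length 1 and everything after cast[1] is dropped.
first_only <- ifelse(length(cast) == 0, NA, cast)
print(first_only)

# A plain if/else is not vectorised, so the whole vector is kept
all_rows <- if (length(cast) == 0) NA_character_ else cast
print(length(all_rows))
```

Returning a one-row data frame per movie and stacking the rows with `bind_rows()`, as in the code above, avoids the `ifelse()` call altogether.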
Upvotes: 0
Reputation: 1431
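What about this inside the second loop? It reads the names listed after "Stars:" for each result on the search page itself, so it needs no extra request per movie, though it returns only those listed stars rather than the full cast.
cast <- page %>%
  html_nodes(".lister-item-content") %>%
  html_nodes("p:nth-child(5)") %>%
  html_text() %>%
  stringr::str_remove_all("\n") %>%
  stringr::str_extract("(?<=Stars:).*") %>%
  stringr::str_squish()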
Here's what it looks like:
> head(cast, 15)
[1] "Alexander Abolore, Anthony Abraham, Toyin Abraham, Lateef Adedimeji"
[2] "Rosie Afuwape, Onyii Alex, Nancy Chibuike, Ray Emodi"
[3] "Etochi Asiegbu, Oge Asiegbu, Femi Branch, Monalisa Chinda"
[4] "Ayo Adesanya Hassan, Chinonso Arubayi, Alex Ayalogu, Rita Edward"
[5] "Melat Abera, Toba Aboyeji, Adunni Ade, Adebowale Adedayo"
[6] "Gbenga Titiloye, Elvina Ibru, Osas Ighodaro, Sharon Ooja"
[7] "Victor Agbu, Cynthia Ifeoma Amadiude, Norbert Asikhia, Roseanne Chikwendu"
[8] "Winston Ajaelo, Kerry Amadi, Tchidi Chikere, Ikenna Ezeh"
[9] "Ayenuro Ademola, Margaret Adewunmi, Rahila Ahmed, Abiola Atanda"
[10] "Osas Ighodaro, Bolanle Ninalowo, Paul Utomi, Adunni Ade"
[11] "Kelvin Boateng, Nadia Buari, Pascaline Edwards, Jason El-Agha"
[12] "Uzor Arukwe, Shalewa Ashafa, Tobi Bakre, Demi Banwo"
[13] "Mary Ann Apollo, Anita Enoyi, Joy Igbanugo, Desmond A. Ken"
[14] "Nana Abdulmalik, Bimbo Ademoye, Ajakaya Aliyah, Monsuru Amodu"
[15] "Regina Askia, Sola Fosudo, Pete Edochie, Dolly Unachukwu"
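To see what the string steps do to a single result, here is a minimal sketch on a made-up string. The names and the exact whitespace are assumptions for illustration, not text scraped from IMDb.

```r
library(stringr)

# Hypothetical text of one credits paragraph, with the line breaks
# that html_text() tends to leave in place
raw <- "\n    Director:\nJane Doe\n| \n    Stars:\nA One, \nB Two, \nC Three\n"

# Same three steps as in the pipeline: drop the newlines, keep everything
# after "Stars:" (a lookbehind, so the label itself is excluded), then
# collapse the leftover whitespace
flat  <- str_remove_all(raw, "\n")
stars <- str_squish(str_extract(flat, "(?<=Stars:).*"))
print(stars)

# Optional extra step: split the string into one name per element
star_names <- str_split(stars, ",\\s*")[[1]]
print(star_names)

# A paragraph with no "Stars:" label yields NA instead of an error
no_stars <- str_extract("Director: Jane Doe", "(?<=Stars:).*")
print(no_stars)
```

Because a missing label gives `NA`, the resulting vector stays aligned with the other columns scraped from the same page, provided every result contains the paragraph being selected.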
Upvotes: 0