Reputation: 1335
I am looking to scrape a website and then, for each scraped item, scrape further information from its sub-page. As an example I'll use the IMDb website. I am using the rvest package and the SelectorGadget extension in Google Chrome.
From the IMDB site I can get the top 250 rated TV shows as follows:
library('rvest')
# URL to be scraped
url <- 'http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2'
# Read the HTML code from the website
webpage <- read_html(url)
# Use CSS selectors to pick out the title links
movies_html <- html_nodes(webpage,'.titleColumn a')
# Convert the TV show data to text
movies <- html_text(movies_html)
head(movies)
[1] "Planet Earth II" "Band of Brothers" "Planet Earth" "Game of Thrones" "Breaking Bad" "The Wire"
Each of the top 250 shows in the list is a clickable link that leads to a page with additional information on that show. For each show in movies I would also like to scrape the cast and store it in another list. For example, if you click on the second show from the top, "Band of Brothers", and scroll down, the cast consists of ~40 people, from Scott Grimes to Phil McKee.
Pseudo code of what I want to do:
for(i in movies) {
url <- 'http://www.imdb.com/chart/toptv/i'
webpage <- read_html(url)
cast_html <- html_nodes(webpage,'#titleCast .itemprop')
castList<- html_text(cast_html)
}
I am sure this is very simple but it is new to me and I don't know how to search for the right terms to find a solution.
Upvotes: 0
Views: 1696
Reputation: 7174
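If I understand you correctly, you are looking to find a way to:
1. Open the chart page and find the links to the individual show pages (main_url)
2. Get the titles of the top 250 shows (m_titles)
3. Visit those URLs (m_urls)
4. Extract the cast from each of those pages (m_cast)
Correct?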
We'll start off by defining a function that extracts the cast from a TV Show page:
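library(rvest)
getcast <- function(url){
  # Download and parse the show's page
  page <- read_html(url)
  # Select the cast entries and convert them to text
  nodes <- html_nodes(page, '#titleCast .itemprop')
  cast <- html_text(nodes)
  # Keep only the even-indexed matches; unlike seq(from=2, ...),
  # this logical index also works when fewer than two nodes are found
  cast <- cast[seq_along(cast) %% 2 == 0]
  return(cast)
}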
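To see what the selector and the every-second-element step do without hitting the live site, here is a small sketch on a hand-written HTML snippet. The markup is an assumption on my part (a cell and an inner span that both carry the itemprop class), not a copy of the real page, so treat it as an illustration only.

```r
library(rvest)

# Hypothetical markup: each name sits in a <td class="itemprop">
# that wraps a <span class="itemprop">, so the selector matches twice per name
toy_html <- paste0(
  '<html><body><table id="titleCast">',
  '<tr><td class="itemprop"><a href="/name/x1"><span class="itemprop">Scott Grimes</span></a></td></tr>',
  '<tr><td class="itemprop"><a href="/name/x2"><span class="itemprop">Phil McKee</span></a></td></tr>',
  '</table></body></html>'
)

page  <- read_html(toy_html)
nodes <- html_nodes(page, '#titleCast .itemprop')
all_matches <- html_text(nodes)

# Keep only the even-indexed matches (the inner spans in this toy document)
cast <- all_matches[seq_along(all_matches) %% 2 == 0]
```

On this toy input the selector returns four nodes, and keeping the even-indexed ones leaves one entry per person. If the real page is laid out differently, the index step needs adjusting.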
With that in place, we can work down points 1 through 4:
# Open main_url and navigate to interesting part of the page:
main_url <- "http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2"
main_page <- read_html(main_url)
movies_html <- html_nodes(main_page, '.titleColumn a')
# From the interesting part, get the titles and URLs:
m_titles <- html_text(movies_html)
sub_urls <- html_attr(movies_html, 'href')
m_urls <- paste0('http://www.imdb.com', sub_urls)
# Use `getcast()` to extract movie cast from every URL in `m_urls`
m_cast <- lapply(m_urls, getcast)
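With 250 requests in a row, a single failed page would stop the whole lapply() call. A possible way around that, sketched below, is a small wrapper that pauses between requests, returns NA for pages that fail, and names the result by show title. The name scrape_all and its arguments are my own invention for illustration; the demo uses a stand-in fetch function so it runs offline.

```r
# Hypothetical helper: apply `fetch` to every URL, pausing between calls
# and returning NA instead of stopping when one page throws an error
scrape_all <- function(urls, titles, fetch, delay = 1) {
  out <- lapply(urls, function(u) {
    Sys.sleep(delay)  # be polite to the server
    tryCatch(fetch(u), error = function(e) NA_character_)
  })
  names(out) <- titles  # so results can be looked up by show title
  out
}

# Offline demo with a stand-in for getcast(): one URL "fails" on purpose
fake_fetch <- function(u) {
  if (u == "bad") stop("page not available")
  c("Actor A", "Actor B")
}
demo <- scrape_all(c("ok", "bad"), c("Show 1", "Show 2"), fake_fetch, delay = 0)
```

With the real data this would be called as scrape_all(m_urls, m_titles, getcast), after which m_cast[["Band of Brothers"]] style lookups become possible.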
Upvotes: 1