flee

Reputation: 1335

Web scraping multiple levels of a website

I am looking to scrape a website and then, for each scraped item, scrape further information from its sub-page. As an example I'll use the IMDB website. I am using the rvest package and the SelectorGadget extension in Google Chrome.

From the IMDB site I can get the top 250 rated TV shows as follows:

library('rvest')

# URL to be scraped
url <- 'http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2'

# Read the HTML code from the website
webpage <- read_html(url)

# Use CSS selectors to scrape the title links
movies_html <- html_nodes(webpage, '.titleColumn a')

# Convert the TV show data to text
movies <- html_text(movies_html)

head(movies)
[1] "Planet Earth II"  "Band of Brothers" "Planet Earth"     "Game of Thrones"  "Breaking Bad"     "The Wire"

Each of the top 250 shows in the list is a clickable link that gives additional information on that show. In this case, for each entry in movies I would also like to scrape the cast and store it in another list. For example, if you click on the second entry, "Band of Brothers", and scroll down, the cast consists of ~40 people, from Scott Grimes to Phil McKee.

Pseudo code of what I want to do:

castList <- list()
for (i in movies) {
  # pseudo URL -- I don't know how to build the real link for each show
  url <- paste0('http://www.imdb.com/chart/toptv/', i)
  webpage <- read_html(url)
  cast_html <- html_nodes(webpage, '#titleCast .itemprop')
  castList[[i]] <- html_text(cast_html)
}

I am sure this is very simple but it is new to me and I don't know how to search for the right terms to find a solution.

Upvotes: 0

Views: 1696

Answers (1)

KenHBS

Reputation: 7174

If I understand you correctly, you are looking to find a way to

  1. Identify the URLs of the show pages from the top 250 (main_url)
  2. Get the titles of the top 250 shows (m_titles)
  3. Visit those URLs (m_urls)
  4. Extract the cast of those TV shows (m_cast)

Correct?

We'll start off by defining a function that extracts the cast from a TV Show page:

getcast <- function(url){
  page  <- read_html(url)
  nodes <- html_nodes(page, '#titleCast .itemprop')
  cast  <- html_text(nodes)

  # The selector matches two elements per cast member, so keep
  # every second entry to retain only the clean actor names:
  inds <- seq(from = 2, to = length(cast), by = 2)
  cast <- cast[inds]
  return(cast)
}
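To see why the even-index filter works, here is a minimal sketch with made-up values: on the title page, the `'#titleCast .itemprop'` selector matches each actor name twice (once for the enclosing table cell and once for the name span inside it), so keeping every second element drops the duplicates. The vector below is illustrative, not actual IMDB output:

```r
# Made-up example of what html_text() might return for the cast selector,
# with each name appearing twice:
cast <- c("Scott Grimes", "Scott Grimes", "Damian Lewis", "Damian Lewis")

inds <- seq(from = 2, to = length(cast), by = 2)
cast[inds]
# [1] "Scott Grimes" "Damian Lewis"
```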

With that in place, we can work down points 1 through 4:

# Open main_url and navigate to interesting part of the page:
main_url <- "http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2"

main_page <- read_html(main_url)
movies_html <- html_nodes(main_page, '.titleColumn a')

# From the interesting part, get the titles and URLs:
m_titles <- html_text(movies_html)

sub_urls <- html_attr(movies_html, 'href')
m_urls <- paste0('http://www.imdb.com', sub_urls)

# Use `getcast()` to extract movie cast from every URL in `m_urls`
m_cast <- lapply(m_urls, getcast)
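Since `lapply()` returns an unnamed list, it can help to label each cast vector with its show title so individual shows can be looked up by name. A small sketch, assuming `m_titles` and `m_cast` from the steps above:

```r
# Label each cast vector with its show title
# (m_cast and m_titles are assumed to come from the code above):
names(m_cast) <- m_titles

# A single show's cast can now be looked up by name, e.g.:
# m_cast[["Band of Brothers"]]
```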

Upvotes: 1
