Supek

Reputation: 49

R - help me to scrape links from webpage

I'm scraping data from an IMDb movie list. I would like to scrape the link for each movie, but I can't correctly identify where it is stored on the page.

This is how the part of the link is stored: link screenshot

What I have tried:

link<-html_nodes(strona_int, '.lister-item-header+ a href')
link<-html_text(link)

Whole code

install.packages("rvest")
install.packages("RSelenium")
library(rvest)
library(RSelenium)

#open web browser (in my case Firefox, but it can be Chrome or Internet Explorer)
rD <- rsDriver(browser=c("firefox"))
remDr <- rD[["client"]]

#set the start number for page link
ile<-seq(from=1, by=250, length.out = 1)

#empty frame for data
filmy_df=data.frame()

#loop reading the data
for (j in ile){
  #set the link for webpage
  newURL<-"https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
  startNumberURL<-paste0(newURL,j)

#open webpage
remDr$navigate(startNumberURL)

#read html code of the page
strona_int<-read_html(startNumberURL)

#rank section
rank_data<-html_nodes(strona_int,'.text-primary')
#convert rank to text
rank_data<-html_text(rank_data)
#convert to numeric
rank_data<-as.numeric(rank_data)

link<-html_nodes(strona_int, '.lister-item-header+ a href')
link<-html_text(link)

#release date
year<-html_nodes(strona_int,'.lister-item-year')
#convert to text
year<-html_text(year)
#remove non-numeric characters
year<-gsub("\\D","",year)
#set as factor
year<-as.factor(year)

#title
title_data<-html_nodes(strona_int,'.lister-item-header a')
#convert title to text
title_data<-html_text(title_data)

#temporary data frame
filmy_df_temp<-data.frame(Rank=rank_data,Title=title_data,Release.Year=year)

#temp df to target df
filmy_df<-rbind(filmy_df,filmy_df_temp)
}

#close browser
remDr$close()
#stop Selenium
rD[["server"]]$stop()

Expected solution: a scraped link for each film, which could be used later if required.

Upvotes: 0

Views: 816

Answers (1)

QHarr

Reputation: 84465

Selenium is not required for gathering the links.

The links are a tags nested inside a parent element with the class lister-item-header. Match those tags, then extract the href attribute. The href values are relative, so you need to prepend the protocol and domain, "https://www.imdb.com".

In the CSS selector:

.lister-item-header a

The dot is a class selector for the parent class; the space is a descendant combinator; the final a is a type selector for the a tags nested inside that parent.
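To see why the selector from the question returns nothing, the two selectors can be compared on a small inline fragment. This is only a sketch: the HTML below is a hypothetical fragment that imitates the structure described above, not markup copied from IMDb, and the title and href in it are made up.

```r
library(rvest)

# Hypothetical fragment imitating the structure described above (not real IMDb markup).
html <- '<div class="lister-item-content">
  <h3 class="lister-item-header">
    <span class="lister-item-index">1.</span>
    <a href="/title/tt0000001/">First Film</a>
    <span class="lister-item-year">(2001)</span>
  </h3>
  <p class="text-muted">Drama</p>
</div>'
page <- read_html(html)

# Selector from the question: "+" is the adjacent sibling combinator, and
# "href" is parsed as an element name rather than an attribute.
tried <- html_nodes(page, ".lister-item-header+ a href")
length(tried)

# Descendant combinator: every <a> nested inside .lister-item-header.
found <- html_nodes(page, ".lister-item-header a")
html_text(found)           # the visible title text
html_attr(found, "href")   # the attribute value, read with html_attr()
```

Attributes are not part of the node's text, so html_text() on the matched nodes gives the title, while html_attr() is what returns the link.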

library(rvest)
library(magrittr)

url <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
links <- read_html(url) %>% html_nodes(., ".lister-item-header a") %>% html_attr(., "href")
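Since the same nodes carry both the title text and the href, the links could be stored next to the titles, much like the temporary data frame in the question. A minimal sketch, assuming a hypothetical two-film fragment in place of the live page; the data frame name `films` and the sample titles and hrefs are invented for illustration.

```r
library(rvest)

# Hypothetical two-film fragment standing in for the downloaded page.
html <- '<div>
  <h3 class="lister-item-header"><a href="/title/tt0000001/">First Film</a></h3>
  <h3 class="lister-item-header"><a href="/title/tt0000002/">Second Film</a></h3>
</div>'
page <- read_html(html)

# Select the nodes once, then read the text and the attribute from the same
# node set so that titles and links stay aligned row by row.
nodes <- html_nodes(page, ".lister-item-header a")
films <- data.frame(
  Title = html_text(nodes),
  Link  = html_attr(nodes, "href"),
  stringsAsFactors = FALSE
)
films
```

In the question's loop, the same idea would amount to adding a Link column to filmy_df_temp, provided the selector returns one node per film.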

One way of adding protocol and domain:

library(rvest)
library(magrittr)
library(xml2)

base <- 'https://www.imdb.com'
url <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
links <- url_absolute(read_html(url) %>% html_nodes(., ".lister-item-header a") %>% html_attr(., "href"), base)
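The effect of url_absolute can be checked offline on a couple of relative paths. This is a sketch under the assumption that the scraped hrefs look like the hypothetical values below; plain paste0 is shown next to it as a simpler alternative.

```r
library(xml2)

base <- "https://www.imdb.com"

# Hypothetical relative hrefs of the kind the selector is expected to return.
hrefs <- c("/title/tt0000001/", "/title/tt0000002/?ref_=adv_li_tt")

# url_absolute() resolves each href against the base URL.
via_xml2 <- url_absolute(hrefs, base)

# Plain concatenation gives the same strings for root-relative paths.
via_paste <- paste0(base, hrefs)

via_xml2
identical(via_xml2, via_paste)
```

For root-relative paths like these the two approaches agree. url_absolute is the safer choice if some hrefs might already be absolute, since paste0 would prepend the domain to them a second time.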

Reference:

  1. https://www.rdocumentation.org/packages/xml2/versions/1.2.0/topics/url_absolute
  2. https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors

Upvotes: 1
