R Webscraping with dynamic tables

Question

For my first exercise in webscraping in R, I am trying to figure out how to search through opera tickets being sold to eventually find the best deal. I would like to do two things:

1. Create a table of categories and prices to be able to search for the best price in any category.
2. Save a link to the URL of the best deal (price per category).

The problem I ran into is that I can only see 15 observations, but the table can potentially be much larger.

library(rvest)
# Strip tab and newline characters from scraped text
rmSpace <- function(x) {
  x <- gsub("\t", "", x)
  x <- gsub("\n", "", x)
  x
}

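# `url` holds the address of the ticket listing page
# read_html() replaces html(), which is deprecated in rvest
page <- read_html(url)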

date <- page %>%
   html_nodes(".date-tabdyn") %>%
   html_text() 
date <- date[-1]
date <-rmSpace(date)

category <-  page %>%
  html_nodes(".td_description .bold") %>%
  html_text() 
category<-rmSpace(category)

description <-  page %>%
  html_nodes(".td_description") %>%
  html_text() 
description <- description[-1]
description <- rmSpace(description)

price <-  page %>%
  html_nodes(".valeur_revente .montant-numeric") %>%
  html_text() 

price_normal <-  page %>%
  html_nodes(".valeur_faciale .montant-numeric") %>%
  html_text() 

links <-  page %>% html_nodes(".button_eae9e5") %>% html_attr("onclick")
links <- substr(links,31,nchar(links)-2)
tab <- cbind(category, price, price_normal, date, description, links)

UPDATE: I was able to get a nice table with rvest, but I haven't figured out how to get past the 15-row limit.

UPDATE 2: It appears there is a POST request that returns a JSON file. I imagine I can use that to return a larger table, but I'm lost on how to do that.

Mark · Accepted Answer

You should look into RSelenium. You can get details on Selenium from: http://docs.seleniumhq.org/.

Essentially, Selenium drives a web browser that renders the actual webpage, and you can then scrape the generated HTML. Depending on the browser you use, you'll be able to handle all sorts of fun web protocols. One easy-to-use browser to drive from R is PhantomJS (http://phantomjs.org/).

Consider the code below. First I point to the PhantomJS executable (where I can also specify a custom proxy!), then I create the driver and open a session. PhantomJS is great in part because it's 'headless', so you won't see any extra windows. Then you instruct your phantom web browser to navigate to your url and you grab the source.

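library(RSelenium)

# Start PhantomJS; the executable path and proxy are specific to my machine
pJS = phantom(pjs_cmd="C:/phantomjs2/bin/phantomjs.exe",extras="--proxy=localhost:3128")
remDr = remoteDriver(browserName = 'phantomjs')
remDr$open()

# Load the page and grab the rendered HTML (returned as a list of length one)
remDr$navigate(url)
soup = remDr$getPageSource()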
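
As a rough sketch of the next step, the rendered source can be handed back to rvest and queried with the same selectors used in the question. The HTML fragment here is a made-up stand-in for soup[[1]]; the real markup may differ, and the assumption that prices are plain numbers may not hold for the actual page. It only illustrates building the table and keeping the cheapest row per category.

```r
library(rvest)

# Hypothetical stand-in for soup[[1]]; it reuses the CSS classes from the
# question, but the real page's markup may differ.
page_source <- paste0(
  '<table>',
  '<tr><td class="td_description"><span class="bold">Cat 1</span> row A</td>',
  '<td class="valeur_revente"><span class="montant-numeric">120</span></td></tr>',
  '<tr><td class="td_description"><span class="bold">Cat 1</span> row B</td>',
  '<td class="valeur_revente"><span class="montant-numeric">95</span></td></tr>',
  '<tr><td class="td_description"><span class="bold">Cat 2</span> row C</td>',
  '<td class="valeur_revente"><span class="montant-numeric">60</span></td></tr>',
  '</table>'
)

page <- read_html(page_source)

category <- page %>%
  html_nodes(".td_description .bold") %>%
  html_text(trim = TRUE)

price <- page %>%
  html_nodes(".valeur_revente .montant-numeric") %>%
  html_text(trim = TRUE)

# Assumes prices parse directly as numbers
tickets <- data.frame(category = category,
                      price = as.numeric(price),
                      stringsAsFactors = FALSE)

# Sort by category then price, and keep the first (cheapest) row per category
ordered <- tickets[order(tickets$category, tickets$price), ]
best <- ordered[!duplicated(ordered$category), ]
print(best)
```

Keeping whole rows rather than only the minimum price means any extra columns, such as the links column from the question, would stay attached to the best deal.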

In general, for me, this has fixed 90% of web access issues like the one you describe.
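
If the POST request mentioned in the question's second update does return the full listing as JSON, calling it directly may be a lighter alternative to driving a browser. This is a minimal sketch, assuming the httr and jsonlite packages. The helper fetch_listings, its endpoint and form_fields arguments, and the sample JSON shape are all hypothetical, since the real request would have to be copied from the browser's network inspector.

```r
library(httr)
library(jsonlite)

# Hypothetical helper: `endpoint` and `form_fields` must be taken from the
# actual POST request observed in the browser's developer tools.
fetch_listings <- function(endpoint, form_fields) {
  resp <- POST(endpoint, body = form_fields, encode = "form")
  stop_for_status(resp)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}

# Parsing step shown on a made-up payload; the real JSON layout is unknown.
sample_json <- '[{"category":"Cat 1","price":120},{"category":"Cat 2","price":60}]'
listings <- fromJSON(sample_json)

# Row with the lowest price in the sample data
cheapest <- listings[which.min(listings$price), ]
print(cheapest)
```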
