pocketprotector

Reputation: 147

Web-scraping dynamic Javascript page with RSelenium and rvest

I am trying to create a dataframe of color IDs, description, and dates from this site, which takes day and month input through dropdown menus and returns, I think, a dynamic JS generated page. I'm new to coding and thought this would be a fun toy project. I'd like to use RSelenium to automate the dropdown selection, and rvest to scrape the generated content. The data frame structure I'm hoping for will look like:

description, date, meta
"paragraph about birthday", Jun 01, "DAFFODIL PANTONE 17-1512 POWERFUL KNOWING EXPRESSIVE"

I'm attempting to first use a for loop to just iterate through each month of the year on a single day then work my way up to get every day for every month.

I'm stuck on simply getting the loop to iterate through each month, and getting the content. I could use some conceptual help first on this part of the task and appreciate any insight!

library(RSelenium)
library(rvest)
library(tidyverse)
library(xml2)

## first run: docker run -d -p 4445:4444 selenium/standalone-chrome
## open a new connection to Chrome
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")

remDr$open()
remDr$navigate("https://www.pantone.com/pages/iphone/iphone_colorstrology.html#___1__") #Entering our URL gets the browser to navigate to the page
remDr$screenshot(display = TRUE) 

#### create list of month/days
 month_day<- read_html(remDr$getPageSource()[[1]])
 page_i <- month_day %>%
   html_nodes(".list") %>%
   html_children() %>% 
   html_text()

months <- page_i[1:12]
months <- (paste("'", months,"'", sep=''))
days <- page_i[13:43]
days <- as.numeric(days)


## create one XPath expression per month (paste0 is vectorized, so no loop is needed)
elements <- paste0("//option[contains(text(),", months, ")]")

## attempt at loop

total <- data.frame()

for (e in elements){
remDr$navigate("https://www.pantone.com/pages/iphone/iphone_colorstrology.html#___1__") 
      print(e)
      month <- remDr$findElement(using = 'xpath', e)
      month$clickElement()
      day <- remDr$findElement(using = 'xpath', "//select[@id='lstDay']//option[5]") ## arbitrarily picking the 5th of each month
      day$clickElement()
      submit <- remDr$findElement(using = 'xpath', "/html[1]/body[1]/form[1]/div[1]/a[1]")
      submit$clickElement()
      html <- read_html(remDr$getPageSource()[[1]])
      description <- html %>%  html_nodes(xpath = "//tr//tr[2]//td[1]") %>% html_text() %>% gsub("^\\s+|\\s+$", "", .)
      meta <- html %>% html_nodes(xpath = "//td[@id='tdBg']") %>%  html_text() %>% gsub("^\\s+|\\s+$", "", .) 
      date <- html %>% html_nodes(xpath = "//td[@id='bgHeaderDate']//div") %>%  html_text() %>% gsub("^\\s+|\\s+$", "", .)
      df <- data.frame(cbind(description,meta,date))
      total <- rbind(total, df)
}

I'm not getting any errors, but the results are unexpected on every run. Sometimes the loop repeats a single month/day combination (e.g. Jan 05 twelve times); other times it returns Jan 05 three times, Apr 05 three times, and so on.

Upvotes: 3

Views: 1405

Answers (2)

pocketprotector

Reputation: 147

Found a reasonable solution! It's not perfect, but it gets me a lot closer than I was before. I ended up writing a function per @QHarr's suggestion and using their rvest pattern:

library(rvest)

colorstrology <- function(i,j){

  body <- list('month' = i,'day' = j)
  url <- 'https://www.pantone.com/pages/iphone/iphone_colorstrology_results.aspx'
  page <- html_session(url) %>%
    rvest:::request_POST(url, body = body, encode = "form") %>%
    read_html()

  date <- page %>% html_node('table table td') %>% html_text() %>% 
    gsub('^\\s+|\\s+$|[\r\n\t]', '', .)
  description <- page %>% html_node('tr:nth-of-type(2) div') %>% html_text() %>% 
    gsub('^\\s+|\\s+$|[\r\n\t]', '', .)
  meta <- page %>% html_nodes('#tdBg span') %>% html_text()

  # return the one-date result explicitly instead of relying on the value of an assignment
  data.frame(date, description, meta)
}



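months <- 1:12
days <- 1:31  # note: this also requests dates that do not exist, e.g. Feb 31

# start from an empty data frame; date, description and meta only exist inside the function
df <- data.frame()
for (m in months){
  for (d in days){
    temp <- colorstrology(m,d)
    df <- rbind(df, temp)  # append so rows stay in calendar order
  }
}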



Upvotes: 1

QHarr

Reputation: 84455

I will come back and update this to pick up on my suggestions. Navigate to that page, then open the dev tools in a browser, say Chrome, with F12 and go to the Network tab. Then select a month and day and hit View Now. You will see traffic appear in the Network tab. The page makes a POST XHR request to get the content you see after clicking the view icon.


The POST request itself is very simple and has a body (form) that consists of the month and the day you selected:


So, you can mimic that POST request and then parse the response. An example for the date you mentioned could be:

library(rvest)

body <- list('month' = 6,'day' = 1)
url <- 'https://www.pantone.com/pages/iphone/iphone_colorstrology_results.aspx'
page <- html_session(url) %>%
  rvest:::request_POST(url, body = body, encode = "form") %>%
  read_html()

date <- page %>% html_node('table table td') %>% html_text() %>% 
  gsub('^\\s+|\\s+$|[\r\n\t]', '', .)
description <- page %>% html_node('tr:nth-of-type(2) div') %>% html_text() %>% 
  gsub('^\\s+|\\s+$|[\r\n\t]', '', .)
meta <- page %>% html_nodes('#tdBg span') %>% html_text()

df <- data.frame(date, description, meta)

Now, and this is what I will revisit later, the above could be converted into a function which returns a list or df that can be combined into a final dataframe. You could generate each body in advance and pass it as an argument to the function. I would look at using a Session object (an HTTP session) for the efficiency of re-using the current connection. The month and day could be updated in the form body during a loop or nested loop, depending on how they are to be generated. I am new to R and know it doesn't have dictionaries, but perhaps it has named lists, or some such, whereby you can scrape the month-to-value associations from the original page to use in looping. I would welcome learning from more experienced R people how the above might be achieved; there are some gaps in my R knowledge that prevent me from completing this today. Someone may post an answer along similar lines, which would be helpful.
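On the dictionary point, base R's named vectors and lists can play that role. Below is a minimal sketch of the idea, not a tested scraper: `month_lookup`, `fetch_stub` and `bodies` are names made up for illustration, and `fetch_stub` is only a stand-in for the real request function so the snippet runs offline. `month.name` is a constant that ships with base R; whether the site's dropdown labels match it exactly is an assumption.

```r
# Named vector used like a dictionary: month name -> month number.
# month.name is built into base R ("January", ..., "December").
month_lookup <- setNames(seq_along(month.name), month.name)
month_lookup[["June"]]

# Hypothetical stand-in for the scraping function, so this runs without a network call.
fetch_stub <- function(body) {
  data.frame(month = body$month, day = body$day, stringsAsFactors = FALSE)
}

# Bodies generated in advance, each one a named list like the POST form above.
bodies <- list(
  list(month = month_lookup[["June"]], day = 1),
  list(month = month_lookup[["July"]], day = 4)
)

# One data frame per body, combined with a single rbind at the end.
results <- do.call(rbind, lapply(bodies, fetch_stub))
```

Collecting the pieces in a list and binding once at the end tends to be tidier than growing a data frame inside the loop, though for a few hundred rows either approach should be fine.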


Generating the POST request bodies:

Looking at the dropdowns, they cover a standard year (2019 in the code below), so you can generate the required POST bodies in a nested for loop. I use 1 to 12 for the months and lubridate to return the number of days in each month:

library(lubridate)

for(i in seq(1,12)){
  date <- as.Date(gsub('placeholder',i, "2019-placeholder-01"), "%Y-%m-%d")
  days <- days_in_month(date)[[1]]
  for(j in seq(1,days)){
    body = list('month' = i,'day' = j)
    # pass body to function or add to an iterable for later looping
  }
}
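If you would rather build every body up front than inside the loop, a base R alternative is to generate the dates of a reference year and split them into month and day. This is only a sketch under the same assumption as above (a non-leap year, 2019); the variable names are illustrative.

```r
# Every calendar day of the reference year (2019 is assumed, as a non-leap year).
dates <- seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "day")

# Split each date into integer month and day.
months <- as.integer(format(dates, "%m"))
days <- as.integer(format(dates, "%d"))

# One form body per date, in the same shape as the POST body used earlier.
bodies <- Map(function(m, d) list(month = m, day = d), months, days)

length(bodies)
```

Each element of `bodies` can then be passed to the request function, for example with `lapply`.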

Upvotes: 4
