Reputation: 147
I am trying to create a data frame of color IDs, descriptions, and dates from this site, which takes day and month input through dropdown menus and returns what I think is a dynamically generated JavaScript page. I'm new to coding and thought this would be a fun toy project. I'd like to use RSelenium to automate the dropdown selection, and rvest to scrape the generated content. The data frame structure I'm hoping for will look like:
description, date, meta
"paragraph about birthday", Jun 01, "DAFFODIL PANTONE 17-1512 POWERFUL KNOWING EXPRESSIVE"
I'm attempting to first use a for loop to just iterate through each month of the year on a single day then work my way up to get every day for every month.
I'm stuck on simply getting the loop to iterate through each month, and getting the content. I could use some conceptual help first on this part of the task and appreciate any insight!
library(RSelenium)
library(rvest)
library(tidyverse)
library(xml2)
## first run: docker run -d -p 4445:4444 selenium/standalone-chrome
## open a new connection to Chrome
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
port = 4445L,
browserName = "chrome")
remDr$open()
remDr$navigate("https://www.pantone.com/pages/iphone/iphone_colorstrology.html#___1__") #Entering our URL gets the browser to navigate to the page
remDr$screenshot(display = TRUE)
#### create list of month/days
month_day<- read_html(remDr$getPageSource()[[1]])
page_i <- month_day %>%
html_nodes(".list") %>%
html_children() %>%
html_text()
months <- page_i[1:12]
months <- (paste("'", months,"'", sep=''))
days <- page_i[13:43]
days <- as.numeric(days)
## create an object for month xpath elements
for (m in months){
elements <- paste0("//option[contains(text(),",months,")]")
}
## attempt at loop
total <- data.frame()
for (e in elements){
remDr$navigate("https://www.pantone.com/pages/iphone/iphone_colorstrology.html#___1__")
print(e)
month <- remDr$findElement(using = 'xpath', e)
month$clickElement()
day <- remDr$findElement(using = 'xpath', "//select[@id='lstDay']//option[5]") ## arbitrarily picking the 5th of each month
day$clickElement()
submit <- remDr$findElement(using = 'xpath', "/html[1]/body[1]/form[1]/div[1]/a[1]")
submit$clickElement()
html <- read_html(remDr$getPageSource()[[1]])
description <- html %>% html_nodes(xpath = "//tr//tr[2]//td[1]") %>% html_text() %>% gsub("^\\s+|\\s+$", "", .)
meta <- html %>% html_nodes(xpath = "//td[@id='tdBg']") %>% html_text() %>% gsub("^\\s+|\\s+$", "", .)
date <- html %>% html_nodes(xpath = "//td[@id='bgHeaderDate']//div") %>% html_text() %>% gsub("^\\s+|\\s+$", "", .)
df <- data.frame(cbind(description,meta,date))
total <- rbind(total, df)
}
I'm not getting any errors, but the results are unexpected and differ on each run. The loop either repeats a single month/day combination (e.g. Jan 05 twelve times) or repeats a few combinations several times each (e.g. Jan 05 three times, then Apr 05 three times, etc.).
Upvotes: 3
Views: 1405
Reputation: 147
Found a reasonable solution! It's not perfect, but it gets me a lot closer than I was before. I ended up writing a function per @QHarr's suggestion and using their rvest pattern:
library(rvest)
colorstrology <- function(i,j){
body <- list('month' = i,'day' = j)
url <- 'https://www.pantone.com/pages/iphone/iphone_colorstrology_results.aspx'
page <- html_session(url) %>%
rvest:::request_POST(url, body = body, encode = "form") %>%
read_html()
date <- page %>% html_node('table table td') %>% html_text() %>%
gsub('^\\s+|\\s+$|[\r\n\t]', '', .)
description <- page %>% html_node('tr:nth-of-type(2) div') %>% html_text() %>%
gsub('^\\s+|\\s+$|[\r\n\t]', '', .)
meta <- page %>% html_nodes('#tdBg span') %>% html_text()
df <- data.frame(date, description, meta)
}
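months <- 1:12
days <- 1:31
# start from an empty data frame: date, description and meta only exist inside colorstrology()
df <- data.frame()
for (m in months){
for (d in days){
temp <- colorstrology(m,d)
# append the new rows so the result stays in calendar order
df <- rbind(df, temp)
}
}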
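One thing to watch in the function above: `html_nodes('#tdBg span')` can return several strings, and `data.frame()` then recycles `date` and `description` into one row per span. The sketch below shows, without touching the network, how collapsing the pieces gives the single `meta` string from the question. The way the example text is split into `meta_parts` is an assumption for illustration, not the real page structure.

```r
# Assumed example values; the real ones come from the scraped page
date_txt    <- "Jun 01"
description <- "paragraph about birthday"
meta_parts  <- c("DAFFODIL", "PANTONE 17-1512", "POWERFUL", "KNOWING", "EXPRESSIVE")

# Without collapsing: one row per span, date and description recycled
long <- data.frame(date = date_txt, description = description,
                   meta = meta_parts, stringsAsFactors = FALSE)

# With collapsing: a single row per date, as in the desired output
one_row <- data.frame(date = date_txt, description = description,
                      meta = paste(meta_parts, collapse = " "),
                      stringsAsFactors = FALSE)

print(nrow(long))
print(one_row)
```

If that applies to the real page, wrapping the `meta` line in `paste(..., collapse = " ")` inside `colorstrology()` should be enough.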
Upvotes: 1
Reputation: 84455
I will come back and update this to pick up on my suggestions. Navigate to that page then open the dev tools in a browser, say Chrome, with F12 and go to the network tab. Then, select a month and date and hit View Now. You will see traffic appear in the network tab. The page makes a POST xhr request to get the content you see after clicking the view icon.
The POST request itself is very simple: its form body consists of the month and the day you selected.
So, you can mimic that POST request and then parse the response. An example for the date you mentioned could be:
library(rvest)
body <- list('month' = 6,'day' = 1)
url <- 'https://www.pantone.com/pages/iphone/iphone_colorstrology_results.aspx'
page <- html_session(url) %>%
rvest:::request_POST(url, body = body, encode = "form") %>%
read_html()
date <- page %>% html_node('table table td') %>% html_text() %>%
gsub('^\\s+|\\s+$|[\r\n\t]', '', .)
description <- page %>% html_node('tr:nth-of-type(2) div') %>% html_text() %>%
gsub('^\\s+|\\s+$|[\r\n\t]', '', .)
meta <- page %>% html_nodes('#tdBg span') %>% html_text()
df <- data.frame(date, description, meta)
Now, and this is what I will revisit later, the above could be converted into a function that returns a list or data frame, with the results combined into a final data frame. You could generate each body in advance and pass it as an argument to the function. I would look at using a Session object (an HTTP session) for the efficiency of re-using the current connection. The month and day could be updated in the form body during a loop or nested loop, depending on how they are to be generated. I am new to R and know it doesn't have dictionaries, but perhaps it has named lists, or some such, whereby you can scrape the month-to-value associations from the original page to use in looping. I would welcome learning from more experienced R people how the above might be achieved; there are some gaps in my R knowledge that stop me completing this today. Someone may post an answer along similar lines, which would be helpful.
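On the dictionary point: R's named vectors and named lists can play that role. The sketch below is only an illustration. It assumes the dropdown's month values are simply 1 to 12 in calendar order (consistent with `month = 6` for June in the request above), and `month_values` is a made-up name.

```r
# Assumption: month values in the form are 1..12 in calendar order
month_values <- setNames(1:12, month.name)  # named integer vector

# Look up a value by name, much like a dictionary
june <- month_values[["June"]]

# Build a POST body from the lookup
body <- list(month = june, day = 1)
str(body)
```

The same lookup works on a named list (`as.list(month_values)[["June"]]`), which may be more convenient if the values scraped from the page are not all of one type.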
Generating the POST request bodies:
Looking at the dropdowns, they cover a standard (non-leap) year, so you can generate the required POST bodies in a nested for loop. I use 1 to 12 for the months and lubridate's days_in_month to return the number of days in each month of a standard year:
library(lubridate)
for(i in seq(1,12)){
date <- as.Date(gsub('placeholder',i, "2019-placeholder-01"), "%Y-%m-%d")
days <- days_in_month(date)[[1]]
for(j in seq(1,days)){
body = list('month' = i,'day' = j)
# pass body to function or add to an iterable for later looping
}
}
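As a rough sketch of the "add to an iterable" option, the bodies can be collected into one list and the per-date results row-bound at the end. To keep it runnable offline, this version uses base R with hard-coded non-leap-year month lengths instead of lubridate, and `fetch_one` is a hypothetical stand-in for the real POST-and-parse function.

```r
# Days per month in a non-leap year (same result as days_in_month for 2019)
days_in <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

# One list(month, day) per valid date, flattened into a single list
bodies <- unlist(
  lapply(seq_along(days_in), function(m) {
    lapply(seq_len(days_in[m]), function(d) list(month = m, day = d))
  }),
  recursive = FALSE
)

# Hypothetical placeholder: the real function would POST `body` and parse the page
fetch_one <- function(body) {
  data.frame(month = body$month, day = body$day, stringsAsFactors = FALSE)
}

# Apply the function to every body and bind the rows once at the end
results <- do.call(rbind, lapply(bodies, fetch_one))
print(length(bodies))
print(head(results))
```

Binding once with `do.call(rbind, ...)` avoids growing a data frame inside the loop, which gets slow as the number of rows increases.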
Upvotes: 4