Reputation: 63
I am trying to scrape the report log table from "https://www.heritageunits.com/Locomotive/Detail/NS8098" using the RCurl package with the code below. The call pulls in elements from the page, but when I scroll through the ten items in the list stored in "page", none of the elements includes the table.
library("RCurl")
# Read page
page <- GET(
url="https://heritageunits.com/Locomotive/Detail/NS8098",
config(cainfo = cafile), ssl.verifyhost = FALSE
)
I would also like to scrape the data from the tables on this page when you toggle to the reports from previous days, but I am not sure how to write the R code to select those earlier report pages. Any help would be appreciated. Thanks.
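For reference, a plain static fetch shows the same thing. A quick check with rvest (a sketch, assuming rvest is installed) finds no populated report table in the raw HTML, which suggests the rows are filled in by JavaScript after the page loads:
library(rvest)
# Fetch the static HTML and list any <table> nodes; the report rows are absent
static <- read_html("https://www.heritageunits.com/Locomotive/Detail/NS8098")
html_nodes(static, "table")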
Upvotes: 3
Views: 888
Reputation: 63
Building off what JackStat outlined above, I modified the page-determination scheme to handle units with fewer than five pages of reports (JackStat's algorithm throws an error on those). I also set it up to import a list of the units of interest to be tracked. Comments are included for the steps needed to get this running on a Windows PC.
library(RSelenium)
library(XML)
library(foreach)
### Ensure that the selenium-server-standalone.jar file and the Google Chrome driver are in the same folder
### as the Windows command directory setting
### Open Windows command
### Type in "java -jar selenium-server-standalone.jar" and hit enter
setwd("H:/heritage_units")
hu <- read.table("hu_tracked_101316.csv", sep = ",", header = TRUE, colClasses = "character")
hu.c <- hu[, 1]
# Start Selenium server
checkForServer()
startServer()
remDr <-
remoteDriver(
remoteServerAddr = "localhost"
, port = 4444
, browserName = "chrome"
)
remDr$open()
# check.names = FALSE keeps the spaced column names so they match readHTMLTable() output
master <- data.frame('Spotted On' = factor(), 'Location' = factor(), 'Direction' = factor(),
                     'Train No' = factor(), 'Leading' = factor(), 'Spotter Reputation' = factor(),
                     'Heritage Unit' = character(), check.names = FALSE)
for (u in seq_along(hu.c)) {
url <- paste("https://www.heritageunits.com/Locomotive/Detail/", hu.c[u], sep="")
print(hu.c[u])
# Navigate to page
remDr$navigate(url)
# Snag the html
outhtml <- remDr$findElement(using = 'xpath', "//*")
out <- outhtml$getElementAttribute("outerHTML")[[1]]
# Parse with XML
doc <- htmlParse(out, encoding = "UTF-8")
# get the last page so we can cycle through
PageNodes <- getNodeSet(doc, '//*[(@id = "history_paginate")]')
Pages <- sapply(X = PageNodes, FUN = xmlValue)
# Find the horizontal ellipsis in the pager text; sc ends up at the position
# of the last character that is neither a letter nor a digit
sc <- 0
for (j in 1:nchar(Pages)) {
  if (!(grepl("[[:alpha:]]", substr(Pages, j, j)) | grepl("[[:digit:]]", substr(Pages, j, j)))) {
    sc <- j
  }
}
if (sc == 0) {
  # No ellipsis: every page number is shown, so the last page is the
  # digit immediately before "Next"
  posN <- gregexpr(pattern = 'N', Pages)
  LastPage <- substr(Pages, posN[[1]] - 1, posN[[1]] - 1)
} else {
  # Ellipsis present: the last page number sits between it and "Next"
  posN <- gregexpr(pattern = 'N', Pages)
  LastPage <- substr(Pages, sc + 1, posN[[1]] - 1)
}
temp1 <- readHTMLTable(doc)$history
temp1$'Heritage Unit' <- hu.c[u]
# The page count comes back as a string; seq_len() also yields an empty
# sequence when there is only one page, so single-page units are safe
LastPage <- as.numeric(LastPage)
for (i in seq_len(LastPage - 1)) {
nextpage <- remDr$findElement("css selector", '#history_next')
nextpage$sendKeysToElement(list(key ="enter"))
# Take it slow so it gets each page
Sys.sleep(.50)
outhtml <- remDr$findElement(using = 'xpath', "//*")
out <- outhtml$getElementAttribute("outerHTML")[[1]]
# Parse with XML
doc <- htmlParse(out, encoding = "UTF-8")
temp2 <- readHTMLTable(doc)$history
temp2$'Heritage Unit' <- hu.c[u]
temp1 <- rbind(temp1, temp2)
}
master <- rbind(master, temp1)
}
write.csv(master, "hu_sel_date.csv")
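For reference, the tracked-units file read at the top is just a one-column CSV of unit IDs under a header row. A hypothetical version (the column name and unit IDs here are placeholders, not a real tracked list) could be written like this:
# Hypothetical hu_tracked_101316.csv: a header line plus one unit ID per line
writeLines(c("Unit", "NS8098", "NS1074"), "hu_tracked_101316.csv")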
Upvotes: 0
Reputation: 3947
Missed by a few minutes. I took the RSelenium snippet from another question and altered it to suit. I think this one's a little shorter, though. I didn't hit any issues with the page not loading.
## required packages
library(RSelenium)
library(rvest)
library(magrittr)
library(dplyr)
## start RSelenium
checkForServer()
startServer()
remDr <- remoteDriver()
remDr$open()
## send Selenium to the page
remDr$navigate("https://www.heritageunits.com/Locomotive/Detail/NS8098")
## get the page html
page_source <- remDr$getPageSource()
## parse it and extract the table, convert to data.frame
read_html(page_source[[1]]) %>% html_nodes("table") %>% html_table() %>% extract2(1)
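magrittr's extract2(1) is just the pipe-friendly form of [[1]]; it picks the first data.frame out of the list that html_table() returns. A base-R equivalent (sketch) of that last step:
tables <- read_html(page_source[[1]]) %>% html_nodes("table") %>% html_table()
report_log <- tables[[1]]  # same as extract2(1)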
Upvotes: 1
Reputation: 1653
Occasionally I am able to find a JSON file in the source that you can hit directly, but I couldn't find one here. I went with RSelenium and had it click the next button and cycle through the pages. This method is frail, so you have to pay attention when you run it. If the datatable is not fully loaded, it will duplicate the last page, so I used a small Sys.sleep to make sure it waited long enough. I would recommend checking for duplicate rows at the end to catch this. Again, it is frail, but it works.
library(RSelenium)
library(XML)
library(foreach)
# Start Selenium server
checkForServer()
startServer()
remDr <-
remoteDriver(
remoteServerAddr = "localhost"
, port = 4444
, browserName = "chrome"
)
remDr$open()
# Navigate to page
remDr$navigate("https://www.heritageunits.com/Locomotive/Detail/NS8098")
# Snag the html
outhtml <- remDr$findElement(using = 'xpath', "//*")
out <- outhtml$getElementAttribute("outerHTML")[[1]]
# Parse with XML
doc <- htmlParse(out, encoding = "UTF-8")
# get the last page so we can cycle through
PageNodes <- getNodeSet(doc, '//*[(@id = "history_paginate")]')
Pages <- sapply(X = PageNodes, FUN = xmlValue)
LastPage <- as.numeric(gsub('Previous12345\\…(.*)Next', '\\1', Pages))
# loop through one click at a time
Locomotive <- foreach(i = 1:(LastPage-1), .combine = 'rbind', .verbose = TRUE) %do% {
if(i == 1){
readHTMLTable(doc)$history
} else {
nextpage <- remDr$findElement("css selector", '#history_next')
nextpage$sendKeysToElement(list(key ="enter"))
# Take it slow so it gets each page
Sys.sleep(.50)
outhtml <- remDr$findElement(using = 'xpath', "//*")
out <- outhtml$getElementAttribute("outerHTML")[[1]]
# Parse with XML
doc <- htmlParse(out, encoding = "UTF-8")
readHTMLTable(doc)$history
}
}
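Following the note above about duplicated pages, a minimal check (a sketch, assuming the combined result is in Locomotive) would be:
# Drop any rows duplicated by a too-fast next-page click
Locomotive <- Locomotive[!duplicated(Locomotive), ]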
Upvotes: 3