Reputation: 599
This question is similar to the one I asked here. I've tried the techniques from the answer, but for some reason I'm getting no matches to my xpath
and am looking for some additional guidance.
I am trying to scrape the URLs of all the downloadable CSV files from the following webpage: Page with CSV Files. Here's what I've tried so far.
library(rvest)
library(dplyr)
myURL <- 'https://marketplace.spp.org/pages/rtbm-lmp-by-location#%2F2017%2F11%2FBy_Day'
attempt1 <- read_html( myURL ) %>%
html_nodes( xpath='//*[contains(@class, "f-csv")]/..' ) %>%
html_attr('href')
attempt2 <- read_html( myURL ) %>%
html_nodes( xpath='//*[contains(@class, "files") and contains(@href, ".csv")]' ) %>%
html_attr('href')
Both attempts return empty character vectors. The calls to html_nodes()
both return lists of zero elements.
When I search for the filemanager
class in the page, I get a nonempty list, but I'm having trouble drilling down further.
This returns a non-empty list:
filemanager <- read_html( myURL ) %>%
html_nodes( xpath='//*[contains(@class, "filemanager")]' )
From right-clicking the download icon, I see the URL for the file dated 2017-11-01 (note the filename is different from the displayed update time) should be: https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=%2F2017%2F11%2FBy_Day%2FRTBM-LMP-DAILY-SL-20171101.csv. This downloads the CSV for me when I click it.
Any thoughts on how to return the download URLs for the CSV files?
Upvotes: 1
Views: 117
Reputation: 78832
The site makes an XHR request and dynamically builds those blocks, so no amount of plain web scraping is going to work.
Hack-grabbing the XHR request that it makes isn't going to work either b/c the site is coded semi-well and has extensive CSRF protection.
So, we need to get one normal page of the site which has CSRF metadata in it, extract those tokens, then make a faux XHR request to the same target the page does.
The site was also pretty uniform (another sign they did OK w/r/t building the site/app) so it was fairly easy to make a generic function that lets you input year, month and "type" (I made some guesses based on the info in the 'boxes' on the site) and get a list of the contents of the CSV files back as data frames. By default, it uses the current year and month and defaults to "By_Day
".
The CSVs take a bit to download, so it prints a message for each one as it downloads. You may not need that functionality, but it shld be straightforward to do what you need with the initial XHR response.
I tried to keep the dependencies to a minimum, but purrr
and dplyr
wld make for (IMO) better additions.
rtbm_lmp_by_location <- function(year = format(Sys.Date(), "%Y"),
month = format(Sys.Date(), "%m"),
by_type = c("By_Day", "By_Interval", "RePrice")) {
require(rvest)
require(jsonlite)
require(httr)
by_type <- match.arg(by_type, c("By_Day", "By_Interval", "RePrice"))
res <- GET("https://marketplace.spp.org/pages/rtbm-lmp-by-location")
doc <- content(res)
x_csrf_token <- html_attr(html_node(doc, "meta[id='_csrf']"), "content")
x_spp_csrf_token <- html_attr(html_node(doc, "meta[id='_spp_csrf']"), "content")
POST(
url = "https://marketplace.spp.org/file-api/",
add_headers(
Host = "marketplace.spp.org",
Referer = "https://marketplace.spp.org/pages/rtbm-lmp-by-location",
`X-CSRF-TOKEN` = x_csrf_token,
`X-SPP-CSRF-TOKEN` = x_spp_csrf_token,
`X-Requested-With` = "XMLHttpRequest"
),
body = list(
name = "rtbm-lmp-by-location",
fsName = "rtbm-lmp-by-location",
type = "folder",
path = sprintf("/%s/%s/%s", year, month, by_type)
),
encode = "json"
) -> res
res <- content(res, as="text")
res <- jsonlite::fromJSON(res, flatten=TRUE)
res$path <- sprintf("https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=%s",
res$path)
lapply(res$path, function(.x) {
message(sprintf("Downloading <%s>...", .x))
read.csv(.x, stringsAsFactors=FALSE)
})
}
fils <- rtbm_lmp_by_location()
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171101.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171102.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171103.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171104.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171105.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171106.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171107.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171108.csv>...
And, here's the data it pulled back:
str(fils)
## List of 8
## $ :'data.frame': 272448 obs. of 8 variables:
## ..$ Interval : chr [1:272448] "11/01/2017 00:05:00" "11/01/2017 00:05:00" "11/01/2017 00:05:00" "11/01/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:272448] "11/01/2017 05:05:00" "11/01/2017 05:05:00" "11/01/2017 05:05:00" "11/01/2017 05:05:00" ...
## ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:272448] 16.2 18 21.3 15.9 21.4 ...
## ..$ MLC : num [1:272448] 0.8342 0.3078 0.0295 0.5966 -0.1799 ...
## ..$ MCC : num [1:272448] 0 2.34 5.94 0 6.19 ...
## ..$ MEC : num [1:272448] 15.3 15.3 15.3 15.3 15.3 ...
## $ :'data.frame': 272448 obs. of 8 variables:
## ..$ Interval : chr [1:272448] "11/02/2017 00:05:00" "11/02/2017 00:05:00" "11/02/2017 00:05:00" "11/02/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:272448] "11/02/2017 05:05:00" "11/02/2017 05:05:00" "11/02/2017 05:05:00" "11/02/2017 05:05:00" ...
## ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:272448] 17.1 16.7 16.6 16.8 16.4 ...
## ..$ MLC : num [1:272448] 0.5527 0.1549 0.0498 0.2766 -0.1663 ...
## ..$ MCC : num [1:272448] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ MEC : num [1:272448] 16.6 16.6 16.6 16.6 16.6 ...
## $ :'data.frame': 272448 obs. of 8 variables:
## ..$ Interval : chr [1:272448] "11/03/2017 00:05:00" "11/03/2017 00:05:00" "11/03/2017 00:05:00" "11/03/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:272448] "11/03/2017 05:05:00" "11/03/2017 05:05:00" "11/03/2017 05:05:00" "11/03/2017 05:05:00" ...
## ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:272448] 18.9 18.3 17.8 18.6 17.5 ...
## ..$ MLC : num [1:272448] 0.819 0.191 -0.221 0.566 -0.584 ...
## ..$ MCC : num [1:272448] 0 0 0 0 0 0 0 0 -0.0076 0 ...
## ..$ MEC : num [1:272448] 18.1 18.1 18.1 18.1 18.1 ...
## $ :'data.frame': 272448 obs. of 8 variables:
## ..$ Interval : chr [1:272448] "11/04/2017 00:05:00" "11/04/2017 00:05:00" "11/04/2017 00:05:00" "11/04/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:272448] "11/04/2017 05:05:00" "11/04/2017 05:05:00" "11/04/2017 05:05:00" "11/04/2017 05:05:00" ...
## ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:272448] 0.0107 4.4038 5.2691 0.0108 5.1795 ...
## ..$ MLC : num [1:272448] 3e-04 2e-04 0e+00 4e-04 -1e-04 1e-04 -1e-04 0e+00 1e-04 3e-04 ...
## ..$ MCC : num [1:272448] 0 4.39 5.26 0 5.17 ...
## ..$ MEC : num [1:272448] 0.0104 0.0104 0.0104 0.0104 0.0104 0.0104 0.0104 0.0104 0.0105 0.0104 ...
## $ :'data.frame': 283800 obs. of 8 variables:
## ..$ Interval : chr [1:283800] "11/05/2017 00:05:00" "11/05/2017 00:05:00" "11/05/2017 00:05:00" "11/05/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:283800] "11/05/2017 05:05:00" "11/05/2017 05:05:00" "11/05/2017 05:05:00" "11/05/2017 05:05:00" ...
## ..$ Settlement.Location.Name: chr [1:283800] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:283800] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:283800] 12 14.7 18.4 12.1 18.6 ...
## ..$ MLC : num [1:283800] 0.4667 0.3877 0.2521 0.5528 0.0704 ...
## ..$ MCC : num [1:283800] 0.0008 2.8321 6.6661 0 7.0045 ...
## ..$ MEC : num [1:283800] 11.5 11.5 11.5 11.5 11.5 ...
## $ :'data.frame': 272448 obs. of 8 variables:
## ..$ Interval : chr [1:272448] "11/06/2017 00:05:00" "11/06/2017 00:05:00" "11/06/2017 00:05:00" "11/06/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:272448] "11/06/2017 06:05:00" "11/06/2017 06:05:00" "11/06/2017 06:05:00" "11/06/2017 06:05:00" ...
## ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:272448] 19.6 19.1 19.1 19.5 18.7 ...
## ..$ MLC : num [1:272448] 0.0621 0.0153 0.0905 0.2946 -0.2689 ...
## ..$ MCC : num [1:272448] 0.6728 0.2223 0.0854 0.282 0.0761 ...
## ..$ MEC : num [1:272448] 18.9 18.9 18.9 18.9 18.9 ...
## $ :'data.frame': 272448 obs. of 8 variables:
## ..$ Interval : chr [1:272448] "11/07/2017 00:05:00" "11/07/2017 00:05:00" "11/07/2017 00:05:00" "11/07/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:272448] "11/07/2017 06:05:00" "11/07/2017 06:05:00" "11/07/2017 06:05:00" "11/07/2017 06:05:00" ...
## ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:272448] 21.5 20.2 19.6 21.1 19.2 ...
## ..$ MLC : num [1:272448] 0.232 -0.277 -0.62 0.344 -0.985 ...
## ..$ MCC : num [1:272448] 0 -0.819 -1.145 -0.58 -1.156 ...
## ..$ MEC : num [1:272448] 21.3 21.3 21.3 21.3 21.3 ...
## $ :'data.frame': 272448 obs. of 8 variables:
## ..$ Interval : chr [1:272448] "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" ...
## ..$ GMT.Interval : chr [1:272448] "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" ...
## ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## ..$ PNODE.Name : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## ..$ LMP : num [1:272448] 19 19 18.9 19.4 18.6 ...
## ..$ MLC : num [1:272448] 0.1562 0.1086 0.0215 0.465 -0.3251 ...
## ..$ MCC : num [1:272448] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ MEC : num [1:272448] 18.9 18.9 18.9 18.9 18.9 ...
Having done all that and seeing the URLs it retrieves:
https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171108.csv
You can really not even bother with all ^^ and just sprintf()
or glue
the variable path components together for any given day of the year:
rtbm_lmp_by_location_by_day <- function(date) {
date <- as.Date(date)
y <- format(date, "%Y")
m <- as.numeric(format(date, "%m"))
ymd <- format(date, "%Y%m%d")
sprintf("https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/%s/%s/By_Day/RTBM-LMP-DAILY-SL-%s.csv",
y, m, ymd) -> fil
res <- httr::HEAD(fil)
if (httr::status_code(res) != 200) {
message("File not found")
return(invisible(NULL))
} else {
message(sprintf("Downloading <%s>", fil))
read.csv(fil, stringsAsFactors=FALSE)
}
}
xdf <- rtbm_lmp_by_location_by_day("2017-11-08")
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171108.csv>
str(xdf)
## 'data.frame': 272448 obs. of 8 variables:
## $ Interval : chr "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" ...
## $ GMT.Interval : chr "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" ...
## $ Settlement.Location.Name: chr "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
## $ PNODE.Name : chr "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
## $ LMP : num 19 19 18.9 19.4 18.6 ...
## $ MLC : num 0.1562 0.1086 0.0215 0.465 -0.3251 ...
## $ MCC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ MEC : num 18.9 18.9 18.9 18.9 18.9 ...
It shld be a straightforward process to figure out the patterns for the other categories.
Upvotes: 1