bmosov01
bmosov01

Reputation: 599

Read URLs for CSV files into vector

This question is similar to the one I asked here. I've tried the techniques from the answer, but for some reason I'm getting no matches to my xpath and am looking for some additional guidance.

I am trying to scrape the URLs of all the downloadable CSV files from the following webpage: Page with CSV Files. Here's what I've tried so far.

library(rvest)
library(dplyr)

myURL <- 'https://marketplace.spp.org/pages/rtbm-lmp-by-location#%2F2017%2F11%2FBy_Day'

attempt1 <- read_html( myURL ) %>%
  html_nodes( xpath='//*[contains(@class, "f-csv")]/..' ) %>%
  html_attr('href')

attempt2 <- read_html( myURL ) %>%
  html_nodes( xpath='//*[contains(@class, "files") and contains(@href, ".csv")]' ) %>%
  html_attr('href')

Both attempts return empty character vectors. The calls to html_nodes() both return lists of zero elements.

When I search for the filemanager class in the page, I get a nonempty list, but I'm having trouble drilling down further.

This returns a non-empty list:

filemanager <- read_html( myURL ) %>%
  html_nodes( xpath='//*[contains(@class, "filemanager")]' )

From right-clicking the download icon, I see the URL for the file dated 2017-11-01 (note the filename is different from the displayed update time) should be: https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=%2F2017%2F11%2FBy_Day%2FRTBM-LMP-DAILY-SL-20171101.csv. This downloads the CSV for me when I click it.

Any thoughts on how to return the download URLs for the CSV files?

Upvotes: 1

Views: 117

Answers (1)

hrbrmstr
hrbrmstr

Reputation: 78832

The site makes an XHR request and dynamically builds those blocks, so no amount of plain web scraping is going to work.

Hack-grabbing the XHR request that it makes isn't going to work either b/c the site is coded semi-well and has extensive CSRF protection.

So, we need to get one normal page of the site which has CSRF metadata in it, extract those tokens, then make a faux XHR request to the same target the page does.

The site was also pretty uniform (another sign they did OK w/r/t building the site/app) so it was fairly easy to make a generic function that lets you input year, month and "type" (I made some guesses based on the info in the 'boxes' on the site) and get a list of the contents of the CSV files back as data frames. By default, it uses the current year and month and defaults to "By_Day".

The CSVs take a bit to download, so it prints a message for each one as it downloads. You may not need that functionality, but it shld be straightforward to do what you need with the initial XHR response.

I tried to keep the dependencies to a minimum, but purrr and dplyr wld make for (IMO) better additions.

rtbm_lmp_by_location <- function(year = format(Sys.Date(), "%Y"),
                                 month = format(Sys.Date(), "%m"),
                                 by_type = c("By_Day", "By_Interval", "RePrice")) {

  require(rvest)
  require(jsonlite)
  require(httr)

  by_type <- match.arg(by_type, c("By_Day", "By_Interval", "RePrice"))

  res <- GET("https://marketplace.spp.org/pages/rtbm-lmp-by-location")
  doc <- content(res)

  x_csrf_token <- html_attr(html_node(doc, "meta[id='_csrf']"), "content")
  x_spp_csrf_token <- html_attr(html_node(doc, "meta[id='_spp_csrf']"), "content")

  POST(
    url = "https://marketplace.spp.org/file-api/", 
    add_headers(
      Host = "marketplace.spp.org", 
      Referer = "https://marketplace.spp.org/pages/rtbm-lmp-by-location",
      `X-CSRF-TOKEN` = x_csrf_token,
      `X-SPP-CSRF-TOKEN` = x_spp_csrf_token,
      `X-Requested-With` = "XMLHttpRequest"
    ), 
    body = list(
      name = "rtbm-lmp-by-location",
      fsName = "rtbm-lmp-by-location", 
      type = "folder", 
      path = sprintf("/%s/%s/%s", year, month, by_type)
    ), 
    encode = "json"
  ) -> res

  res <- content(res, as="text")
  res <- jsonlite::fromJSON(res, flatten=TRUE)
  res$path <- sprintf("https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=%s",
                      res$path)

  lapply(res$path, function(.x) {
    message(sprintf("Downloading <%s>...", .x))
    read.csv(.x, stringsAsFactors=FALSE)
  })

}

fils <- rtbm_lmp_by_location()
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171101.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171102.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171103.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171104.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171105.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171106.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171107.csv>...
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171108.csv>...

And, here's the data it pulled back:

str(fils)
## List of 8
##  $ :'data.frame': 272448 obs. of  8 variables:
##   ..$ Interval                : chr [1:272448] "11/01/2017 00:05:00" "11/01/2017 00:05:00" "11/01/2017 00:05:00" "11/01/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:272448] "11/01/2017 05:05:00" "11/01/2017 05:05:00" "11/01/2017 05:05:00" "11/01/2017 05:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:272448] 16.2 18 21.3 15.9 21.4 ...
##   ..$ MLC                     : num [1:272448] 0.8342 0.3078 0.0295 0.5966 -0.1799 ...
##   ..$ MCC                     : num [1:272448] 0 2.34 5.94 0 6.19 ...
##   ..$ MEC                     : num [1:272448] 15.3 15.3 15.3 15.3 15.3 ...
##  $ :'data.frame': 272448 obs. of  8 variables:
##   ..$ Interval                : chr [1:272448] "11/02/2017 00:05:00" "11/02/2017 00:05:00" "11/02/2017 00:05:00" "11/02/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:272448] "11/02/2017 05:05:00" "11/02/2017 05:05:00" "11/02/2017 05:05:00" "11/02/2017 05:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:272448] 17.1 16.7 16.6 16.8 16.4 ...
##   ..$ MLC                     : num [1:272448] 0.5527 0.1549 0.0498 0.2766 -0.1663 ...
##   ..$ MCC                     : num [1:272448] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ MEC                     : num [1:272448] 16.6 16.6 16.6 16.6 16.6 ...
##  $ :'data.frame': 272448 obs. of  8 variables:
##   ..$ Interval                : chr [1:272448] "11/03/2017 00:05:00" "11/03/2017 00:05:00" "11/03/2017 00:05:00" "11/03/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:272448] "11/03/2017 05:05:00" "11/03/2017 05:05:00" "11/03/2017 05:05:00" "11/03/2017 05:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:272448] 18.9 18.3 17.8 18.6 17.5 ...
##   ..$ MLC                     : num [1:272448] 0.819 0.191 -0.221 0.566 -0.584 ...
##   ..$ MCC                     : num [1:272448] 0 0 0 0 0 0 0 0 -0.0076 0 ...
##   ..$ MEC                     : num [1:272448] 18.1 18.1 18.1 18.1 18.1 ...
##  $ :'data.frame': 272448 obs. of  8 variables:
##   ..$ Interval                : chr [1:272448] "11/04/2017 00:05:00" "11/04/2017 00:05:00" "11/04/2017 00:05:00" "11/04/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:272448] "11/04/2017 05:05:00" "11/04/2017 05:05:00" "11/04/2017 05:05:00" "11/04/2017 05:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:272448] 0.0107 4.4038 5.2691 0.0108 5.1795 ...
##   ..$ MLC                     : num [1:272448] 3e-04 2e-04 0e+00 4e-04 -1e-04 1e-04 -1e-04 0e+00 1e-04 3e-04 ...
##   ..$ MCC                     : num [1:272448] 0 4.39 5.26 0 5.17 ...
##   ..$ MEC                     : num [1:272448] 0.0104 0.0104 0.0104 0.0104 0.0104 0.0104 0.0104 0.0104 0.0105 0.0104 ...
##  $ :'data.frame': 283800 obs. of  8 variables:
##   ..$ Interval                : chr [1:283800] "11/05/2017 00:05:00" "11/05/2017 00:05:00" "11/05/2017 00:05:00" "11/05/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:283800] "11/05/2017 05:05:00" "11/05/2017 05:05:00" "11/05/2017 05:05:00" "11/05/2017 05:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:283800] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:283800] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:283800] 12 14.7 18.4 12.1 18.6 ...
##   ..$ MLC                     : num [1:283800] 0.4667 0.3877 0.2521 0.5528 0.0704 ...
##   ..$ MCC                     : num [1:283800] 0.0008 2.8321 6.6661 0 7.0045 ...
##   ..$ MEC                     : num [1:283800] 11.5 11.5 11.5 11.5 11.5 ...
##  $ :'data.frame': 272448 obs. of  8 variables:
##   ..$ Interval                : chr [1:272448] "11/06/2017 00:05:00" "11/06/2017 00:05:00" "11/06/2017 00:05:00" "11/06/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:272448] "11/06/2017 06:05:00" "11/06/2017 06:05:00" "11/06/2017 06:05:00" "11/06/2017 06:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:272448] 19.6 19.1 19.1 19.5 18.7 ...
##   ..$ MLC                     : num [1:272448] 0.0621 0.0153 0.0905 0.2946 -0.2689 ...
##   ..$ MCC                     : num [1:272448] 0.6728 0.2223 0.0854 0.282 0.0761 ...
##   ..$ MEC                     : num [1:272448] 18.9 18.9 18.9 18.9 18.9 ...
##  $ :'data.frame': 272448 obs. of  8 variables:
##   ..$ Interval                : chr [1:272448] "11/07/2017 00:05:00" "11/07/2017 00:05:00" "11/07/2017 00:05:00" "11/07/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:272448] "11/07/2017 06:05:00" "11/07/2017 06:05:00" "11/07/2017 06:05:00" "11/07/2017 06:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:272448] 21.5 20.2 19.6 21.1 19.2 ...
##   ..$ MLC                     : num [1:272448] 0.232 -0.277 -0.62 0.344 -0.985 ...
##   ..$ MCC                     : num [1:272448] 0 -0.819 -1.145 -0.58 -1.156 ...
##   ..$ MEC                     : num [1:272448] 21.3 21.3 21.3 21.3 21.3 ...
##  $ :'data.frame': 272448 obs. of  8 variables:
##   ..$ Interval                : chr [1:272448] "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" ...
##   ..$ GMT.Interval            : chr [1:272448] "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" ...
##   ..$ Settlement.Location.Name: chr [1:272448] "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##   ..$ PNODE.Name              : chr [1:272448] "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##   ..$ LMP                     : num [1:272448] 19 19 18.9 19.4 18.6 ...
##   ..$ MLC                     : num [1:272448] 0.1562 0.1086 0.0215 0.465 -0.3251 ...
##   ..$ MCC                     : num [1:272448] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ MEC                     : num [1:272448] 18.9 18.9 18.9 18.9 18.9 ...

Having done all that and seeing the URLs it retrieves:

https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171108.csv

You can really not even bother with all ^^ and just sprintf() or glue the variable path components together for any given day of the year:

rtbm_lmp_by_location_by_day <- function(date) {
  date <- as.Date(date)
  y <- format(date, "%Y")
  m <- as.numeric(format(date, "%m"))
  ymd <- format(date, "%Y%m%d")
  sprintf("https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/%s/%s/By_Day/RTBM-LMP-DAILY-SL-%s.csv",
          y, m, ymd) -> fil
  res <- httr::HEAD(fil)
  if (httr::status_code(res) != 200) {
    message("File not found")
    return(invisible(NULL))
  } else {
    message(sprintf("Downloading <%s>", fil))
    read.csv(fil, stringsAsFactors=FALSE)
  }
}

xdf <- rtbm_lmp_by_location_by_day("2017-11-08")
## Downloading <https://marketplace.spp.org/file-api/download/rtbm-lmp-by-location?path=/2017/11/By_Day/RTBM-LMP-DAILY-SL-20171108.csv>

str(xdf)
## 'data.frame': 272448 obs. of  8 variables:
##  $ Interval                : chr  "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" "11/08/2017 00:05:00" ...
##  $ GMT.Interval            : chr  "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" "11/08/2017 06:05:00" ...
##  $ Settlement.Location.Name: chr  "AEC" "AECC_CSWS" "AECC_ELKINS" "AECC_FITZHUGH" ...
##  $ PNODE.Name              : chr  "SOUC" "CSWS_AECC_LA" "CSWSELKINSUNELKINS_RA" "CSWSFITZHUGHPLT1" ...
##  $ LMP                     : num  19 19 18.9 19.4 18.6 ...
##  $ MLC                     : num  0.1562 0.1086 0.0215 0.465 -0.3251 ...
##  $ MCC                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ MEC                     : num  18.9 18.9 18.9 18.9 18.9 ...

It shld be a straightforward process to figure out the patterns for the other categories.

Upvotes: 1

Related Questions