Samo

Extract table from

I would like to use rvest to extract the table shown on http://finra-markets.morningstar.com/BondCenter/TRACEMarketAggregateStats.jsp (for any date).

I tried the following but failed to produce any result:

library(rvest)

url <- "http://finra-markets.morningstar.com/BondCenter/TRACEMarketAggregateStats.jsp"

htmlSession <-html_session(url)            ## create session

goForm <- html_form(htmlSession)[[2]]   ## pull form from session

#filledGoForm <- set_values(goForm, value="04/26/2017") # This does not work

filledGoForm <- goForm
filledGoForm$fields[[1]]$value <- "04/26/2017"

htmlSession <- submit_form(htmlSession, filledGoForm)

> htmlSession <- submit_form(htmlSession, filledGoForm)
Submitting with ''
Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode,  :
  Not Found (HTTP 404).

Any hints on how to do this would be highly appreciated.


Answers (1)

hrbrmstr

That site populates its tables through many XHR requests, and it establishes a server session with a hidden POST request that html_session() won't replicate.

We'll need to add in httr for some help:

library(httr)
library(rvest)

The first thing we need to do is hit the site to get an initial qs_wid cookie into the implicit cookie jar that curl, httr, and rvest share:

init <- GET("http://finra-markets.morningstar.com/MarketData/Default.jsp")

Next, we need to mimic the hidden "login" that the web page does:

nxt <- POST(url = "http://finra-markets.morningstar.com/finralogin.jsp", 
            body = list(redirectPage = "/BondCenter/TRACEMarketAggregateStats.jsp"), 
            encode = "form")

That creates a session on the server back-end and places a few other cookies in our cookie jar.

Finally:

GET(
  url = "http://finra-markets.morningstar.com/transferPage.jsp", 
  query = list(
    `path`="http://muni-internal.morningstar.com/public/MarketBreadth/C",
    `date`="04/24/2017",
    `_`=as.numeric(Sys.time())
  )
) -> res

makes the actual data request. You can wrap all three steps in a single function and parameterize that last GET.
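A minimal sketch of such a function, assuming the three endpoints still behave as described above. The names base_url, build_breadth_query() and fetch_market_breadth() are hypothetical; the URLs and parameters are the ones used in the steps above, and the date is formatted the same way as in that last GET.

```r
# Sketch only: wraps the three requests from this answer in one function.
# base_url, build_breadth_query and fetch_market_breadth are hypothetical names.
base_url <- "http://finra-markets.morningstar.com"

# Build the query for transferPage.jsp; `date` may be a Date or "YYYY-MM-DD".
build_breadth_query <- function(date) {
  list(
    path = "http://muni-internal.morningstar.com/public/MarketBreadth/C",
    date = format(as.Date(date), "%m/%d/%Y"),  # same mm/dd/yyyy form as above
    `_`  = as.numeric(Sys.time())              # cache-busting value, as above
  )
}

fetch_market_breadth <- function(date) {
  # 1. Pick up the initial qs_wid cookie.
  httr::GET(paste0(base_url, "/MarketData/Default.jsp"))

  # 2. Mimic the hidden "login" so the server creates a session.
  httr::POST(
    url    = paste0(base_url, "/finralogin.jsp"),
    body   = list(redirectPage = "/BondCenter/TRACEMarketAggregateStats.jsp"),
    encode = "form"
  )

  # 3. Request the table for the chosen date and fail loudly on HTTP errors.
  res <- httr::GET(
    url   = paste0(base_url, "/transferPage.jsp"),
    query = build_breadth_query(date)
  )
  httr::stop_for_status(res)
  res
}
```

With that in place, something like res <- fetch_market_breadth("2017-04-26") should give a response equivalent to the one above for the date in the question.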

Unfortunately, that returns a very broken HTML <table> that html_table() can't translate into a data frame automagically for you, but that shouldn't stop you:

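library(dplyr)  # provides as_tibble(), mutate_all(), rename(), mutate()

content(res) %>%
  html_nodes("td") %>%                          # every table cell, in document order
  html_text() %>%
  matrix(ncol=4, byrow=TRUE) %>%                # four cells per table row
  as.data.frame(stringsAsFactors=FALSE) %>%     # columns come out as V1..V4
  as_tibble() %>%                               # replaces the deprecated as_data_frame()
  mutate_all(as.numeric) %>%
  rename(all_issues=V1, investment_grade=V2, high_yield=V3, convertible=V4) %>%
  mutate(category = c("total_issues_traded", "advances", "declines", "unchanged", "high_52", "low_52", "dollar_volume"))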

## # A tibble: 7 × 5
##   all_issues investment_grade high_yield convertible            category
##        <dbl>            <dbl>      <dbl>       <dbl>               <chr>
## 1       7983             5602       2194         187 total_issues_traded
## 2       3025             1798       1100         127            advances
## 3       4448             3575        824          49            declines
## 4        124               42         75           7           unchanged
## 5        257               66        175          16             high_52
## 6        139              105         33           1              low_52
## 7      22601            16143       5742         715       dollar_volume
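If you'd rather not depend on dplyr for the reshaping, the same row-wise fill can be sketched in base R. The helper name cells_to_breadth_table() is hypothetical, and stripping commas is only an assumption about how the site might format large numbers; it expects the 28 cell texts in the order they appear in the page.

```r
# Sketch only: turn a flat character vector of <td> texts into the table.
# cells_to_breadth_table is a hypothetical helper name.
cells_to_breadth_table <- function(cells) {
  categories <- c("total_issues_traded", "advances", "declines", "unchanged",
                  "high_52", "low_52", "dollar_volume")

  # Seven rows of four cells each; anything else means the page layout changed.
  stopifnot(length(cells) == 4 * length(categories))

  # Assumption: drop whitespace and thousands separators before converting.
  values <- as.numeric(gsub(",", "", trimws(cells)))

  # byrow = TRUE fills the matrix in the order the cells appear in the HTML.
  m <- matrix(values, ncol = 4, byrow = TRUE)
  colnames(m) <- c("all_issues", "investment_grade", "high_yield", "convertible")

  data.frame(m, category = categories, stringsAsFactors = FALSE)
}
```

You would call it on the extracted cell text, e.g. cells_to_breadth_table(html_text(html_nodes(content(res), "td"))).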

To get the other data tables, open the Developer Tools in your browser (switch to one that has them if yours doesn't; you're likely on Windows given that you're doing finance things, and IE/Edge aren't very good browsers for introspection) and refresh the page to see the other requests that get made.
