DalStars16

Reputation: 41

Webscraping in R

I'm working on a project studying state-issued municipal bonds, but I am having trouble getting my data. Using the XML package and the code below, I was able to get some of it.

> library(XML)
> nys="http://newyork.municipalbonds.com/bonds/issue/649787N87"
> nys.table=readHTMLTable(nys,asText=TRUE,which=4)
> nys.table=as.data.frame(nys.table)
> head(nys.table)
  Trade Date Trade Time Maturity Date Coupon   Price Yield Trade Amount      Trade Type
1 2012-09-27     2:49pm      2013-Apr 5.000% 102.522 0.289     $270,000 Investor bought
2 2012-09-27     1:17pm      2013-Apr 5.000% 102.290 0.712      $45,000    Inter-dealer

But that site only offers a small sample for free. The official website, EMMA, has the data for free, but I'm having a terrible time scraping it. When I try the same approach as before, I end up with an empty data frame:

nys="http://emma.msrb.org/SecurityView/SecurityDetailsTrades.aspx?cusip=649787N87"
nys.table=readHTMLTable(nys,asText=TRUE)
nys.table=as.data.frame(nys.table)
head(nys.table)

data frame with 0 columns and 0 rows

From what I understand, and I'm fairly certain about this, there is a standard T&C page that appears when you navigate to the site in a web browser. After running htmlParse(nys), the output is identical to the source of the T&C page, not the page where the data is actually located. So when the code runs, it looks for tables on the T&C page.
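
The empty result is consistent with that reading. Here is a minimal base R sketch, assuming readHTMLTable returned an empty list because the T&C page contains no tables (the variable names are made up for illustration):

```r
# Assumption: this empty list stands in for what readHTMLTable() returns
# when the parsed page (here, the T&C page) contains no <table> elements
no_tables <- list()

# Converting an empty list gives an empty data frame
empty_df <- as.data.frame(no_tables)
dim(empty_df)  # 0 rows, 0 columns
```

So the "0 columns and 0 rows" message probably means no tables were found, rather than a table being parsed incorrectly.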

I figured that this would be a fairly common problem, but so far I have not been able to find any posts where someone had a similar issue. If someone could point me in the right direction, I'd greatly appreciate it.

Upvotes: 3

Views: 1394

Answers (1)

nograpes

Reputation: 18323

I finally got it to work. I had to use Web Developer in Firefox, which let me see which name/value pair the site sets for the Disclaimer cookie. Sending that cookie with the request skips the T&C page:

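library(RCurl)  # getURLContent, postForm
library(XML)    # htmlParse, readHTMLTable
nys <- "http://emma.msrb.org/SecurityView/SecurityDetailsTrades.aspx?cusip=649787N87"
# Send the Disclaimer cookie so the server returns the data page, not the T&C page
txt <- getURLContent(nys, cookie = 'Disclaimer=Ratings')
readHTMLTable(htmlParse(txt, asText = TRUE))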

$ctl00_mainContentArea_tradeSearchResults
        Trade Date/Time   Settlement Date Price (%) Yield (%) Trade Amt ($) Trade Submission Type  
1   09/27/2012 : 02:49 PM      10/02/2012  102.5220     0.289       270,000       Customer bought  
2   09/27/2012 : 01:17 PM      10/02/2012    102.29     0.712        45,000    Inter-dealer Trade  
3   09/27/2012 : 01:17 PM      10/02/2012    102.29     0.712        45,000    Inter-dealer Trade  
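
Before parsing, it may be worth checking that the cookie actually got you past the disclaimer. A minimal sketch, assuming the data page always contains the table id shown in the output above. The helper name and the two HTML strings are hypothetical, made up for illustration:

```r
# Table id taken from the readHTMLTable output above
results_id <- "ctl00_mainContentArea_tradeSearchResults"

# Returns TRUE when the page source mentions the results table id
has_results_table <- function(html, id = results_id) {
  grepl(id, html, fixed = TRUE)
}

# Hypothetical page sources, invented for illustration only
disclaimer_page <- "<html><body><p>Terms and conditions</p></body></html>"
data_page <- paste0('<html><body><table id="', results_id,
                    '"></table></body></html>')

has_results_table(disclaimer_page)
has_results_table(data_page)
```

In a real run you would call has_results_table(txt) right after getURLContent and stop if it returns FALSE.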

To get the next 100 rows, you have to post a form with the current "viewstate":

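# Get next set
# Pull the hidden __VIEWSTATE field out of the page just fetched
viewstate <- gsub('.*"__VIEWSTATE" value="([^"]*)".*', '\\1', txt)

# Post the form back as if the "next" button had been clicked
txt <- postForm(nys,
                "__VIEWSTATE" = viewstate,
                "__EVENTTARGET" = "ctl00$mainContentArea$nextBottomButton",
                .opts = list(cookie = 'Disclaimer=Ratings'))
readHTMLTable(htmlParse(txt, asText = TRUE))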

$ctl00_mainContentArea_tradeSearchResults
        Trade Date/Time   Settlement Date Price (%) Yield (%) Trade Amt ($) Trade Submission Type  
1   06/27/2011 : 01:51 PM      06/30/2011  107.7350      0.65       600,000         Customer sold  
2   06/22/2011 : 12:05 PM      06/27/2011  107.1960     0.957         8,000       Customer bought  
3   06/22/2011 : 12:05 PM      06/27/2011  106.6960     1.226         8,000    Inter-dealer Trade  
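
One caveat on the viewstate step: gsub returns its input unchanged when the pattern does not match, so a failed match would post the whole page back as the viewstate. A slightly more defensive sketch uses regexec instead. The HTML fragment and its value are hypothetical stand-ins for the real page source:

```r
# Hypothetical HTML fragment standing in for the page source held in txt
html <- paste0(
  '<form method="post">\n',
  '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" ',
  'value="dDwtMTIzNDU2Nzg5Ow==" />\n',
  '<table id="results"></table>\n',
  '</form>'
)

# Capture everything between value=" and the next double quote
pattern <- '"__VIEWSTATE" value="([^"]*)"'
match <- regmatches(html, regexec(pattern, html))[[1]]

# match[1] is the full match, match[2] is the captured group;
# fall back to NA when the field is not found
viewstate <- if (length(match) >= 2) match[2] else NA_character_
print(viewstate)
```

Checking is.na(viewstate) before calling postForm makes a missing field fail loudly rather than send a bad request.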

Upvotes: 6
