Lawrence

Reputation: 277

web scraping a webpage with multiple tabs in R

I am trying to scrape the expiry dates in R from the following webpage: https://www.theice.com/productguide/ProductSpec.shtml?specId=251#expiry. The page contains several tabs, and the expiry dates are on only one of them. The code I use is:

library(RCurl)
Canola <- 'https://www.theice.com/productguide/ProductSpec.shtml?specId=251#expiry'
WS <- getURL(Canola,ssl.verifypeer=FALSE)
library(XML)
ParsedData <- htmlParse(WS)
CanolaExpDate <- readHTMLTable(ParsedData)
names(CanolaExpDate)

However, the output is the trading hours table from the first tab (Product Specification), not the expiry dates.

I am new to web scraping and not knowledgeable about HTML. Please advise.

Upvotes: 2

Views: 1509

Answers (1)

GSee

Reputation: 49820

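I searched through the source code of that page for "expiry" and saw how the URLs are formed. The #expiry part of your URL is a fragment, which the browser handles itself and never sends to the server, so getURL receives the default first tab. Using &expiryDates instead of #expiry requests the expiry view directly and leads to a table that is easier to parse.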

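library(RCurl)
library(XML)
# Request the expiry-dates view directly via a query parameter
Canola <- "https://www.theice.com/productguide/ProductSpec.shtml?specId=251&expiryDates"
WS <- getURL(Canola)
# readHTMLTable() returns a list of tables; the expiry table is the first one
x <- readHTMLTable(WS, stringsAsFactors=FALSE)
# Convert every column to Date. The format has no %Y, so as.Date() fills in
# the current year, which is why every row below shows 2013.
as.data.frame(lapply(x[[1]], as.Date, format="%a %b %d %X"))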

#   Contract.Symbol        FTD        LTD        FND        LND        FDD        LDD Options.FTD Options.LTD
#1       2013-07-01 2013-05-16 2013-07-12 2013-06-28 2013-07-15 2013-07-02 2013-07-16        <NA>  2013-06-21
#2       2013-08-01 2013-03-25 2013-07-26 2013-07-31 2013-08-15 2013-08-01 2013-08-16        <NA>  2013-07-26
#3       2013-09-01 2013-08-27 2013-08-23 2013-08-30 2013-09-16 2013-09-03 2013-09-17        <NA>  2013-08-23
#4       2013-10-01 2013-05-27 2013-09-20 2013-09-30 2013-10-15 2013-10-01 2013-10-16        <NA>  2013-09-20
#5       2013-11-01 2013-07-15 2013-11-14 2013-10-31 2013-11-15 2013-11-01 2013-11-18        <NA>  2013-10-25
#6       2013-01-01 2013-11-15 2013-01-14 2013-12-31 2013-01-15 2013-01-02 2013-01-16        <NA>  2013-12-20
#7       2013-03-01 2013-01-17 2013-03-14 2013-02-28 2013-03-17 2013-03-03 2013-03-18        <NA>  2013-02-21
#8       2013-05-01 2013-03-15 2013-05-14 2013-04-30 2013-05-15 2013-05-01 2013-05-16        <NA>  2013-04-25
#9       2013-07-01 2013-05-15 2013-07-14 2013-06-30 2013-07-15 2013-07-02 2013-07-16        <NA>  2013-06-20
#10      2013-11-01 2013-07-16 2013-11-14 2013-10-31 2013-11-17 2013-11-03 2013-11-18        <NA>  2013-10-24
#11      2013-01-01 2013-11-15 2013-01-14 2013-12-31 2013-01-15 2013-01-02 2013-01-16        <NA>  2013-12-19
#12      2013-03-01 2013-01-15 2013-03-13 2013-02-27 2013-03-16 2013-03-02 2013-03-17        <NA>  2013-02-20
#13      2013-05-01 2013-03-15 2013-05-14 2013-04-30 2013-05-15 2013-05-01 2013-05-19        <NA>  2013-04-24
#14      2013-07-01 2013-05-15 2013-07-14 2013-06-30 2013-07-15 2013-07-02 2013-07-16        <NA>  2013-06-26
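
Since the format string above does not read the year, here is a minimal sketch of one way to keep it. The two sample strings are hypothetical: I am assuming each cell looks like "Fri Jul 12 00:00:00 EDT 2013", which the original page may or may not match, so check x[[1]] before relying on this. The variable names raw, parts and dates are only for illustration.

```r
# Hypothetical cell values; the real table may use a different layout.
raw <- c("Fri Jul 12 00:00:00 EDT 2013", "Tue Jan 14 00:00:00 EST 2014")

# Split each string on single spaces: weekday, month, day, time, zone, year.
parts <- do.call(rbind, strsplit(raw, " ", fixed = TRUE))

# Build dates from year, month and day. month.abb is always English,
# so this does not depend on the session locale the way %b does.
dates <- as.Date(ISOdate(
  year  = as.integer(parts[, 6]),
  month = match(parts[, 2], month.abb),
  day   = as.integer(parts[, 3])
))
dates
```

Splitting on spaces avoids asking strptime to read the time-zone abbreviation, which it handles poorly on input.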

Edit: More on how I found the URL I used above. I didn't use any developer tools. I just right-clicked, selected "view source", and searched for "expiry". There's an app.urls section that contains something like this:

'expiry':'/productguide/ProductSpec.shtml;jsessionid=C59BE223F113CFDD340BF23CC07EEFFC?expiryDates=&specId=251'

So, I tried omitting the jsessionid part and I went to

https://theice.com/productguide/ProductSpec.shtml?expiryDates=&specId=251

and it looked interesting. I only reordered it to https://www.theice.com/productguide/ProductSpec.shtml?specId=251&expiryDates because I thought the URL looked nicer like that.
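
The manual steps above could be roughly sketched in R as follows. This works on the single app.urls line quoted earlier rather than on a live download, and the names src_line, path and url are only for illustration. The host prefix is the one from the question's URL.

```r
# The line found in the page source (copied from the snippet above).
src_line <- "'expiry':'/productguide/ProductSpec.shtml;jsessionid=C59BE223F113CFDD340BF23CC07EEFFC?expiryDates=&specId=251'"

# Pull out the quoted path that follows the 'expiry' key.
path <- sub("^'expiry':'([^']*)'$", "\\1", src_line)

# Drop the ";jsessionid=..." segment, which belongs to one browser session.
path <- sub(";jsessionid=[^?]*", "", path)

# Prepend the host to get a URL that can be passed to getURL().
url <- paste0("https://www.theice.com", path)
url
```

The parameter order differs from the URL I used, but both forms carry the same two query parameters.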

Upvotes: 1
