Lisa Ann

Reputation: 3485

What makes table web scraping with the rvest package sometimes fail?

I'm playing with the rvest package and trying to figure out why it sometimes fails to scrape objects that clearly look like tables.

Consider for instance a script like this:

require(rvest)
url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_nodes(xpath='//*[@id="options"]/table/tbody/tr/td/table[2]/tbody') %>%
  html_table()
population

If I inspect population, it's an empty list:

> population
list()

Another example:

require(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_nodes(xpath='//*[@id="Col1-1-OptionContracts-Proxy"]/section/section[1]/div[2]') %>%
  html_table()
population

I was wondering whether using PhantomJS is mandatory, as explained here, or whether the problem lies elsewhere.

Upvotes: 1

Views: 69

Answers (1)

QHarr

Reputation: 84465

Neither of your current XPaths actually selects just the table. In both cases I think you need to pass an HTML table node to html_table, because under the hood there is this check:

html_table.xml_node(.) : html_name(x) == "table" 
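A quick way to see what a selector really returns, before handing it to html_table, is to look at the number of matched nodes and their tag names. The sketch below uses a small inline HTML string as a hypothetical stand-in for a page's raw source (the markup and values are made up for illustration, not taken from either site). It assumes the raw source has no tbody element, which is a common difference between what the browser inspector shows and what read_html receives.

```r
library(rvest)

# Hypothetical raw source: a table with no <tbody>, as servers often send it.
html <- xml2::read_html('
<div id="options">
  <table class="optionchain">
    <tr><th>Strike</th><th>Last</th></tr>
    <tr><td>100</td><td>1.5</td></tr>
  </table>
</div>')

# A path copied from the browser inspector that includes tbody matches nothing here.
too_specific <- html_nodes(html, xpath = '//*[@id="options"]/table/tbody')
n_matched <- length(too_specific)

# html_table() applied to an empty node set gives back an empty list.
empty_result <- html_table(too_specific)

# Selecting the table element itself gives html_table() something to parse.
table_node <- html_node(html, ".optionchain")
node_name <- html_name(table_node)
result <- html_table(table_node)

print(n_matched)
print(node_name)
print(result)
```

If the node count is zero, the selector is the problem rather than html_table; if html_name reports something other than "table", the selector needs to be narrowed or widened to the table element.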

Also, long XPaths are fragile, especially when a path that is valid for browser-rendered content is applied to the HTML that rvest receives, because JavaScript doesn't run with rvest (and browsers may insert elements such as tbody that are absent from the raw source). Personally, I prefer short CSS selectors. Here you can use a class selector, the second-fastest selector type, and you only need to specify a single class:

library(rvest)
url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_node('.optionchain') %>%
  html_table()

The table needs cleaning, of course, due to "merged" cells in the source, but you get the idea.

With xpath you could do:

library(rvest)
url <- "http://bigcharts.marketwatch.com/quickchart/options.asp?symb=SPY"
population <- url %>%
  xml2::read_html() %>%
  html_node(xpath='//table[2]') %>%
  html_table()

Note: I shortened the XPath and work with a single node that represents a table.


For your second example:

Again, your XPath does not select a table element. The table's class attribute is multi-valued, but a single, well-chosen class is enough in XPath, i.e. //*[contains(@class,"calls")]. Select a single table node:

library(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_node(xpath='//*[contains(@class,"calls")]') %>%
  html_table()

Once again, my preference is for a CSS selector (less typing!):

library(rvest)
url <- "https://finance.yahoo.com/quote/SPY/options?straddle=false"
population <- url %>%
  xml2::read_html() %>%
  html_node('.calls') %>%
  html_table()
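One difference between the two selectors above is worth keeping in mind: the CSS selector .calls matches the class as a whole token, while contains(@class,"calls") is a plain substring test. The sketch below illustrates this on a hypothetical inline HTML snippet; the class names list-options, puts and recalls are invented for the illustration and are not taken from the Yahoo page.

```r
library(rvest)

# Hypothetical markup: two tables with multi-valued classes, plus a div whose
# class merely contains the substring "calls".
html <- xml2::read_html('
<table class="calls list-options"><tr><th>Strike</th></tr><tr><td>300</td></tr></table>
<table class="puts list-options"><tr><th>Strike</th></tr><tr><td>310</td></tr></table>
<div class="recalls">not a table</div>')

# CSS class selector: matches only elements that have "calls" as a class token.
by_css <- html_nodes(html, ".calls")

# XPath substring test: also matches class="recalls".
by_xpath <- html_nodes(html, xpath = '//*[contains(@class,"calls")]')

# Restricting the XPath to table elements avoids the extra match in this snippet.
by_xpath_table <- html_nodes(html, xpath = '//table[contains(@class,"calls")]')

css_names <- html_name(by_css)
xpath_names <- html_name(by_xpath)
table_names <- html_name(by_xpath_table)

print(css_names)
print(xpath_names)
print(table_names)
```

With html_node (singular) only the first match is returned, so the substring test is harmless as long as the table comes first in the document; anchoring the XPath on table is a cheap safeguard when that is not guaranteed.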

Upvotes: 1
