Mohan Raj

Reputation: 93

Extracting table from HTML (Yahoo) by using XML package in R

Using the XML package in R, I tried to extract a table with the following code:

url <- "https://in.finance.yahoo.com/intlindices?e=americas"

America <- readHTMLTable(url, which=1, header=TRUE, stringsAsFactors=FALSE)

When I ran the code above, I got this output:

**Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
In addition: Warning message:
XML content does not seem to be XML: 'https://in.finance.yahoo.com/intlindices?e=americas'**

When I parse the URL, I get the following warning:

**Warning message: XML content does not seem to be XML: **

Am I using the wrong package, or is there a mistake in my code?

Upvotes: 0

Views: 286

Answers (1)

jlhoward

Reputation: 59345

Try this:

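library(httr)
library(XML)
# url is the address from the question
url <- "https://in.finance.yahoo.com/intlindices?e=americas"
# Fetch the page with httr and parse the response body as HTML
doc <- content(GET(url), type="text/html")
# Select the div that wraps the table, then read the first matching node
readHTMLTable(doc["//div[@id='yfitp']"][[1]])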
#        V1                V2                      V3              V4                      V5
# 1   ^MERV            MerVal 10,887.94 12 Sep 1:30am 181.54  (1.64%) Components, Chart, More
# 2   ^BVSP           Bovespa 46,400.50 12 Sep 1:47am 103.49  (0.22%) Components, Chart, More
# 3 ^GSPTSE S&P TSX Composite 13,461.47 12 Sep 1:50am 108.42  (0.80%)             Chart, More
# 4    ^MXX               IPC 42,780.73 12 Sep 1:36am 107.78  (0.25%) Components, Chart, More
# 5   ^GSPC         500 Index  1,961.05 12 Sep 2:02am    8.76 (0.45%)             Chart, More

Edit: Clarification based on comment below.

The expression doc["//div[@id='yfitp']"] is equivalent to getNodeSet(doc, "//div[@id='yfitp']") and returns a list of the nodes in doc that match the specified XPath expression. Since this is a node set, but readHTMLTable(...) requires a single node, we take the first node in the set (which is also the only node in this case).
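To see the equivalence without depending on the live page, here is a minimal sketch that parses a small inline HTML string. The HTML below is a hypothetical stand-in I made up for illustration, not the real Yahoo markup, and the variable names (html, nodes_a, nodes_b, table_node, tbl) are my own. Note that this sketch passes the table node itself to readHTMLTable(...), whereas the code above passes the enclosing div.

```r
library(XML)

# Hypothetical stand-in for the page: a div with id 'yfitp' wrapping one table
html <- "<html><body>
<div id='yfitp'>
  <table>
    <tr><th>Symbol</th><th>Name</th></tr>
    <tr><td>^MERV</td><td>MerVal</td></tr>
    <tr><td>^BVSP</td><td>Bovespa</td></tr>
  </table>
</div>
</body></html>"

# Parse the string as HTML (asText = TRUE: the input is content, not a file or URL)
doc <- htmlParse(html, asText = TRUE)

# Two ways of running the same XPath query; both return a node set
nodes_a <- doc["//div[@id='yfitp']"]
nodes_b <- getNodeSet(doc, "//div[@id='yfitp']")

# Pick the table element inside the div; [[1]] takes a single node from the set
table_node <- getNodeSet(doc, "//div[@id='yfitp']/table")[[1]]

# Read that one node into a data frame, using the first row as the header
tbl <- readHTMLTable(table_node, header = TRUE, stringsAsFactors = FALSE)
print(tbl)
```

Both node sets should have the same length (one div each), and indexing with [[1]] is what turns a node set into the single node that readHTMLTable(...) expects.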

If the question is how to determine the XPath string: I examined the DOM of the page in Firefox, and it was clear that the relevant table was a child node of the div element, like this:

<div id=yfitp>
  <table>
     ...
  </table>
</div>

Upvotes: 1
