spindoctor

Reputation: 1895

Scraping this URL, R XML and getting siblings

Hi: I want to scrape the "Ontario" subtable of the table "Federal Electoral Districts – Representation Order of 2003". The URL is here: http://www.elections.ca/content.aspx?section=res&dir=cir/list&document=index&lang=e#list

I've tried this code and it gets me close, but not entirely there.

library(XML)
doc <- htmlParse('http://www.elections.ca/content.aspx?section=res&dir=cir/list&document=index&lang=e#list', useInternalNodes = TRUE)
doc2 <- getNodeSet(doc, "//table/caption[text()='Ontario']")

I know I could simply use readHTMLTable and pick out the particular table, but I also want to know how to select the sibling nodes of the caption node whose text equals "Ontario". Thanks.

Upvotes: 4

Views: 626

Answers (1)

jdharrison

Reputation: 30435

You can use the following-sibling axis in your XPath:

library(XML)
appURL <- 'http://www.elections.ca/content.aspx?section=res&dir=cir/list&document=index&lang=e#list'
# Parse the page into an internal document
doc <- htmlParse(appURL, encoding = "UTF-8")
# Select the table whose caption is "Ontario"; [[1]] takes the single match
tableNode <- doc["//*[@id='list']/following-sibling::table/caption[text()='Ontario']/.."][[1]]
myTable <- readHTMLTable(tableNode)
> head(myTable)
   Code          Federal Electoral Districts Population 2006
1 35001                       Ajax–Pickering         117,183
2 35002        Algoma–Manitoulin–Kapuskasing          77,961
3 35003 Ancaster–Dundas–Flamborough–Westdale         111,844
4 35004                               Barrie         128,430
5 35005                    Beaches–East York         104,831
6 35006                 Bramalea–Gore–Malton         152,698

To break down the XPath: the heading "Federal Electoral Districts – Representation Order of 2003" has id="list". IDs are unique within an HTML document, so we can use it as an anchor.

  • //*[@id='list'] Find the node whose id equals "list"
  • /following-sibling::table Get all the tables that follow it as siblings
  • /caption[text()='Ontario'] From those tables, select the caption child whose text equals "Ontario"
  • /.. Step up to the caption's parent, i.e. the table itself

The query returns the matching table nodes as a list. Only one node satisfies these conditions here, so [[1]] extracts it, and that node can then be processed by readHTMLTable.
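
To see each step in isolation without hitting the network, here is a minimal sketch run against a small inline HTML string. The markup, district names, and variable names below are made up for illustration and only mimic the general shape of the real page (a node with id="list" followed by sibling tables with captions). The last query also shows one way to get the siblings of the caption node itself, which is what the question asked about.

```r
library(XML)

# Hypothetical miniature of the page: an anchor heading followed by sibling tables
html <- '<html><body>
<h2 id="list">Districts</h2>
<table><caption>Quebec</caption>
<tr><th>Code</th><th>Name</th></tr>
<tr><td>24001</td><td>Abitibi</td></tr></table>
<table><caption>Ontario</caption>
<tr><th>Code</th><th>Name</th></tr>
<tr><td>35001</td><td>Ajax</td></tr>
<tr><td>35004</td><td>Barrie</td></tr></table>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Steps 1 and 2: every table that follows the id="list" node as a sibling
tables <- getNodeSet(doc, "//*[@id='list']/following-sibling::table")

# Step 3: narrow down to the caption whose text is exactly "Ontario"
captions <- getNodeSet(doc,
  "//*[@id='list']/following-sibling::table/caption[text()='Ontario']")

# Step 4: /.. climbs from the caption back to its parent table
tableNode <- getNodeSet(doc,
  "//*[@id='list']/following-sibling::table/caption[text()='Ontario']/..")[[1]]
ontario <- readHTMLTable(tableNode, stringsAsFactors = FALSE)

# Siblings of the caption itself: the elements that follow it inside the same table
captionSiblings <- getNodeSet(doc,
  "//table/caption[text()='Ontario']/following-sibling::*")
siblingNames <- sapply(captionSiblings, xmlName)
```

Note that text()='Ontario' is an exact match, so a caption with surrounding whitespace would need something like normalize-space(text())='Ontario' instead.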

Upvotes: 2
