Dd Pp

Reputation: 5987

Extracting a specific table with XPath

I want to extract a table from the web page http://en.wikipedia.org/wiki/Brazil_national_football_team

library(XML)
baseURL <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
xmltext <- htmlParse(baseURL)
xmltable <- xpathApply(xmltext, "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]") 

Here is the XPath: "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]"

Neither

xmltable <- xpathApply(xmltext, "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]")  

nor

xmltable <- xpathApply(xmltext, "//table[//tbody//tr//th//a[@title='CONCACAF Gold Cup']]")

returns the specified table. How should I write the XPath expression?
Please see the attachment.

Upvotes: 0

Views: 867

Answers (2)

Dd Pp

Reputation: 5987

I also found two tricks when parsing this page:

1. tbody is not recognized

tableNode <- xpathApply(xmltext, "//tbody") 

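returns nothing. The browser shows many tbody elements on the page, but none of them are recognized as actual elements in the parsed document, most likely because browsers add tbody to tables automatically while the raw HTML does not contain it.

2. The table can be selected directly with a predicate, without stepping up to parent elements

tableNode <- xpathApply(xmltext, "//table[@class='wikitable'][./tr/th/a[@title='CONCACAF Gold Cup']]")

This works too.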
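Both points can be checked offline with a small sketch. The HTML string below is a made-up miniature of the page (an assumption for illustration, not the real Wikipedia markup), and the variable names are hypothetical. It uses getNodeSet from the XML package, which evaluates an XPath expression and returns the matching nodes.

```r
library(XML)

# Hypothetical miniature of the page: raw HTML without any <tbody> tags
html <- "<html><body>
<table class='wikitable'>
  <tr><th colspan='2'><a title='CONCACAF Gold Cup'>Gold Cup record</a></th></tr>
  <tr><th>Year</th><th>Round</th></tr>
  <tr><td>1996</td><td>Runners-up</td></tr>
</table>
<table class='wikitable'>
  <tr><th><a title='Copa America'>Copa America record</a></th></tr>
</table>
</body></html>"

doc <- htmlParse(html, asText = TRUE)

# Point 1: the parsed tree contains no tbody nodes
tbodyNodes <- getNodeSet(doc, "//tbody")

# The expression from the question needs a tbody step, so it matches nothing
viaTbody <- getNodeSet(doc, "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]")

# Point 2: without tbody, the predicate selects the table itself
direct <- getNodeSet(doc, "//table[@class='wikitable'][./tr/th/a[@title='CONCACAF Gold Cup']]")

length(tbodyNodes)
length(viaTbody)
xmlName(direct[[1]])
```

Note that the leading dot in the predicate matters: a predicate that starts with // instead of .// is evaluated from the document root rather than from the current table, so it cannot single out one table.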

Upvotes: 0

sgibb

Reputation: 25736

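You have to use .. to get the parent element in your XPath:

//table[@class='wikitable']//th//a[@title='CONCACAF Gold Cup']/../../..

Each .. moves up one level, so the three steps climb from a to th, then to tr, and finally to table.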

To get the table you could use XML::readHTMLTable:

library(XML)
baseURL <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
xmltext <- htmlParse(baseURL)

## select the correct table
tableNode <- xpathApply(xmltext, "//table[@class='wikitable']//th//a[@title='CONCACAF Gold Cup']/../../..")[[1]]

## convert XMLNode into data.frame
concacafTable <- readHTMLTable(tableNode, header=FALSE, stringsAsFactors=FALSE)

## format table: drop the redundant "Gold Cup" header (row 1) and use row 2 as column names
colnames(concacafTable) <- unlist(concacafTable[2, ])
concacafTable <- concacafTable[-c(1, 2), ]
concacafTable
#   Year       Round GP W D L GF GA
#3  1996  Runners-up  4 3 0 1 10  3
#4  1998 Third Place  5 2 2 1  6  2
#5  2003  Runners-up  5 3 0 2  6  4
#6 Total        3/11 14 8 2 4 22  9
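As a possible alternative to counting .. steps, the XPath ancestor axis can pick the nearest enclosing table directly. This is a swapped-in technique, not part of the answer above. The sketch below runs on a made-up HTML snippet (the markup, the id attributes, and the variable names are assumptions for illustration) and compares both approaches.

```r
library(XML)

# Hypothetical snippet; the id attributes exist only to identify the result
html <- "<html><body>
<table class='wikitable' id='gold'>
  <tr><th colspan='2'><a title='CONCACAF Gold Cup'>Gold Cup record</a></th></tr>
  <tr><th>Year</th><th>Round</th></tr>
  <tr><td>1996</td><td>Runners-up</td></tr>
</table>
<table class='wikitable' id='other'>
  <tr><th><a title='Copa America'>Copa America record</a></th></tr>
</table>
</body></html>"

doc <- htmlParse(html, asText = TRUE)

# Parent steps as in the answer: a -> th -> tr -> table
viaParents <- getNodeSet(doc, "//table[@class='wikitable']//th//a[@title='CONCACAF Gold Cup']/../../..")

# Ancestor axis: [1] is the closest table that contains the link
viaAncestor <- getNodeSet(doc, "//a[@title='CONCACAF Gold Cup']/ancestor::table[1]")

# Both expressions should identify the same table
xmlGetAttr(viaParents[[1]], "id")
xmlGetAttr(viaAncestor[[1]], "id")
```

The ancestor form does not depend on how many elements sit between the link and the table, so it should keep working if the nesting depth changes.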

Upvotes: 1
