Reputation: 5987
I want to extract a table from the web page http://en.wikipedia.org/wiki/Brazil_national_football_team
library(XML)
baseURL <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
xmltext <- htmlParse(baseURL)
xmltable <- xpathApply(xmltext, "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]")
Here is the XPath: "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]"
Neither
xmltable <- xpathApply(xmltext, "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]")
nor
xmltable <- xpathApply(xmltext, "//table[//tbody//tr//th//a[@title='CONCACAF Gold Cup']]")
returns the specified table. How should I write the XPath expression?
Please see the attachment.
Upvotes: 0
Views: 867
Reputation: 5987
I found two tricks for parsing this page as well:
1. tbody is not recognized
tableNode <- xpathApply(xmltext, "//tbody")
returns nothing. There are many tbody elements on the page, but none of them is recognized as an actual element.
2. You can select the table directly, without navigating to the parent element:
tableNode <- xpathApply(xmltext, "//table[@class='wikitable'][./tr/th/a[@title='CONCACAF Gold Cup']]")
works too.
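A likely explanation for point 1, stated as an assumption rather than something checked against this exact page: browser inspectors display tbody elements that the browser inserts on its own, while the raw HTML that htmlParse reads does not contain them. The sketch below uses made-up markup (not copied from Wikipedia) to show the effect:

```r
library(XML)

# Hypothetical markup: two tables written without <tbody>, the way the
# raw page source is assumed to look.
html <- paste0(
  "<html><body>",
  "<table class='wikitable'><tr><th><a title='FIFA World Cup'>World Cup</a></th></tr>",
  "<tr><td>1958</td></tr></table>",
  "<table class='wikitable'><tr><th><a title='CONCACAF Gold Cup'>Gold Cup</a></th></tr>",
  "<tr><td>1996</td></tr></table>",
  "</body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# The source has no tbody, so a path that goes through tbody matches nothing.
withTbody <- getNodeSet(doc, "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]")

# Dropping tbody from the predicate selects the table itself.
withoutTbody <- getNodeSet(doc, "//table[.//th/a[@title='CONCACAF Gold Cup']]")

length(withTbody)
length(withoutTbody)
```

If the assumption holds, the first query comes back empty and the second returns only the Gold Cup table, which matches the behaviour described above.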
Upvotes: 0
Reputation: 25736
You have to use .. to get the parent element in your XPath: //table[@class='wikitable']//th//a[@title='CONCACAF Gold Cup']/../../..
To get the table you could use XML::readHTMLTable:
library(XML)
baseURL <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
xmltext <- htmlParse(baseURL)
## grep correct table
tableNode <- xpathApply(xmltext, "//table[@class='wikitable']//th//a[@title='CONCACAF Gold Cup']/../../..")[[1]]
## convert XMLNode into data.frame
concacafTable <- readHTMLTable(tableNode, header=FALSE, stringsAsFactors=FALSE)
## format table: remove the useless "Gold Cup" header (row 1) and use row 2 as the header
colnames(concacafTable) <- concacafTable[2, ]
concacafTable <- concacafTable[-c(1,2),]
concacafTable
#   Year       Round GP W D L GF GA
#3  1996  Runners-up  4 3 0 1 10  3
#4  1998 Third Place  5 2 2 1  6  2
#5  2003  Runners-up  5 3 0 2  6  4
#6 Total        3/11 14 8 2 4 22  9
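As a side note on the parent steps: each .. climbs one level (a to th, th to tr, tr to table), so the number of steps depends on the nesting depth. A possible alternative is the XPath ancestor axis, which does not need the depth to be counted. This is only a sketch on hypothetical markup, not the real page:

```r
library(XML)

# Hypothetical single-table snippet standing in for the Wikipedia page.
html <- paste0(
  "<table class='wikitable'>",
  "<tr><th colspan='2'><a title='CONCACAF Gold Cup'>Gold Cup record</a></th></tr>",
  "<tr><th>Year</th><th>Round</th></tr>",
  "<tr><td>1996</td><td>Runners-up</td></tr>",
  "</table>"
)
doc <- htmlParse(html, asText = TRUE)

link <- "//table[@class='wikitable']//th//a[@title='CONCACAF Gold Cup']"

# Three parent steps: a -> th -> tr -> table.
viaParent <- getNodeSet(doc, paste0(link, "/../../.."))[[1]]

# ancestor::table[1] picks the nearest enclosing table, whatever the depth.
viaAncestor <- getNodeSet(doc, paste0(link, "/ancestor::table[1]"))[[1]]

xmlName(viaParent)
xmlName(viaAncestor)
```

Both expressions should land on the same table node here; the ancestor form would also keep working if a tbody level were present in the parsed document.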
Upvotes: 1