Reputation: 31
I am trying to scrape/extract data from the single HTML table on http://www.theplantlist.org/tpl/record/kew-419248 and a number of very similar pages. I initially tried using the following function to read the table, but it wasn't ideal because I want to separate each species name into its component parts (genus, species, infraspecies, author, etc.).
library(XML)
readHTMLTable("http://www.theplantlist.org/tpl/record/kew-419248")
I used SelectorGadget to identify a unique XPath expression for each table element that I want to extract (not necessarily the shortest):
For genus names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "genus", " " ))]
For species names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "species", " " ))]
For infraspecies ranks: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]
For infraspecies names: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspe", " " ))]
For confidence levels (image): //*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img
For sources: //*[contains(concat( " ", @class, " " ), concat( " ", "source", " " ))]//a
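For readers unfamiliar with the concat/contains idiom in these expressions: it tests for one whitespace-separated token inside a class attribute, so it still matches when an element carries several classes. The following is a minimal sketch on a made-up two-row table; the markup and the helper has_class are hypothetical and only mirror the class names listed above, not the real page.

```r
library(XML)

# Hypothetical markup (an assumption): class names follow the list above,
# but the real page structure may differ.
html <- '<table class="names synonyms"><tbody>
<tr><td class="name Synonym"><i class="genus">Hordeum</i> <i class="species">chilense</i> <span class="infraspr">var.</span> <i class="infraspe">chilense</i></td></tr>
<tr><td class="name Synonym"><i class="genus">Hordeum</i> <i class="species">cylindricum</i></td></tr>
</tbody></table>'
doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: build the SelectorGadget-style test for one class token.
has_class <- function(cls) {
  sprintf('contains(concat(" ", @class, " "), " %s ")', cls)
}

# Genus elements nested inside any element that has the Synonym class.
genus_path <- sprintf('//*[%s]//*[%s]', has_class("Synonym"), has_class("genus"))
genus <- xpathSApply(doc, genus_path, xmlValue)

# An exact comparison finds nothing here, because the attribute is "name Synonym".
exact <- getNodeSet(doc, '//*[@class="Synonym"]')
```

In this sketch the token test finds both genus cells, while the exact comparison on @class returns an empty node set.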
I now want to extract the information into a dataframe/table.
I tried using the xpathSApply function of the XML package to extract some of this data:
e.g. for infraspecies ranks
library(XML)
library(RCurl)
infraspeciesrank = htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-419248"))
path=' //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]'
xpathSApply(infraspeciesrank, path)
However, this method is problematic because of gaps in the data (e.g. only some rows of the table have an infraspecies rank, so all I have returned is a list of the three ranks in the table, with no gaps). The data output is also of a class that I have had trouble attaching to a dataframe.
Does anyone know a better way to extract information from this table into a dataframe?
Any help would be much appreciated!
Tom
Upvotes: 3
Views: 5786
Reputation: 55695
Here is another solution, which splits each species name into its component parts:
library(XML)
library(plyr)

# read url into html tree
url = "http://www.theplantlist.org/tpl/record/kew-419248"
doc = htmlTreeParse(url, useInternalNodes = TRUE)

# extract nodes containing desired information
xp_expr = "//table[@class= 'names synonyms']/tbody/tr"
nodes = getNodeSet(doc, xp_expr)

# function to extract desired fields from a given node
fields = list('genus', 'species', 'infraspe', 'authorship')
read_node = function(node){
  dl = lapply(fields, function(x) xpathSApply(node,
    paste(".//*[@class = ", "'", x, "'", "]", sep = ""), xmlValue))
  # fields that are missing from a row stay as a single space
  tmp = rep(' ', length(dl))
  tmp[sapply(dl, length) == 1] = unlist(dl)
  confidence = xpathSApply(node, './/img', xmlGetAttr, 'alt')
  return(c(tmp, confidence))
}

# apply function to all nodes and return data frame
df = ldply(nodes, read_node)
names(df) = c(fields, 'confidence')
It produces the following output:
  genus     species      infraspe     authorship               confidence
1 Critesion chilense                  (Roem. & Schult.) Á.Löve H
2 Hordeum   chilense     chilense                              L
3 Hordeum   cylindricum               Steud.                   H
4 Hordeum   depauperatum              Steud.                   H
5 Hordeum   pratense     brongniartii Macloskie                L
6 Hordeum   secalinum    chilense     É.Desv.                  L
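The row-by-row idea can also be tried offline. Below is a minimal sketch, assuming a made-up two-row table whose class names follow the ones used in this thread; it uses base R instead of plyr and records NA for fields a row lacks, which keeps the gaps the question was worried about. The markup and the helpers first_or_na and read_row are hypothetical, not part of the XML package.

```r
library(XML)

# Hypothetical markup (an assumption), not copied from the real page.
html <- '<table class="names synonyms"><tbody>
<tr><td><i class="genus">Hordeum</i> <i class="species">chilense</i> <span class="infraspr">var.</span> <i class="infraspe">chilense</i></td><td><img alt="L"/></td></tr>
<tr><td><i class="genus">Hordeum</i> <i class="species">cylindricum</i> <span class="authorship">Steud.</span></td><td><img alt="H"/></td></tr>
</tbody></table>'
doc <- htmlParse(html, asText = TRUE)
rows <- getNodeSet(doc, "//table[@class='names synonyms']//tr")

fields <- c("genus", "species", "infraspr", "infraspe", "authorship")

# Text of the first match for a relative XPath inside one row, or NA if none.
first_or_na <- function(node, path) {
  hit <- xpathSApply(node, path, xmlValue)
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

# One named character vector per row, always the same length.
read_row <- function(node) {
  vals <- vapply(fields,
                 function(f) first_or_na(node, sprintf(".//*[@class='%s']", f)),
                 character(1))
  conf <- xpathSApply(node, ".//img", xmlGetAttr, "alt")
  c(vals, confidence = if (length(conf) == 0) NA_character_ else conf[[1]])
}

# Stack the rows into a character matrix, then convert to a data frame.
synonyms <- as.data.frame(do.call(rbind, lapply(rows, read_row)),
                          stringsAsFactors = FALSE)
```

Because every row yields a vector of the same length, rbind lines the columns up, and a row without an infraspecies rank simply gets NA in that column.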
Upvotes: 5
Reputation: 179418
The following code parses your table into a matrix.
Caveats:
The code:
library(XML)
library(RCurl)
baseURL <- "http://www.theplantlist.org/tpl/record/kew-419248"
txt <- getURL(url=baseURL)
xmltext <- htmlParse(txt, asText=TRUE)
xmltable <- xpathApply(xmltext, "//table//tbody//tr")
t(sapply(xmltable, function(x)unname(xmlSApply(x, xmlValue))[c(1, 3, 5, 7)]))
The results:
     [,1]                                           [,2]      [,3] [,4]
[1,] "Critesion chilense (Roem. & Schult.) Á.Löve"  "Synonym" ""   "WCSP"
[2,] "Hordeum chilense var. chilense "              "Synonym" ""   "TRO"
[3,] "Hordeum cylindricum Steud. [Illegitimate]"    "Synonym" ""   "WCSP"
[4,] "Hordeum depauperatum Steud."                  "Synonym" ""   "WCSP"
[5,] "Hordeum pratense var. brongniartii Macloskie" "Synonym" ""   "WCSP"
[6,] "Hordeum secalinum var. chilense É.Desv."      "Synonym" ""   "WCSP"
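Since the question asks for a data frame, a character matrix like this one can be converted in a few lines. This is a rough sketch on two rows copied from the result above; the column names are assumptions chosen for illustration, not taken from the page.

```r
# Two rows copied from the matrix above; column 3 (the confidence image cell)
# contains no text, so it comes through as an empty string.
m <- rbind(
  c("Hordeum chilense var. chilense ", "Synonym", "", "TRO"),
  c("Hordeum depauperatum Steud.", "Synonym", "", "WCSP")
)

# Column names are assumptions made for this sketch.
synonyms <- as.data.frame(m, stringsAsFactors = FALSE)
names(synonyms) <- c("name", "status", "confidence", "source")

# Tidy up: drop stray whitespace and mark empty cells as missing.
synonyms$name <- trimws(synonyms$name)
synonyms$confidence[synonyms$confidence == ""] <- NA
```

The confidence level lives in the image's alt attribute rather than in the cell text, so it would have to be read separately (for example with xmlGetAttr) if it is needed.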
Upvotes: 2