tom

Reputation: 33

Scraping html table with images using XML R package

I want to scrape HTML tables using the XML package for R, in a way similar to the one discussed in this thread:

Scraping html tables into R data frames using the XML package

The main difference is that I also want the text relating to an image in the HTML table. For example, the table at http://www.theplantlist.org/tpl/record/kew-422570 contains a "Confidence" column with an image showing one to three stars. If I use:

readHTMLTable("http://www.theplantlist.org/tpl/record/kew-422570")

then the output column for "Confidence" is blank apart from the header. Is there any way to get some form of data in this column, for example the HTML code linking to the appropriate image?

Any suggestions of how to go about this would be much appreciated!

Upvotes: 3

Views: 2029

Answers (3)

Chris S.

Reputation: 2225

You could also use the elFun argument of readHTMLTable to extract the image's alt attribute, following section 5.2.2.1 of the XML book. (I had to add ... to the function signature to avoid an unused-argument error.)

library(XML)

# Return the alt text for cells that contain an <img>, otherwise the cell text
getCL <- function(node, ...) {
  if (xmlName(node) == "td" && !is.null(node[["img"]]))
    xmlGetAttr(node[["img"]], "alt")
  else
    xmlValue(node)
}

url <- "http://www.theplantlist.org/tpl/record/kew-422570"
readHTMLTable(url, which=1, elFun = getCL)

                                                Name  Status Confi­-dence level Source
1                                Elymus arenarius L. Synonym                 H   WCSP
2 Elymus arenarius subsp. geniculatus (Curtis) Husn. Synonym                 L    TRO
3                Elymus geniculatus Curtis [Invalid] Synonym                 H   WCSP
4              Frumentum arenarium (L.) E.H.L.Krause Synonym                 H   WCSP
5                       Hordeum arenarium (L.) Asch. Synonym                 H   WCSP
6                            Hordeum villosum Moench Synonym                 H   WCSP
7                    Triticum arenarium (L.) F.Herm. Synonym                 H   WCSP
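
For anyone who wants to try the elFun approach without hitting the site, here is a minimal offline sketch. The HTML string is a made-up stand-in for the real page (its exact markup is an assumption on my part), and getCL is repeated so the block runs by itself.

```r
library(XML)

# Made-up stand-in for the real page; the actual markup may differ
html_txt <- paste0(
  '<table>',
  '<tr><th>Name</th><th>Status</th><th>Confidence</th><th>Source</th></tr>',
  '<tr><td>Elymus arenarius L.</td><td>Synonym</td>',
  '<td><img src="/img/H.png" alt="H"/></td><td>WCSP</td></tr>',
  '<tr><td>Elymus arenarius subsp. geniculatus (Curtis) Husn.</td><td>Synonym</td>',
  '<td><img src="/img/L.png" alt="L"/></td><td>TRO</td></tr>',
  '</table>'
)

# Same cell handler as above: alt text for image cells, cell text otherwise
getCL <- function(node, ...) {
  if (xmlName(node) == "td" && !is.null(node[["img"]]))
    xmlGetAttr(node[["img"]], "alt")
  else
    xmlValue(node)
}

# Parse the string instead of a URL, then read the first table
doc <- htmlParse(html_txt, asText = TRUE)
tbl <- readHTMLTable(doc, which = 1, elFun = getCL)
print(tbl)
```

The same handler could return the src attribute instead of alt if the image path is more useful than the label.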

Upvotes: 1

hrbrmstr

Reputation: 78792

Here's an rvest solution with an even simpler CSS selector:

library(rvest)

# read_html() replaces html(), which is deprecated in newer rvest releases
pg <- read_html("http://www.theplantlist.org/tpl/record/kew-422570")
pg %>% html_nodes("td > img") %>% html_attr("src")

## [1] "/img/H.png" "/img/L.png" "/img/H.png" "/img/H.png" "/img/H.png"
## [6] "/img/H.png" "/img/H.png"

Upvotes: 4

Greg

Reputation: 11764

I was able to find the XPath query for the image name using SelectorGadget:

library(XML)
library(RCurl)
d = htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-422570"))
path = '//*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img'

xpathSApply(d, path, xmlAttrs)["src",]

[1] "/img/H.png" "/img/L.png" "/img/H.png" "/img/H.png" "/img/H.png"
[6] "/img/H.png" "/img/H.png"
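
If the goal is to get those values back into the table, one possible follow-up is to turn each image path into a short label and assign it to the empty column. This is only a sketch: the HTML string is a made-up stand-in for the real page, and the "Confidence" column name and the shortened XPath are my assumptions.

```r
library(XML)

# Made-up stand-in for the real page; the actual markup may differ
html_txt <- paste0(
  '<table class="synonyms">',
  '<tr><th>Name</th><th>Status</th><th>Confidence</th><th>Source</th></tr>',
  '<tr><td>Elymus arenarius L.</td><td>Synonym</td>',
  '<td><img src="/img/H.png" alt="H"/></td><td>WCSP</td></tr>',
  '<tr><td>Elymus arenarius subsp. geniculatus (Curtis) Husn.</td><td>Synonym</td>',
  '<td><img src="/img/L.png" alt="L"/></td><td>TRO</td></tr>',
  '</table>'
)
doc <- htmlParse(html_txt, asText = TRUE)

# One src attribute per image inside the table
src <- xpathSApply(doc, "//table[@class='synonyms']//img", xmlGetAttr, "src")

# Strip the directory and the extension: "/img/H.png" becomes "H"
level <- sub("\\.png$", "", basename(src))

# Read the table as usual, then overwrite the blank image column
tbl <- readHTMLTable(doc, which = 1)
tbl[["Confidence"]] <- level
print(tbl)
```

This relies on every table row containing exactly one image; if some rows have none, the lengths will not match and the assignment will fail.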

Upvotes: 4
