Scrape nested html structure

Question

I would like to scrape the data from this site without losing the information in the nested structure. Consider the name benodanil, which belongs not only to the benzanilide fungicides but also to the anilide fungicides and the amide fungicides. The number of classes is not always three: each name has at least one class and possibly many. Ideally, I want a data.frame that looks like this:

name	class1	class2	class3	...
benodanil	benzanilide fungicides	anilide fungicides	amide fungicides	NA
aureofungin	antibiotic fungicides	NA	NA	NA
...	...	...	...	...

I can scrape the data, but can't wrap my head around how to handle the information in the nested structure. What I tried so far:

library(rvest)

url = 'http://www.alanwood.net/pesticides/class_fungicides.html'

site = read_html(url)
# extract all list items (<li>)
li = html_nodes(site, 'li')
# extract all unordered lists (<ul>)
ul = html_nodes(site, 'ul')

# loop idea: collect the href and name attributes of the links inside each <li>
l = list()
for (i in seq_along(li)) {
  li1 = html_nodes(li[i], 'a')
  name = na.omit(unique(html_attr(li1, 'href')))
  clas = na.omit(unique(html_attr(li1, 'name')))
  
  l[[i]] = list(name = name,
                clas = clas)
}

An additional problem is that some names, such as bixafen, occur more than once. Hence, I guess the job has to be done iteratively.
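
One possible way to keep the nesting, sketched here under assumptions rather than verified against the live page: for every compound link, walk up to its enclosing `<li>` ancestors and read the class label from each. The inline HTML below is a hypothetical miniature of the page; it assumes that class labels are `<a name="...">` elements and compounds are `<a href="...">` elements sitting directly inside an `<li>`. Only the `rvest` functions already used above (`read_html`, `html_nodes`) plus `html_text` are needed.

```r
library(rvest)

# Hypothetical miniature of the page's nesting (an assumption about the markup)
html <- '<ul>
<li><a name="amide_fungicides">amide fungicides</a>
 <ul>
 <li><a name="anilide_fungicides">anilide fungicides</a>
  <ul>
  <li><a name="benzanilide_fungicides">benzanilide fungicides</a>
   <ul><li><a href="benodanil.html">benodanil</a></li></ul>
  </li>
  </ul>
 </li>
 </ul>
</li>
<li><a name="antibiotic_fungicides">antibiotic fungicides</a>
 <ul><li><a href="aureofungin.html">aureofungin</a></li></ul>
</li>
</ul>'

site <- read_html(html)

# every compound is assumed to be a link with an href directly inside an <li>
compounds <- html_nodes(site, "li > a[href]")

rows <- lapply(compounds, function(node) {
  # class labels of all enclosing <li> elements, outermost first
  anc <- html_nodes(node, xpath = "ancestor::li/a[@name]")
  # reverse so that the most specific class comes first
  c(html_text(node), rev(html_text(anc)))
})

# pad every row with NA up to the deepest nesting level, then bind the rows
n <- max(lengths(rows))
mat <- t(sapply(rows, function(r) { length(r) <- n; r }))
df <- as.data.frame(mat, stringsAsFactors = FALSE)
names(df) <- c("name", paste0("class", seq_len(n - 1)))
df
```

Because each link is handled on its own, a name that appears in several places (bixafen, for example) would simply produce one row per occurrence, so no explicit bookkeeping of duplicates is needed in this sketch. If the real markup differs from the assumed one, the CSS selector and the XPath expression would have to be adapted.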
