andschar

Reputation: 3973

Scrape nested html structure

I would like to scrape the data from this site without losing the information in its nested structure. Consider the name benodanil, which belongs not only to the benzanilide fungicides but also to the anilide fungicides and the amide fungicides. The number of classes is not always three: each name has at least one class and may have many. Ideally, I want a data.frame that looks like this:

name         class1                  class2              class3            ...
benodanil    benzanilide fungicides  anilide fungicides  amide fungicides  NA
aureofungin  antibiotic fungicides   NA                  NA                NA
...          ...                     ...                 ...

I can scrape the data, but can't wrap my head around how to handle the information in the nested structure. What I tried so far:

library(rvest)

url = 'http://www.alanwood.net/pesticides/class_fungicides.html'

site = read_html(url)
# extract list items
li = html_nodes(site, 'li')
# extract unordered lists
ul = html_nodes(site, 'ul')

# loop idea
l = list()
for (i in seq_along(li)) {
  li1 = html_nodes(li[i], 'a')
  name = na.omit(unique(html_attr(li1, 'href')))
  clas = na.omit(unique(html_attr(li1, 'name')))
  
  l[[i]] = list(name = name,
                clas = clas)
}

An additional problem is that some names, such as bixafen, occur more than once. Hence, I guess the job has to be done iteratively.

Upvotes: 0

Views: 150

Answers (1)

Ronak Shah

Reputation: 388862

library(dplyr)
library(tidyr)
library(rvest)

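url <- 'http://www.alanwood.net/pesticides/class_fungicides.html'

site <- read_html(url)
# all links inside nested lists: class headings carry a name attribute,
# compound links carry an href attribute
a <- site %>% html_nodes('li ul a')

tibble(name = a %>% html_attr('href'), 
       class = a %>% html_attr('name')) %>%
  # carry each class heading down to the compound rows below it
  fill(class) %>%
  # drop the class-heading rows, which have no href
  filter(!is.na(name)) %>%
  mutate(name = sub('\\.html', '', name)) %>%
  group_by(name) %>%
  # number the repeated occurrences of a name: class1, class2, ...
  mutate(col = paste0('class', row_number())) %>%
  pivot_wider(names_from = col, values_from = class) %>%
  ungroup()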

# A tibble: 189 x 4
#   name         class1                  class2                class3                     
#   <chr>        <chr>                   <chr>                 <chr>                      
# 1 benalaxyl    acylamino_acid_fungici… anilide_fungicides    NA                         
# 2 benalaxyl-m  acylamino_acid_fungici… anilide_fungicides    NA                         
# 3 furalaxyl    acylamino_acid_fungici… furanilide_fungicides NA                         
# 4 metalaxyl    acylamino_acid_fungici… anilide_fungicides    NA                         
# 5 metalaxyl-m  acylamino_acid_fungici… anilide_fungicides    NA                         
# 6 pefurazoate  acylamino_acid_fungici… NA                    NA                         
# 7 valifenalate acylamino_acid_fungici… NA                    NA                         
# 8 bixafen      anilide_fungicides      picolinamide_fungici… pyrazolecarboxamide_fungic…
# 9 boscalid     anilide_fungicides      NA                    NA                         
#10 carboxin     anilide_fungicides      NA                    NA                         
# … with 179 more rows

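Extract the name (href attribute) and class (name attribute) of each link, fill the NA classes with the previous non-NA value, drop the rows whose name is NA (the class headings), and reshape the data to wide format so that each name has one row with columns class1, class2, and so on.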
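
To see what each step does without hitting the network, here is a minimal sketch on a made-up HTML fragment. The fragment and its class names are hypothetical and only imitate the layout that the code above implies (class headings as anchors with a name attribute, compounds as anchors with an href); the real page may differ in detail. Because the toy list is not wrapped in an outer list item, the sketch selects plain 'a' instead of 'li ul a'.

```r
library(dplyr)
library(tidyr)
library(rvest)

# Hypothetical miniature of the page, NOT the real content
html <- '
<ul>
  <li><a name="amide_fungicides">amide fungicides</a>
    <ul>
      <li><a href="carpropamid.html">carpropamid</a></li>
      <li><a name="anilide_fungicides">anilide fungicides</a>
        <ul>
          <li><a href="boscalid.html">boscalid</a></li>
          <li><a href="bixafen.html">bixafen</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a name="pyrazole_fungicides">pyrazole fungicides</a>
    <ul>
      <li><a href="bixafen.html">bixafen</a></li>
    </ul>
  </li>
</ul>'

# Anchors come back in document order: heading, compound, heading, ...
a <- read_html(html) %>% html_nodes('a')

res <- tibble(name  = a %>% html_attr('href'),
              class = a %>% html_attr('name')) %>%
  fill(class) %>%                                  # heading carried down to its compounds
  filter(!is.na(name)) %>%                         # remove the heading rows themselves
  mutate(name = sub('\\.html', '', name)) %>%
  group_by(name) %>%
  mutate(col = paste0('class', row_number())) %>%  # bixafen occurs twice: class1, class2
  pivot_wider(names_from = col, values_from = class) %>%
  ungroup()

res
```

In this toy, bixafen ends up with two classes because it is listed twice, while boscalid gets only anilide_fungicides: fill() assigns the nearest preceding heading, so a parent class shows up for a compound only if the compound is also listed under another heading.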

Upvotes: 2
