Reputation: 3973
I would like to scrape the data from this site without losing the information in the nested structure. Consider the name benodanil, which belongs not only to benzanilide fungicides, but also to anilide fungicides and amide fungicides. There are not always three classes; there is at least one, and there can be many. Ideally, I'd want a data.frame that looks like this:
name | class1 | class2 | class3 | ... |
---|---|---|---|---|
benodanil | benzanilide fungicides | anilide fungicides | amide fungicides | NA |
aureofungin | antibiotic fungicides | NA | NA | NA |
... | ... | ... | ... | ... |
I can scrape the data, but can't wrap my head around how to handle the information in the nested structure. What I tried so far:
require(rvest)
url = 'http://www.alanwood.net/pesticides/class_fungicides.html'
site = read_html(url)
# extract list items
li = html_nodes(site, 'li')
# extract unordered lists
ul = html_nodes(site, 'ul')
# loop idea: collect the href and name attributes of the links in each list item
l = list()
for (i in seq_along(li)) {
  li1 = html_nodes(li[i], 'a')
  name = na.omit(unique(html_attr(li1, 'href')))
  clas = na.omit(unique(html_attr(li1, 'name')))
  l[[i]] = list(name = name,
                clas = clas)
}
An additional problem is that some names occur more than once, such as bixafen. Hence, I guess the job has to be done iteratively.
Upvotes: 0
Views: 150
Reputation: 388862
library(dplyr)
library(tidyr)
library(rvest)
url = 'http://www.alanwood.net/pesticides/class_fungicides.html'
site = read_html(url)
a <- site %>% html_nodes('li ul a')
tibble(name = a %>% html_attr('href'),
       class = a %>% html_attr('name')) %>%
  fill(class) %>%
  filter(!is.na(name)) %>%
  mutate(name = sub('\\.html', '', name)) %>%
  group_by(name) %>%
  mutate(col = paste0('class', row_number())) %>%
  pivot_wider(names_from = col, values_from = class) %>%
  ungroup()
# A tibble: 189 x 4
# name class1 class2 class3
# <chr> <chr> <chr> <chr>
# 1 benalaxyl acylamino_acid_fungici… anilide_fungicides NA
# 2 benalaxyl-m acylamino_acid_fungici… anilide_fungicides NA
# 3 furalaxyl acylamino_acid_fungici… furanilide_fungicides NA
# 4 metalaxyl acylamino_acid_fungici… anilide_fungicides NA
# 5 metalaxyl-m acylamino_acid_fungici… anilide_fungicides NA
# 6 pefurazoate acylamino_acid_fungici… NA NA
# 7 valifenalate acylamino_acid_fungici… NA NA
# 8 bixafen anilide_fungicides picolinamide_fungici… pyrazolecarboxamide_fungic…
# 9 boscalid anilide_fungicides NA NA
#10 carboxin anilide_fungicides NA NA
# … with 179 more rows
Extract `name` and `class` from the webpage, `fill` the `NA` values in `class` with the previous non-NA value, drop the rows where `name` is `NA`, and reshape the data to wide format with one `class` column per occurrence of a name.
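To see what each step does without hitting the website, here is a minimal offline sketch of the same fill-then-widen idea. The `anchors` tibble is hypothetical hand-made data standing in for the scraped links; it assumes that a class heading anchor carries a `name` attribute but no `href`, and that a compound link carries an `href` but no `name`.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the scraped anchors (not real page content):
# heading rows have only `class`, compound rows have only `name`.
anchors <- tibble(
  name  = c(NA, "benodanil.html", "bixafen.html", NA, "bixafen.html"),
  class = c("anilide_fungicides", NA, NA, "pyrazolecarboxamide_fungicides", NA)
)

wide <- anchors %>%
  fill(class) %>%                                  # carry each heading down to the rows below it
  filter(!is.na(name)) %>%                         # drop the heading rows themselves
  mutate(name = sub("\\.html$", "", name)) %>%     # strip the file extension
  group_by(name) %>%
  mutate(col = paste0("class", row_number())) %>%  # number each occurrence of a name
  ungroup() %>%
  pivot_wider(names_from = col, values_from = class)

print(wide)
```

In this sketch `bixafen` appears twice, so it gets both `class1` and `class2`, while `benodanil` gets `NA` in `class2`. Note that `fill` attaches only the nearest preceding heading to each row, so a compound picks up a class only where it is listed directly beneath that heading.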
Upvotes: 2