Reputation: 25
Apologies if I have missed a previous topic on this matter. I want to scrape this website: http://www.fao.org/countryprofiles/en/. In particular, the page contains many links to country information pages, whose URLs have this structure:
http://www.fao.org/countryprofiles/index/en/?iso3=KAZ
http://www.fao.org/countryprofiles/index/en/?iso3=AFG
Each of these pages includes a News section that I am interested in. Of course, I could scrape them page by page, but that would be a waste of time.
I tried the following, but it is not working:
library(rvest)

# country link texts from the top-level page
countries <- read_html("http://www.fao.org/countryprofiles/en/") %>%
  html_nodes(".linkcountry") %>%
  html_text()

country_news <- list()
sub <- html_session("http://www.fao.org/countryprofiles/en/")

# follow each country link by its text and grab the News section
for (i in countries[1:100]) {
  page <- sub %>%
    follow_link(i) %>%
    read_html()
  country_news[[i]] <- page %>%
    html_nodes(".white-box") %>%
    html_text()
}
Any idea?
Upvotes: 0
Views: 72
Reputation: 34763
You can get all of the child pages from the top-level page:
stem = 'http://www.fao.org'
top_level = paste0(stem, '/countryprofiles/en/')
# select the href attribute of every link whose URL contains "?iso3="
all_children = read_html(top_level) %>%
  # ? and = are required to skip /iso3list/en/
  html_nodes(xpath = '//a[contains(@href, "?iso3=")]/@href') %>%
  html_text %>% paste0(stem, .)
head(all_children)
# [1] "http://www.fao.org/countryprofiles/index/en/?iso3=AFG"
# [2] "http://www.fao.org/countryprofiles/index/en/?iso3=ALB"
# [3] "http://www.fao.org/countryprofiles/index/en/?iso3=DZA"
# [4] "http://www.fao.org/countryprofiles/index/en/?iso3=AND"
# [5] "http://www.fao.org/countryprofiles/index/en/?iso3=AGO"
# [6] "http://www.fao.org/countryprofiles/index/en/?iso3=ATG"
If you are not comfortable with XPath, the CSS version would be:
all_children = read_html(top_level) %>%
  html_nodes('a') %>% html_attr('href') %>%
  grep("?iso3=", ., value = TRUE, fixed = TRUE) %>% paste0(stem, .)
Now you can loop over those pages and extract what you want.
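For example, here is a minimal sketch of that loop, reusing the .white-box selector from the question (verify that it actually matches the News section on the country pages):

library(rvest)

# fetch each child page and pull out the News section text
country_news = lapply(all_children, function(url) {
  read_html(url) %>%
    html_nodes('.white-box') %>%  # selector taken from the question; adjust if needed
    html_text(trim = TRUE)
})
# label each result with its ISO3 code, extracted from the URL
names(country_news) = sub('.*iso3=', '', all_children)

With roughly 200 requests, it is also polite to add a short Sys.sleep() inside the function so you do not hammer the server.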
Upvotes: 1