Reputation: 3
I'm a total newbie and I'm trying to web scrape this site to get all editions from all the years.
I've been using rvest
and SelectorGadget, but so far with no luck. Any advice on this, please?
library(rvest)
library(purrr)
library(xml2)
library(textreadr)

url_base <- "https://rss.onlinelibrary.wiley.com/toc/14679868/2018/80/%d"

map_df(1:5, function(i) {
  page <- read_html(sprintf(url_base, i))
  data.frame(VolumeID = html_text(html_nodes(page, ".loi-tab-item")),
             IssueID  = html_text(html_nodes(page, ".visitable")),
             Heading  = html_text(html_nodes(page, ".issue-items-container+ .issue-items-container h2")),
             Author   = html_text(html_nodes(page, ".author-style")),
             DOI      = html_text(html_nodes(page, ".epub-doi")))
}) -> royal2018
Upvotes: 0
Views: 109
Reputation: 9525
Welcome to SO!
The second URL looks fine, so here are some hints to get you started. I don't know exactly what you'd like to extract, so let's say you want some basic info from each issue.
First, use SelectorGadget to find the parts you'd like to scrape, then proceed like this:
library(rvest)  # rvest also re-exports the %>% pipe

# your url
url <- "http://www.biometria.ufla.br/index.php/BBJ/issue/archive"

# get all the issue links on the archive page
pages_data <- url %>%
  read_html() %>%
  html_nodes('.title') %>%
  html_attr('href')
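If the href attributes come back as relative paths (that depends on how the page is built), you can resolve them against the archive URL before looping over them, for example:
# if the scraped hrefs are relative (site-dependent), resolve them against the base URL
pages_data <- xml2::url_absolute(pages_data, url)
head(pages_data)  # quick look at the first few issue links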
Now, for each page, you can fetch what you need:
# titles
titles <- list()  # empty list
for (i in pages_data[1:2]) {  # remove [1:2] to loop over all the links
  titles[[i]] <- i %>%
    read_html() %>%
    html_nodes('.media-heading a') %>%
    html_text()
  Sys.sleep(10)  # be polite: don't send too many requests in a short time
}
For the authors:
authors <- list()
for (i in pages_data[1:2]) {
  authors[[i]] <- i %>%
    read_html() %>%
    html_nodes('.authors') %>%
    html_text()
  Sys.sleep(10)
}
And so on. Now you can combine them however you want and clean them up; a minimal sketch is below.
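For example, here is one way to combine the two lists into a single data frame (this assumes titles[[i]] and authors[[i]] line up one element per article for each issue, which you should check, and royal_articles is just a name I made up):
# combine the scraped lists: one row per article, keyed by the issue URL
# (assumes titles[[i]] and authors[[i]] have the same length for each issue)
royal_articles <- do.call(rbind, lapply(names(titles), function(i) {
  data.frame(issue_url = i,
             title     = titles[[i]],
             author    = authors[[i]],
             stringsAsFactors = FALSE)
}))
From there, trimming stray whitespace in the author strings with trimws() is usually the next cleaning step.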
Upvotes: 1