Maria Oliveira

Reputation: 3

Trying to web-scrape multiple links in R, but no idea how

I'm a total newbie, and I'm trying to scrape this site to get all the editions from every year.

I've been using rvest and SelectorGadget, but I'm not getting anywhere. Any advice on this, please?

library(rvest)
library(purrr)
library(xml2)
library(textreadr)

url_base <- "https://rss.onlinelibrary.wiley.com/toc/14679868/2018/80/%d"
royal2018 <- map_df(1:5, function(i) {
  page <- read_html(sprintf(url_base, i))
  data.frame(
    VolumeID = html_text(html_nodes(page, ".loi-tab-item")),
    IssueID  = html_text(html_nodes(page, ".visitable")),
    Heading  = html_text(html_nodes(page, ".issue-items-container+ .issue-items-container h2")),
    Author   = html_text(html_nodes(page, " .author-style")),
    DOI      = html_text(html_nodes(page, ".epub-doi"))
  )
})

Upvotes: 0

Views: 109

Answers (1)

s__

Reputation: 9525

Welcome to SO!

The second URL seems OK, so here are some hints to get you started. I don't know exactly what you'd like to extract, so this is a general approach.

First, use SelectorGadget to find the selectors for the parts you'd like to scrape, then proceed like this:

library(rvest)   # also provides the %>% pipe

# your url
url <- "http://www.biometria.ufla.br/index.php/BBJ/issue/archive"

# get all the links in the page
pages_data <- url %>%
  read_html() %>%
  html_nodes('.title') %>%
  html_attr('href')
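
If the archive page returns relative links, or the same link more than once, a small cleanup step before looping may help. This is a minimal sketch, assuming rvest and xml2 are installed; the inline HTML and the example.org URLs are made up for illustration and only mimic the `.title` links selected above:

```r
library(rvest)  # also re-exports %>%
library(xml2)

# Hypothetical stand-in for the archive page, so the example runs offline
archive_html <- '
<div>
  <a class="title" href="/issue/view/1">Vol. 1</a>
  <a class="title" href="/issue/view/2">Vol. 2</a>
  <a class="title" href="/issue/view/1">Vol. 1 (repeated)</a>
</div>'

base_url <- "https://example.org/archive"   # assumed base URL

links <- read_html(archive_html) %>%
  html_nodes(".title") %>%
  html_attr("href")

# Resolve relative paths against the base URL and drop duplicates
pages_clean <- unique(url_absolute(links, base_url))
pages_clean
```

On the real site, the same two steps would be applied to `pages_data`, with the archive URL as the base.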

Now, for each page, you can fetch what you need:

# titles
titles <- list()                # empty list
for (i in pages_data[1:2]) {    # remove the [1:2] to get all the links
  titles[[i]] <- i %>%
    read_html() %>%
    html_nodes('.media-heading a') %>%
    html_text()
  Sys.sleep(10)                 # pause so you don't send too many requests in a short time
}

For the authors:

authors <- list()
for (i in pages_data[1:2]) {
  authors[[i]] <- i %>%
    read_html() %>%
    html_nodes('.authors') %>%
    html_text()
  Sys.sleep(10)
}

And so on. You can then combine the results however you like and clean them up.
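
One possible way to do the combining: the loops above store one character vector per link, so the vectors can be stacked into a single long data frame with the link as an extra column. This is a minimal sketch in base R with made-up values standing in for the scraped `titles` and `authors` lists; `combine_page` is a hypothetical helper, and the code assumes every page yields the same number of titles and author strings, which should be checked on the real data:

```r
# Made-up results shaped like the lists built in the loops above:
# one character vector per page, named by the page link
titles <- list(
  "https://example.org/issue/1" = c("  Paper A \n", "Paper B"),
  "https://example.org/issue/2" = c("Paper C")
)
authors <- list(
  "https://example.org/issue/1" = c("Ann, Bob", "\tCarl "),
  "https://example.org/issue/2" = c("Dana")
)

# Build one data frame per page, then stack them
combine_page <- function(link) {
  # Guard: pairing by position only makes sense if the lengths match
  stopifnot(length(titles[[link]]) == length(authors[[link]]))
  data.frame(
    issue   = link,
    title   = trimws(titles[[link]]),   # strip surrounding whitespace
    authors = trimws(authors[[link]]),
    stringsAsFactors = FALSE
  )
}

articles <- do.call(rbind, lapply(names(titles), combine_page))
rownames(articles) <- NULL
articles
```

If the length check fails on a real page, the selectors are probably matching something other than one node per article, and it may be safer to select each article's container first and extract the title and authors inside it.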

Upvotes: 1
