Alberto

Reputation: 41

Getting the full link with rvest href

I am trying to scrape multiple pages with rvest. However, the link I get through html_attr("href") is incomplete. The initial part of the link changes across pages in a way that I cannot work out. Is there a way to get the full link? Thank you.

These are two examples from the website. The part that changes across pages seems to be "/sk####". (I am interested in the links to the Relazione and Testo articoli pages.)

http://leg14.camera.it/_dati/leg14/lavori/stampati/sk5000/frontesp/4543.htm

http://leg14.camera.it/_dati/leg14/lavori/stampati/sk4500/frontesp/4477.htm


df <- structure(list(date = c(20010618L, 20010618L, 20010618L, 20010618L, 
                        20010618L), link = c("http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=814", 
                                                  "http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=858", 
                                                  "http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=875", 
                                                  "http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=802", 
                                                  "http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=816"
                        )), row.names = c(NA, 5L), class = "data.frame")


library(rvest)   # read_html(), html_nodes(), html_attr(), %>%
library(pbapply) # pbsapply()

df$linkfinal <- pbsapply(df$link, function(x) {
  tryCatch({
    x %>%
      read_html() %>%
      html_nodes('td+ td a') %>%
      html_attr("href") %>% 
      toString()
  }, error = function(e) NA)
})

Upvotes: 0

Views: 515

Answers (1)

QHarr

Reputation: 84475

You simply need to capture the redirect URL (use httr), then swap the string frontesp for either articola or relazion. If you first use these same substrings to test whether an href containing them is present on the page, you can use ifelse to either do the URL substitution described above or return NA.
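The substitution step on its own can be sketched offline, without any network call. This is a minimal illustration in base R: the redirect URL is one of the example URLs from the question, while the hrefs in found_href are hypothetical values standing in for whatever html_attr("href") returns on the real page.

```r
# Redirect target, copied from the example URLs in the question.
redirect_url <- "http://leg14.camera.it/_dati/leg14/lavori/stampati/sk5000/frontesp/4543.htm"

# Hypothetical hrefs found on the page (assumed values); NA marks a missing link.
found_href <- c(articola = "../articola/4543.htm", relazion = NA)

# Build the candidate URL for each section by swapping out "frontesp".
candidates <- vapply(
  names(found_href),
  function(s) sub("frontesp", s, redirect_url, fixed = TRUE),
  character(1)
)

# Keep the candidate only where the corresponding href was actually present.
full_links <- ifelse(is.na(found_href), NA_character_, candidates)
names(full_links) <- names(found_href)
print(full_links)
```

Here the articola entry becomes a full URL and the relazion entry stays NA because no matching href was found.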

There are faster ways of applying a function when working with a large number of rows. I was just interested in this approach after reading about it here: https://blog.az.sg/posts/map-and-walk/.

library(tidyverse)
library(httr)
library(rvest) # read_html(), html_element(), html_attr(); not attached by library(tidyverse)

df <- structure(list(date = c(
  20010618L, 20010618L, 20010618L, 20010618L,
  20010618L
), link = c(
  "http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=814",
  "http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=858",
  "http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=875",
  "http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=802",
  "http://leg14.camera.it/_dati/leg14/lavori/schedela/trovastampatopdl.asp?pdl=816"
)), row.names = c(NA, 5L), class = "data.frame")


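# Return the full URL for a section ("articola" or "relazion"), or NA when the
# page has no link whose href contains that substring.
get_link <- function(url, page, url_sub_string) {
  link <- page %>%
    html_element(sprintf("[href*=%s]", url_sub_string)) %>%
    html_attr("href")
  # If the link exists, swap "frontesp" in the redirected URL for the section name.
  link <- ifelse(is.na(link), link, gsub("frontesp", url_sub_string, url))
  return(link)
}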

df <- df %>%
  pmap_dfr(function(...) {
    current <- tibble(...)
    r <- GET(current$link) # httr follows the redirect
    page <- r %>% read_html()
    redirect_link <- r$url # final URL after the redirect
    current %>%
      mutate(
        articola = get_link(redirect_link, page, "articola"),
        relazion = get_link(redirect_link, page, "relazion")
      )
  })
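
As a possible alternative to the string substitution, a relative href can be resolved against the redirected URL with xml2::url_absolute(). The sketch below assumes the hrefs on the page are relative paths of the form shown; those two values are hypothetical and were not checked against the live site.

```r
library(xml2)

# Final URL after the redirect, taken from the examples in the question.
base_url <- "http://leg14.camera.it/_dati/leg14/lavori/stampati/sk5000/frontesp/4543.htm"

# Hypothetical relative hrefs, as html_attr("href") might return them.
relative_hrefs <- c("../articola/4543.htm", "../relazion/4543.htm")

# Resolve each relative href against the base URL.
full_links <- url_absolute(relative_hrefs, base_url)
print(full_links)
```

This avoids hard-coding the frontesp substring, but it only helps if the hrefs really are relative to the redirected page.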

Upvotes: 1
