Alberto

Reputation: 41

Scraping links in df columns with rvest

I have a dataframe where one of the columns contains links to webpages I want to scrape with rvest. From each page I want to extract some links, store them in another column, and then download text from those linked pages. I tried to do this with lapply, but at the second step I get Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "function". Maybe the problem is that the first set of links is saved as a list? Do you know how I can solve it?

This is my MWE. (In my full dataset I have around 5000 links; should I use Sys.sleep, and if so, how?)

library(rvest)

df <- structure(list(numeroAtto = c("2855", "2854", "327", "240", "82"
), testo = c("http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.2855.18PDL0127540", 
             "http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.327.18PDL0003550",
             "http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.327.18PDL0003550", 
             "http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.240.18PDL0007740", 
             "http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.82.18PDL0001750"
)), row.names = c(NA, 5L), class = "data.frame")

df$links_text <- lapply(df$testo, function(x) {
  page <- read_html(x)
  links <- html_nodes(page, '.value:nth-child(8) .fixed') %>%
    html_text(trim = T)
})

df$text <- lapply(df$links_text, function(x) {
  page1 <- read_html(x)
  links1 <- html_nodes(page, 'p') %>%
    html_text(trim = T)
})

Upvotes: 1

Views: 128

Answers (2)

Ronak Shah

Reputation: 388982

You may do this in a single sapply command and use tryCatch to handle errors.

library(rvest)

df$text <- sapply(df$testo, function(x) {
  tryCatch({
    x %>%
      read_html() %>%                               # read the page listed in df$testo
      html_nodes('.value:nth-child(8) .fixed') %>%  # pull out the link to the full text
      html_text(trim = TRUE) %>%
      read_html() %>%                               # follow that link
      html_nodes('p') %>%                           # grab its paragraphs
      html_text(trim = TRUE) %>%
      toString()
  }, error = function(e) NA)
})
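
Note that toString() collapses the vector of paragraphs into one comma-separated string, so df$text ends up as a plain character column, and the tryCatch wrapper returns NA for any row whose page fails to download or parse.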

Upvotes: 1

O.A

Reputation: 141

You want links1 <- html_nodes(page, 'p') to refer to page1, not page.

(Otherwise, since there is no object called page in the function environment, R finds the page function from the utils package instead, so html_nodes is applied to a function, which is exactly what the error says.)
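
With that one-letter fix, the second step runs. A minimal sketch of the corrected block, keeping the question's column names:

df$text <- lapply(df$links_text, function(x) {
  page1 <- read_html(x)
  html_nodes(page1, 'p') %>%   # page1, not page
    html_text(trim = TRUE)
})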

In terms of Sys.sleep, it is optional. Check the page HTML and the user agreement for anything that prohibits scraping. If so, being kinder to the server might improve your chances of not getting blocked!

You can just include Sys.sleep(n) in the function where you create df$text. The value of n is up to you; I've had luck with 1-3 seconds, but it does make the run pretty slow!
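
For example, here is a minimal sketch that grafts a pause onto the single-sapply approach from the other answer (the 2-second value is an arbitrary choice):

df$text <- sapply(df$testo, function(x) {
  Sys.sleep(2)  # pause between requests to be kind to the server
  tryCatch({
    x %>%
      read_html() %>%
      html_nodes('.value:nth-child(8) .fixed') %>%
      html_text(trim = TRUE) %>%
      read_html() %>%
      html_nodes('p') %>%
      html_text(trim = TRUE) %>%
      toString()
  }, error = function(e) NA)
})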

Upvotes: 2
