Alberto

Reputation: 41

Scraping links in df columns with rvest

I have a dataframe where one of the columns contains links to webpages I want to scrape with rvest. From each page I want to extract some links, store them in another column, and then download text from those linked pages. I tried to do this with lapply, but at the second step I get Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "function". Maybe the problem is that the first set of links is saved as a list? Do you know how I can solve it?

This is my MWE. (In my full dataset I have around 5000 links; should I use Sys.sleep, and if so, how?)

library(rvest)

df <- structure(list(numeroAtto = c("2855", "2854", "327", "240", "82"
), testo = c("http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.2855.18PDL0127540", 
             "http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.327.18PDL0003550",
             "http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.327.18PDL0003550", 
             "http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.240.18PDL0007740", 
             "http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.82.18PDL0001750"
)), row.names = c(NA, 5L), class = "data.frame")

df$links_text <- lapply(df$testo, function(x) {
  page <- read_html(x)
  links <- html_nodes(page, '.value:nth-child(8) .fixed') %>%
    html_text(trim = T)
})

df$text <- lapply(df$links_text, function(x) {
  page1 <- read_html(x)
  links1 <- html_nodes(page, 'p') %>%
    html_text(trim = T)
})

Upvotes: 1

Views: 128

Answers (2)

Ronak Shah

Reputation: 388982

You may do this in a single sapply command and use tryCatch to handle errors.

library(rvest)

df$text <- sapply(df$testo, function(x) {
  tryCatch({
    x %>%
      read_html() %>%                               # read the page listed in df$testo
      html_nodes('.value:nth-child(8) .fixed') %>%  # pull out the link to the full text
      html_text(trim = TRUE) %>%
      read_html() %>%                               # follow that link
      html_nodes('p') %>%                           # grab its paragraphs
      html_text(trim = TRUE) %>%
      toString()
  }, error = function(e) NA)
})
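
Note that toString() collapses the vector of paragraphs into one comma-separated string, so df$text ends up as a plain character column, and the tryCatch wrapper returns NA for any row whose page fails to download or parse.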

Upvotes: 1

O.A

Reputation: 141

You want links1 <- html_nodes(page, 'p') to refer to page1, not page.

(Otherwise, since there is no object called page in the function environment, R finds the page function from the utils package instead, so html_nodes is applied to a function, which is exactly what the error says.)
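
With that one-letter fix, the second step runs. A minimal sketch of the corrected block, keeping the question's column names:

df$text <- lapply(df$links_text, function(x) {
  page1 <- read_html(x)
  html_nodes(page1, 'p') %>%   # page1, not page
    html_text(trim = TRUE)
})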

In terms of Sys.sleep, it is optional. Check the page HTML and the user agreement for anything that prohibits scraping. If so, being kinder to the server might improve your chances of not getting blocked!

You can just include Sys.sleep(n) in the function where you create df$text. The value of n is up to you; I've had luck with 1-3 seconds, but it does make the run pretty slow!
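
For example, here is a minimal sketch that grafts a pause onto the single-sapply approach from the other answer (the 2-second value is an arbitrary choice):

df$text <- sapply(df$testo, function(x) {
  Sys.sleep(2)  # pause between requests to be kind to the server
  tryCatch({
    x %>%
      read_html() %>%
      html_nodes('.value:nth-child(8) .fixed') %>%
      html_text(trim = TRUE) %>%
      read_html() %>%
      html_nodes('p') %>%
      html_text(trim = TRUE) %>%
      toString()
  }, error = function(e) NA)
})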

Upvotes: 2
