Reputation: 41
I have a dataframe where one of the columns contains links to webpages I want to scrape with rvest. I would like to download some links from each page, store them in another column, and then download some text from those links. I tried to do it using lapply, but at the second step I get:
Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "function"
Maybe the problem is that the first set of links is saved as a list. Do you know how I can solve it?
This is my MWE (in my full dataset I have around 5000 links; should I use Sys.sleep, and how?):
library(rvest)
df <- structure(list(numeroAtto = c("2855", "2854", "327", "240", "82"
), testo = c("http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.2855.18PDL0127540",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.327.18PDL0003550",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.327.18PDL0003550",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.240.18PDL0007740",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.82.18PDL0001750"
)), row.names = c(NA, 5L), class = "data.frame")
df$links_text <- lapply(df$testo, function(x) {
  page <- read_html(x)
  links <- html_nodes(page, '.value:nth-child(8) .fixed') %>%
    html_text(trim = T)
})
df$text <- lapply(df$links_text, function(x) {
  page1 <- read_html(x)
  links1 <- html_nodes(page, 'p') %>%
    html_text(trim = T)
})
Upvotes: 1
Views: 128
Reputation: 388982
You may do this in a single sapply call and use tryCatch to handle errors.
library(rvest)
df$text <- sapply(df$testo, function(x) {
  tryCatch({
    x %>%
      read_html() %>%
      html_nodes('.value:nth-child(8) .fixed') %>%
      html_text(trim = TRUE) %>%
      read_html() %>%
      html_nodes('p') %>%
      html_text(trim = TRUE) %>%
      toString()
  }, error = function(e) NA)
})
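For the roughly 5000 links mentioned in the question, a polite delay can be folded into the same pattern. A minimal sketch, assuming a one-second pause is acceptable (the helper name safe_scrape and the delay value are illustrative, not part of the answer above):

```r
library(rvest)

# Hypothetical helper: pause, scrape the <p> text from one URL,
# and return NA instead of stopping if the request fails.
safe_scrape <- function(url, delay = 1) {
  Sys.sleep(delay)  # pause between requests to go easy on the server
  tryCatch({
    url %>%
      read_html() %>%
      html_nodes('p') %>%
      html_text(trim = TRUE) %>%
      toString()
  }, error = function(e) NA_character_)
}

# Usage: df$text <- sapply(df$testo, safe_scrape)
```

The delay makes a 5000-link run take well over an hour, so you may want to tune the delay argument to what the server tolerates.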
Upvotes: 1
Reputation: 141
You want links1 <- html_nodes(page, 'p') to refer to page1, not page. (Otherwise, since there is no object page in the function environment, R tries to apply html_nodes to the utils function page.)
In terms of Sys.sleep, it is fairly optional. Check the page HTML and the site's terms of use to see whether anything prohibits scraping. If so, then scraping more kindly to the server might improve your chances of not getting blocked! You can just include Sys.sleep(n) in the function where you create df$text. The value of n is up to you; I've had luck with 1-3 seconds, but it does become pretty slow/long!
Upvotes: 2