Need help scraping a big archive

Question

For a schoolproject i have to scrape a website which isn't a problem. But for it to be called BigData i wanted to scrape the whole archive(which is the past 5 years). The only thing that changes in the url is the date at the end of the url but i don't know how to write a script that changes only the date at the end.

The website I'm using is this: https://www.ongelukvandaag.nl/archief/ .

And the dates i need are from 01-01-2015 until 24-09-2020. The first part of the code i already figured out and I'm able to scrape 1 page. I'm a beginner at using R and would like to know if anyone could help me. The code is shown below. Thanks in advance!

This is what i got so far and the errors are underneath the code.

install.packages("XML")
install.packages("reshape")
install.packages("robotstxt")
install.packages("Rcrawler")
install.packages("RSelenium")
install.packages("devtools")
install.packages("exifr")
install.packages("Publish")

devtools::install_github("r-lib/xml2")

library(rvest)
library(dplyr)
library(xml)
library(stringr)
library(jsonlite)
library(xml12)
library(purrr)
library(tidyr)
library(reshape)
library(XML)
library(robotstxt)
library(Rcrawler)
library(RSelenium)
library(ps)
library(devtools)
library(exifr)
library(Publish)

#Create an url object

url<-"https://www.ongelukvandaag.nl/archief/%d "

#Verify the web can be scraped

paths_allowed(paths = c(url))

#Obtain the links for every day from 2015 to 2020

map_df(2015:2020, function(i){
  page<-read_html(sprintf(url,i))
  
  data.frame(Links = html_attr(html_nodes(page, ".archief a"),"href"))
}) -> Links %>%
  
Links$Links<-paste("https://www.ongelukvandaag.nl/",Links$Links,sep = "")

#Scrape what you want from each link:
  
d<- map(Links$Links, function(x) {
    
    Z <- read_html(x)
    
    Date <- Z %>% html_nodes(".text-muted") %>% html_text(trim = TRUE) # Last update
    All_title <- Z %>% html_nodes("h2") %>% html_text(trim = TRUE) # Title
    
    return(tibble(All_title,Date))
    
  })

The errors i get:

Error in open.connection(x, "rb") : HTTP error 400. 

in paste("https://www.ongelukvandaag.nl/", Links$Links, sep = "") :   object 'Links' not found >

in map(Links$Links, function(x) { : object 'Links' not found

and the packages "xml12" & "xml" don't work in this version of RStudio

Need help scraping a big archive

Answers (1)

Related Questions