Reputation: 1533
Apologies for not providing a reprex; if I could produce one, I would not need to ask in the first place. I need to retrieve the Excel files whose filenames contain the word "età" from the list at the link
https://github.com/apalladi/covid_vaccini_monitoraggio/tree/main/dati
and also store their file names in a vector.
Any ideas on how to achieve that? I am thinking about using rvest, but I am open to other reasonable suggestions. Note that the list of files has to be obtained from the GitHub page, since it is not known a priori. Thanks!
Upvotes: 1
Views: 165
Reputation: 173793
You should use the GitHub API rather than scraping the website. This way, you can get the file names and the download links into a two-column data frame by doing:
library(httr)
library(dplyr)

# List the contents of the "dati" folder through the GitHub contents API
req <- GET(paste0("https://api.github.com/repos/",
                  "apalladi/covid_vaccini_monitoraggio/contents/dati"))
stop_for_status(req)  # fail early on an HTTP error (e.g. rate limiting)

file_list <- content(req)
filenames <- sapply(file_list, function(x) x$name)

# Keep only the Excel files
file_list <- file_list[grepl("xlsx$", filenames)]

tibble(file = sapply(file_list, function(x) x$name),
       link = sapply(file_list, function(x) x$download_url))
#> # A tibble: 30 x 2
#>    file                         link
#>    <chr>                        <chr>
#>  1 data_iss_età_2021-07-14.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  2 data_iss_età_2021-07-21.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  3 data_iss_età_2021-07-28.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  4 data_iss_età_2021-08-04.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  5 data_iss_età_2021-08-11.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  6 data_iss_età_2021-08-18.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  7 data_iss_età_2021-08-25.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  8 data_iss_età_2021-09-01.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  9 data_iss_età_2021-09-08.xlsx https://raw.githubusercontent.com/apalladi/covi~
#> 10 data_iss_età_2021-09-15.xlsx https://raw.githubusercontent.com/apalladi/covi~
#> # ... with 20 more rows
Created on 2022-02-01 by the reprex package (v2.0.1)
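The question also asks to keep only the names containing "età" and to store them in a vector, which the listing above does not do explicitly. A minimal sketch of that extra step follows, assuming each entry returned by content(req) is a list with `name` and `download_url` fields, as used above. The helper `select_files` and the `example_entries` list are hypothetical names introduced here for illustration, and the example entries are made up rather than taken from the repository.

```r
# Hypothetical helper: keep the .xlsx entries whose file name contains `pattern`.
# `entries` is assumed to be a list of lists with `name` and `download_url`
# fields, like the result of content(req) above.
select_files <- function(entries, pattern = "età") {
  nms <- vapply(entries, function(x) x$name, character(1))
  keep <- grepl(pattern, nms, fixed = TRUE) & grepl("\\.xlsx$", nms)
  urls <- vapply(entries[keep], function(x) x$download_url, character(1))
  setNames(urls, nms[keep])  # named vector: names = files, values = links
}

# Made-up entries, for illustration only (not the real repository listing)
example_entries <- list(
  list(name = "data_iss_età_2021-07-14.xlsx",
       download_url = "https://example.com/a.xlsx"),
  list(name = "other_file.xlsx",
       download_url = "https://example.com/b.xlsx"),
  list(name = "notes_età.csv",
       download_url = "https://example.com/c.csv")
)

selected <- select_files(example_entries)
eta_names <- names(selected)  # character vector of the matching file names

# With the real listing, the files could then be downloaded along these lines
# (mode = "wb" keeps the binary xlsx files intact on Windows):
# selected <- select_files(file_list)
# for (i in seq_along(selected)) {
#   download.file(selected[[i]], destfile = names(selected)[i], mode = "wb")
# }
```

Returning a named vector keeps each file name paired with its link, so `names(selected)` gives the vector of names and the values give the download URLs. If the match unexpectedly comes back empty on the real listing, one possible cause is that the accented letter is stored in a decomposed Unicode form; matching on a plain ASCII part of the name would sidestep that.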
Upvotes: 2