larry77

Reputation: 1533

R + rvest: retrieve files from GitHub

Apologies for not providing a reprex, but if I could write one, I would not need to post this in the first place. I need to retrieve the Excel files whose file names contain the word "età" from the folder listed at the link

https://github.com/apalladi/covid_vaccini_monitoraggio/tree/main/dati

and also store their file names in a vector.

Any idea how to achieve that? I am thinking about using rvest, but I am open to other reasonable suggestions. Note that the list of files needs to be obtained from the GitHub page, since it is not known a priori. Thanks!

Upvotes: 1

Views: 165

Answers (1)

Allan Cameron

Reputation: 173793

You should use the GitHub API rather than scraping the website. The contents endpoint returns one entry per file in the folder, so you can get the file names and the download links into a two-column data frame by doing:

library(httr)
library(dplyr)

req <- GET(paste0("https://api.github.com/repos/", 
                  "apalladi/covid_vaccini_monitoraggio/contents/dati"))

file_list <- content(req)
filenames <- sapply(file_list, function(x) x$name)

file_list <- file_list[grepl("xlsx$", filenames)]

tibble(file = sapply(file_list, function(x) x$name),
       link = sapply(file_list, function(x) x$download_url))
#> # A tibble: 30 x 2
#>    file                         link                                            
#>    <chr>                        <chr>                                           
#>  1 data_iss_età_2021-07-14.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  2 data_iss_età_2021-07-21.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  3 data_iss_età_2021-07-28.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  4 data_iss_età_2021-08-04.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  5 data_iss_età_2021-08-11.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  6 data_iss_età_2021-08-18.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  7 data_iss_età_2021-08-25.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  8 data_iss_età_2021-09-01.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  9 data_iss_età_2021-09-08.xlsx https://raw.githubusercontent.com/apalladi/covi~
#> 10 data_iss_età_2021-09-15.xlsx https://raw.githubusercontent.com/apalladi/covi~
#> # ... with 20 more rows

Created on 2022-02-01 by the reprex package (v2.0.1)
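
To go from that listing to what the question asks for (only the "età" files, their names in a vector, and the files on disk), something along these lines should work. Treat it as a sketch rather than tested code: `filter_entries()`, `fetch_eta_files()` and the `dati_eta` output folder are names I made up for illustration, and I am assuming the script is saved as UTF-8 and that the file names use the precomposed "à" (if nothing matches, that second assumption is the first thing to check).

```r
library(httr)

# Hypothetical helper: from the parsed API response, keep the .xlsx entries
# whose file name contains `pattern`.
filter_entries <- function(entries, pattern = "età") {
  nms <- vapply(entries, function(x) x$name, character(1))
  entries[grepl(pattern, nms, fixed = TRUE) & grepl("\\.xlsx$", nms)]
}

# Hypothetical helper: list the folder, download the matching files into
# `dest_dir` (an assumed folder name), and return their names as a vector.
fetch_eta_files <- function(dest_dir = "dati_eta") {
  req <- GET(paste0("https://api.github.com/repos/",
                    "apalladi/covid_vaccini_monitoraggio/contents/dati"))
  stop_for_status(req)  # raise an R error if the API answers with 4xx/5xx

  hits       <- filter_entries(content(req))
  file_names <- vapply(hits, function(x) x$name, character(1))
  links      <- vapply(hits, function(x) x$download_url, character(1))

  dir.create(dest_dir, showWarnings = FALSE)
  for (i in seq_along(links)) {
    # URLencode() leaves an already percent-encoded link untouched
    download.file(URLencode(links[i]), file.path(dest_dir, file_names[i]),
                  mode = "wb")  # binary mode, so xlsx files survive on Windows
  }
  file_names
}

# Not run here because it needs network access:
# eta_files <- fetch_eta_files()
```

`eta_files` is then the character vector of file names, and the workbooks themselves sit in `dati_eta`, where you can read them with whichever Excel reader you prefer.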

Upvotes: 2