Reputation: 2491
I'm trying to get data for RAIS (a Brazilian employee registry dataset) that is shared using a Google Drive public folder. This is the address: https://drive.google.com/folderview?id=0ByKsqUnItyBhZmNwaXpnNXBHMzQ&usp=sharing&tid=0ByKsqUnItyBhU2RmdUloTnJGRGM#list
Data is divided into one folder per year, and within each folder there is one file per state to download. I would like to automate the downloading process in R for all years or, failing that, at least for all files within a year folder. Downloaded files should keep the names they get when downloaded manually.
I know a little R, but no web programming or web scraping. This is what I have so far: by manually downloading the first of the 2012 files, I could see the URL my browser used to download it: https://drive.google.com/uc?id=0ByKsqUnItyBhS2RQdFJ2Q0RrN0k&export=download
Thus, I suppose the file id is: 0ByKsqUnItyBhS2RQdFJ2Q0RrN0k
Searching the HTML code of the 2012 page, I was able to find that ID and the file name associated with it: AC2012.7z. All the other IDs and file names are in that section of the HTML code. So, assuming I can download the file correctly, I suppose I could at least generalize to the other files.
In R, I tried the following code to download the file:
url <- "https://drive.google.com/uc?id=0ByKsqUnItyBhS2RQdFJ2Q0RrN0k&export=download"
download.file(url,"AC2012.7z")
unzip("AC2012.7z")
It does download, but I get an error when trying to uncompress the file (both within R and manually with 7-Zip). There must be something wrong with the file downloaded in R, as the file size (3.412Kb) does not match what I get from manually downloading the file (3.399Kb).
Upvotes: 3
Views: 3074
Reputation: 1059
For anyone trying to solve this problem today, you can use the googledrive package.
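library(googledrive)
# List the folder's contents; as_id() marks the URL as a Drive ID rather than a path.
ls_tibble <- googledrive::drive_ls(as_id(GOOGLE_DRIVE_URL_FOR_THE_TARGET_FOLDER))
# Download each listed file into the current working directory.
for (file_id in ls_tibble$id) {
  googledrive::drive_download(as_id(file_id))
}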
This will (1) open an authentication page in your browser so that you can authorise the tidyverse packages (via gargle) to access Google Drive on behalf of your account, and (2) download all the files in the folder at that URL to the current working directory of your R session.
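Since the question describes one sub-folder per year, the loop above would also hit folders, which cannot be downloaded as files. A minimal sketch of a recursive variant follows. It assumes the shared folder is still public and contains only sub-folders and ordinary (non-Google-Docs) files. `download_drive_tree()` is a hypothetical helper name, not part of googledrive, and I have not checked whether the old folder link from the question still resolves.

```r
# Sketch only: `download_drive_tree()` is a hypothetical helper, not a
# googledrive function. It walks a Drive folder and mirrors it locally,
# keeping the file names that Drive reports (e.g. "AC2012.7z").
download_drive_tree <- function(folder, dest = ".") {
  # Create the local target directory if it does not exist yet.
  dir.create(dest, recursive = TRUE, showWarnings = FALSE)

  # One row per item directly inside `folder`.
  items <- googledrive::drive_ls(folder)

  for (i in seq_len(nrow(items))) {
    item <- items[i, ]
    if (googledrive::is_folder(item)) {
      # Sub-folder (assumed to be a year): recurse into a matching local dir.
      download_drive_tree(item, file.path(dest, item$name))
    } else {
      # Ordinary file: save it under its Drive name inside `dest`.
      googledrive::drive_download(
        item,
        path = file.path(dest, item$name),
        overwrite = TRUE
      )
    }
  }
  invisible(items)
}

# Example call, commented out because it needs network access.
# The ID is the folder ID from the URL in the question; "RAIS" is an
# arbitrary local directory name.
# googledrive::drive_deauth()  # skip sign-in; intended for public files
# download_drive_tree(googledrive::as_id("0ByKsqUnItyBhZmNwaXpnNXBHMzQ"), "RAIS")
```

Calling `drive_deauth()` first should avoid the browser sign-in when the files are publicly shared. If the folder is private, leave that call out and authenticate as described above.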
Upvotes: 1