Reputation: 4328
I am trying to scrape some data offered in the public domain but hosted on AWS S3: the link is here. The page source does not carry much, so the usual
library(rvest)
url <- "https://dl.ncsbe.gov/index.html?prefix=data/SampleBallots/2018-11-06/"
read_html(url) %>% html_nodes("a")
will return nothing. By inspecting the elements in the browser, I have also tried
read_html(url) %>%
html_nodes(xpath = '//*[@id="listing"]/pre/a[1]')
but no luck either.
My best bet so far has been to open Firefox, press Ctrl + A, right-click, and choose View Selection Source, which I then parsed for a nodes with some regex. But that method is quite ad hoc, especially in a setup with more complicated subfolder structures.
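For reference, the manual step currently looks roughly like this (a sketch only; the file name selection.html is made up, and I use rvest on the saved file rather than raw regex):
library(rvest)
# Parse the selection source saved from Firefox (hypothetical local file name)
saved <- read_html("selection.html")
links <- saved %>%
  html_nodes("a") %>%
  html_attr("href")
head(links)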
I would like to ultimately be able to download everything in the link without manual intervention, including items in all subfolders. Is there a clever method in R to tackle data on AWS S3 that I am missing?
Upvotes: 1
Views: 2653
Reputation: 16862
The aws.s3 package gives you easy access to AWS S3 from R. It has helper functions to access public buckets like this one and to list the objects in a bucket.
I'll get a list of everything in that bucket within the subfolder you pointed to (data/SampleBallots). The default is to limit to 1,000 records, which I overrode with max = Inf.
library(dplyr)
library(stringr)
library(purrr)
library(aws.s3)
ballot_keys <- get_bucket_df(bucket = "dl.ncsbe.gov", prefix = "data/SampleBallots", max = Inf) %>%
  pull(Key)
length(ballot_keys)
#> [1] 10869
Maybe you do want all 10,869 objects in that folder. The keys come back as paths to each object, several of which are the zip files in the base SampleBallots.
ballot_keys[1:4]
#> [1] "data/SampleBallots/2008-05-06.zip" "data/SampleBallots/2008-06-24.zip"
#> [3] "data/SampleBallots/2008-11-04.zip" "data/SampleBallots/2009-09-15.zip"
length(str_subset(ballot_keys, "\\.zip$"))
#> [1] 24
Many more files are in those subfolders you mentioned, which I haven't combed through but which have keys like this one.
ballot_keys[200]
#> [1] "data/SampleBallots/2016-03-15/ANSON/ANSON-20160315-Style044-DEM-WADESBORO_1.pdf"
You could then use the package's save_object function to download whichever files you want. You could do that with just a subset of keys, like below, and some means of looping over them: an *apply function, or purrr::map/purrr::walk. Give each object a file path (here I just take the end of the key) and it will download to the path you supply. I haven't downloaded more than one of these, because they're relatively large (~200MB).
str_subset(ballot_keys, "\\.zip$") %>%
  walk(function(key) {
    filename <- str_extract(key, "[\\d\\-]+\\.zip$")
    save_object(object = key, bucket = "dl.ncsbe.gov", file = filename)
  })
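If you also want the files in those subfolders while keeping their directory layout, a sketch along the same lines (using the ANSON subfolder from the key above as a small subset, and assuming you want to mirror the key paths locally):
library(aws.s3)
library(purrr)
library(stringr)

# Mirror a small subset of the subfolder keys locally, keeping the
# data/SampleBallots/... directory structure on disk.
str_subset(ballot_keys, "2016-03-15/ANSON/") %>%
  walk(function(key) {
    dir.create(dirname(key), recursive = TRUE, showWarnings = FALSE)
    save_object(object = key, bucket = "dl.ncsbe.gov", file = key)
  })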
Upvotes: 1
Reputation: 3183
If all you want is to download that entire bucket, which is public, then you can use the AWS CLI as below.
Open the link in your browser and view the page source, which shows:
<script type="text/javascript">
var S3BL_IGNORE_PATH = true;
var BUCKET_NAME = 'dl.ncsbe.gov';
var BUCKET_URL = 'https://s3.amazonaws.com';
var S3B_ROOT_DIR = '';
</script>
So the bucket name is dl.ncsbe.gov. Install the AWS CLI from here.
Now you can download the entire bucket as below:
$ aws s3 sync s3://dl.ncsbe.gov .
This will download everything to your current directory.
It is a large bucket, so I stopped it. Below is what I got:
$ ls -lrt
total 56
drwxrwxr-x 3 ubuntu ubuntu 4096 May 5 14:30 Campaign_Finance
drwxrwxr-x 2 ubuntu ubuntu 4096 May 5 14:30 Changed_Statutes
drwxrwxr-x 4 ubuntu ubuntu 4096 May 5 14:30 Elections
drwxrwxr-x 5 ubuntu ubuntu 4096 May 5 14:30 Ethics
drwxrwxr-x 11 ubuntu ubuntu 4096 May 5 14:30 One-Stop_Early_Voting
drwxrwxr-x 3 ubuntu ubuntu 4096 May 5 14:30 Outreach
drwxrwxr-x 2 ubuntu ubuntu 4096 May 5 14:30 PrecinctMaps
drwxrwxr-x 3 ubuntu ubuntu 4096 May 5 14:31 Press
drwxrwxr-x 4 ubuntu ubuntu 4096 May 5 14:31 Public_Records_Requests
drwxrwxr-x 3 ubuntu ubuntu 4096 May 5 14:31 Requests
drwxrwxr-x 64 ubuntu ubuntu 4096 May 5 14:31 ENRS
drwxrwxr-x 5 ubuntu ubuntu 4096 May 5 14:31 Rulemaking
drwxrwxr-x 6 ubuntu ubuntu 4096 May 5 14:31 NVRA
drwxrwxr-x 11 ubuntu ubuntu 4096 May 5 14:31 ShapeFiles
If you want a specific folder and not the entire bucket, then you can add that folder name to the aws s3 command as well, like:
$ aws s3 sync s3://dl.ncsbe.gov/data/SampleBallots/ .
Hope this helps!
Upvotes: 4
Reputation: 17719
The page seems to use JavaScript to include the data you are looking for; see https://dl.ncsbe.gov/list.js, which is sourced when the page loads.
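As an aside, list.js appears to build the listing by querying the standard S3 list-objects XML endpoint, so you could also request that XML directly without a browser. A minimal sketch with httr and xml2 (the path-style URL is an assumption, and a single request returns at most 1000 keys):
library(httr)
library(xml2)

# Request the object listing for the prefix from the question; for the full
# listing you would need pagination (or the aws.s3 approach above).
res <- GET("https://s3.amazonaws.com/dl.ncsbe.gov/",
           query = list(prefix = "data/SampleBallots/2018-11-06/"))
doc <- read_xml(content(res, as = "text", encoding = "UTF-8"))
xml_ns_strip(doc)  # drop the S3 XML namespace so plain XPath works
keys <- xml_text(xml_find_all(doc, "//Contents/Key"))
head(keys)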
Pages that use JavaScript to load their data are not supported by rvest, so you might have to switch to R + PhantomJS, or you could do it with RSelenium:
The XPath you used was very close; just remove the indexing a[1], otherwise you only get the first element. So I suggest using: //*[@id="listing"]/pre/a.
Then you can extract the links by reading each node's href attribute and use those links in download.file(). I added an example for ten links with RSelenium. Here is a great guide to setting up the package: https://rpubs.com/johndharrison/RSelenium-Docker.
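For the example below to run, you need a connection to a running Selenium server; a minimal sketch (the host, port, and browser are assumptions, adjust them to your Docker or local setup):
library(RSelenium)

# Connect to an already-running Selenium server (hypothetical host/port).
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L,
                      browserName = "firefox")
remDr$open()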
Reproducible example:
n <- 11:20 # test for 10 links
remDr$navigate("https://dl.ncsbe.gov/index.html?prefix=data/SampleBallots/2018-11-06/")
elems <- remDr$findElements("xpath", '//*[@id="listing"]/pre/a')
links <- unlist(sapply(elems[n], function(elem) elem$getElementAttribute("href")))
download <- function(link){
  splitted <- unlist(strsplit(link, "/"))
  fileName <- splitted[length(splitted)]
  download.file(url = link, destfile = fileName)
}
sapply(links, download)
Upvotes: 1