Kim

Reputation: 4328

Scrape data hosted on AWS S3

I am trying to scrape some data offered in the public domain but hosted on AWS S3: the link is here. The page source does not carry much, so the usual

library(rvest)
url <- "https://dl.ncsbe.gov/index.html?prefix=data/SampleBallots/2018-11-06/"
read_html(url) %>% html_nodes("a")

will return nothing. By inspecting Elements, I have also tried

read_html(url) %>%
  html_nodes(xpath = '//*[@id="listing"]/pre/a[1]')

but no luck either.

My best bet so far has been to open Firefox, press Ctrl + A, right-click, and choose View Selection Source, which I then parsed for <a> nodes with some regex. But that method is quite ad hoc, especially with more complicated subfolder structures.
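Roughly, the parsing step looks like this (a rough sketch only; selection_source.html is just a placeholder name for the saved selection source):

src <- paste(readLines("selection_source.html", warn = FALSE), collapse = "\n")
# Grab every href="..." attribute, then strip the surrounding href=" and ".
hrefs <- regmatches(src, gregexpr('href="[^"]+"', src))[[1]]
hrefs <- sub('^href="', "", sub('"$', "", hrefs))
head(hrefs)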

Ultimately, I would like to be able to download everything at that link without manual intervention, including items in all subfolders. Is there a clever method in R for tackling data on AWS S3 that I am missing?

Upvotes: 1

Views: 2653

Answers (3)

camille

Reputation: 16862

The aws.s3 package gives you easy access from R to the AWS S3 API. It has helper functions to work with public buckets like this one and to list the objects in a bucket.
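As a quick sanity check before listing anything, you can confirm the bucket is reachable; this snippet is a sketch added for illustration and assumes anonymous (unsigned) requests work for this public bucket.

library(aws.s3)
# Returns TRUE if the bucket can be reached (assumes anonymous access is allowed).
bucket_exists(bucket = "dl.ncsbe.gov")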

I'll get a list of everything in that bucket within the subfolder you pointed to (data/SampleBallots). The default is to limit to 1000 records, which I overrode with max = Inf.

library(dplyr)
library(stringr)
library(purrr)
library(aws.s3)

ballot_keys <- get_bucket_df(bucket = "dl.ncsbe.gov", prefix = "data/SampleBallots", max = Inf) %>% 
  pull(Key)

length(ballot_keys)
#> [1] 10869

Maybe you do want all 10,869 objects in that folder. The keys come back as paths to each object, several of which are the zip files in the base SampleBallots.

ballot_keys[1:4]
#> [1] "data/SampleBallots/2008-05-06.zip" "data/SampleBallots/2008-06-24.zip"
#> [3] "data/SampleBallots/2008-11-04.zip" "data/SampleBallots/2009-09-15.zip"
length(str_subset(ballot_keys, "\\.zip$"))
#> [1] 24

Many more files are in those subfolders you mentioned, which I haven't combed through but which have keys like this one.

ballot_keys[200]
#> [1] "data/SampleBallots/2016-03-15/ANSON/ANSON-20160315-Style044-DEM-WADESBORO_1.pdf"

You could then use the package's save_object function to download whichever files you want. You could do that with just a subset of keys, like below, and some means of looping: a for loop, an *apply function, or purrr::map/purrr::walk (a base-R version is sketched after the block). Give each object a file path (here I just take the end of the key) and it will download to the path you supply. I haven't downloaded more than one of these, because they're relatively large (~200 MB each).

str_subset(ballot_keys, "\\.zip$") %>%
  walk(function(key) {
    filename <- str_extract(key, "[\\d\\-]+\\.zip$")
    save_object(object = key, bucket = "dl.ncsbe.gov", file = filename)
  })
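If you'd rather avoid purrr, a base-R version of the same loop is a straightforward sketch, using basename() to take the file name off the end of each key:

# Same download step with base R only.
zip_keys <- grep("\\.zip$", ballot_keys, value = TRUE)
invisible(lapply(zip_keys, function(key) {
  save_object(object = key, bucket = "dl.ncsbe.gov", file = basename(key))
}))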

Upvotes: 1

Sonny

Reputation: 3183

If all you want is to download that entire bucket, which is public, you can use the AWS CLI as shown below.

First, locate which bucket the page refers to.

Open the link in your browser and view the page source, which shows:

<script type="text/javascript">
    var S3BL_IGNORE_PATH = true;
    var BUCKET_NAME = 'dl.ncsbe.gov';
    var BUCKET_URL = 'https://s3.amazonaws.com';
    var S3B_ROOT_DIR = '';
</script>

So the bucket name is dl.ncsbe.gov. Install the AWS CLI from here.

Now, you can download the entire bucket as below:

$ aws s3 sync s3://dl.ncsbe.gov .

This will download the bucket contents to your current directory.

It is a large bucket, so I stopped it. Below is what I got:

$ ls -lrt
total 56
drwxrwxr-x  3 ubuntu ubuntu 4096 May  5 14:30 Campaign_Finance
drwxrwxr-x  2 ubuntu ubuntu 4096 May  5 14:30 Changed_Statutes
drwxrwxr-x  4 ubuntu ubuntu 4096 May  5 14:30 Elections
drwxrwxr-x  5 ubuntu ubuntu 4096 May  5 14:30 Ethics
drwxrwxr-x 11 ubuntu ubuntu 4096 May  5 14:30 One-Stop_Early_Voting
drwxrwxr-x  3 ubuntu ubuntu 4096 May  5 14:30 Outreach
drwxrwxr-x  2 ubuntu ubuntu 4096 May  5 14:30 PrecinctMaps
drwxrwxr-x  3 ubuntu ubuntu 4096 May  5 14:31 Press
drwxrwxr-x  4 ubuntu ubuntu 4096 May  5 14:31 Public_Records_Requests
drwxrwxr-x  3 ubuntu ubuntu 4096 May  5 14:31 Requests
drwxrwxr-x 64 ubuntu ubuntu 4096 May  5 14:31 ENRS
drwxrwxr-x  5 ubuntu ubuntu 4096 May  5 14:31 Rulemaking
drwxrwxr-x  6 ubuntu ubuntu 4096 May  5 14:31 NVRA
drwxrwxr-x 11 ubuntu ubuntu 4096 May  5 14:31 ShapeFiles

If you want a specific folder rather than the entire bucket, you can add that folder name to the aws s3 command:

$ aws s3 sync s3://dl.ncsbe.gov/data/SampleBallots/ .

Hope this helps!

Upvotes: 4

Tonio Liebrand

Reputation: 17719

The page seems to use JavaScript to include the data you are looking for; see https://dl.ncsbe.gov/list.js, which is sourced when the page loads.
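As a side note, and separate from the approach below: that script just renders the bucket's own listing API, so you could also query the XML listing directly. A rough sketch with httr and xml2, assuming anonymous access to the public bucket:

library(httr)
library(xml2)
# Ask S3 for the bucket listing that list.js fetches behind the scenes.
resp <- GET("https://s3.amazonaws.com/dl.ncsbe.gov",
            query = list(prefix = "data/SampleBallots/2018-11-06/"))
doc  <- read_xml(content(resp, as = "text", encoding = "UTF-8"))
# local-name() sidesteps the S3 XML namespace.
keys <- xml_text(xml_find_all(doc, "//*[local-name() = 'Key']"))
head(keys)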

Pages that use JavaScript to load their data are not supported by rvest, so you might have to switch to R + PhantomJS, or you can do it with RSelenium:

The XPath you used was very close; just remove the index in a[1], otherwise you only get the first element. So I suggest using //*[@id="listing"]/pre/a.

Then you can extract the links via the href attribute and pass them to download.file(). I added an example for ten links with RSelenium. Here is a great guide for setting up the package: https://rpubs.com/johndharrison/RSelenium-Docker.

Reproducible example:

library(RSelenium)

# Assumes remDr is an already-started remote driver, e.g. the $client returned
# by rsDriver() or a remoteDriver() pointed at a Selenium/Docker server
# (see the guide linked above).
n <- 11:20 # test with 10 links
remDr$navigate("https://dl.ncsbe.gov/index.html?prefix=data/SampleBallots/2018-11-06/")
elems <- remDr$findElements("xpath", '//*[@id="listing"]/pre/a')
links <- unlist(sapply(elems[n], function(elem) elem$getElementAttribute("href")))

download <- function(link){
  # Use the last path segment of the URL as the local file name.
  splitted <- unlist(strsplit(link, "/"))
  fileName <- splitted[length(splitted)]
  download.file(url = link, destfile = fileName)
}
sapply(links, download)

Upvotes: 1
