I am currently experimenting with extracting documents from AWS S3 using R. I have successfully managed to extract one document and create a data frame from it. I would now like to extract multiple documents that sit within multiple subfolders of eventstore/footballStats/.
The code below demonstrates pulling one document, which works.
install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat")) # runs an update for aws S3
library(aws.s3)
# Set credentials for S3 ####
Sys.setenv("AWS_ACCESS_KEY_ID" = "KEY","AWS_SECRET_ACCESS_KEY" = "AccessKey")
# Extracts 1 document raw vector representation of an S3 documents ####
DataVector <-get_object("s3://eventstore/footballStats/2017-04-22/13/01/doc1.json")
I have then tried the code below to pull all documents from the folder and its subfolders, but I receive an error.
DataVector <- get_object("s3://eventstore/footballStats/2017-04-22/*")
ERROR:
chr "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><K"| __truncated__
Is there an alternative R package I should be using? Or does get_object() only work for a single document, so I should be using another function from the aws.s3 package?
Based on the hints from Drj and Thomas, I was able to solve this. S3 keys are flat, so get_object() cannot expand wildcards; instead, list the keys under a prefix with get_bucket_df() and fetch each object individually.
### Display the buckets in S3 ####
bucketlist()

### Build a data frame of the files in the bucket ####
dfBucket <- get_bucket_df('eventstore', prefix = 'footballStats/2017-04-22/')

# Create the object paths from the bucket listing
path <- dfBucket$Key

### Extract all documents into a character vector ####
s3Data <- NULL
for (lineN in path) {
  url <- paste('s3://eventstore/', lineN, sep = "")
  s3Vector <- get_object(url)     # raw vector for one object
  s3Value <- rawToChar(s3Vector)  # convert raw bytes to a JSON string
  s3Data <- c(s3Data, s3Value)
}
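As an aside, growing s3Data with c() inside the loop is fine for a handful of files; the same fetch can also be written without the growing vector using vapply. A minimal sketch, assuming the same bucket and path vector as above:

s3Data <- vapply(
  path,
  function(key) rawToChar(get_object(paste0('s3://eventstore/', key))),
  character(1),
  USE.NAMES = FALSE
)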
To create a data frame from the JSON strings, use tidyjson and dplyr. The tidyjson vignette below explains this well:
https://cran.r-project.org/web/packages/tidyjson/vignettes/introduction-to-tidyjson.html
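For example, a minimal sketch of that step; the field names homeTeam and awayTeam are hypothetical, so adjust them to the actual keys in the stats documents:

library(tidyjson)
library(dplyr)

# homeTeam/awayTeam are assumed field names -- replace with the real JSON keys
statsDf <- s3Data %>%
  as.tbl_json() %>%
  spread_values(
    homeTeam = jstring("homeTeam"),
    awayTeam = jstring("awayTeam")
  )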