Reputation: 912
I want to read CSV files in R that are stored in an S3 directory. Each file is more than 6 GB in size, and every file is needed for further calculation in R. Imagine that I have 10 files in an S3 folder; I need to read each of them separately before a for loop. Firstly, I tried this, and it works in the case when I know the name of the CSV file:
library(aws.s3)

Sys.setenv("AWS_ACCESS_KEY_ID" = "xyy",
           "AWS_SECRET_ACCESS_KEY" = "yyx")

data <- s3read_using(FUN = read.csv, object = "my_folder/file.csv",
                     sep = ",", stringsAsFactors = FALSE, header = TRUE)
However, how can I access multiple files without explicitly giving their names to the s3read_using function? This is necessary because I use partition() in Spark, which divides the original dataset into subparts with generic names (e.g. part1-0839709037fnfih.csv). If I could automatically list the CSV files in an S3 folder and use them before my calculation, that would be great:
get_ls_files <- ....  # gives me a list of all csv files in the S3 folder

for (i in seq_along(get_ls_files)) {
  filename <- get_ls_files[i]
  tmp <- s3read_using(FUN = read.csv, object = paste0("my_folder/", filename),
                      sep = ",", stringsAsFactors = FALSE, header = TRUE)
  .....
}
Upvotes: 3
Views: 3190
Reputation: 912
I found an answer in case anyone needs it, although the documentation is not good. To get a list of files in a particular S3 folder you need to use get_bucket and define a prefix. After this, search the list for the .csv extension to get a list of all the .csv files in that S3 folder.
# List everything under the prefix, then keep only the .csv keys
tmp <- get_bucket(bucket = "my_bucket", prefix = "folder/subfolder")
list_csv <- data.frame(tmp)
csv_paths <- list_csv$Key[grep("\\.csv$", list_csv$Key)]
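For completeness, here is a minimal sketch of how the listed keys can be fed back into the loop from the question. It assumes the bucket is called "my_bucket"; the Key column returned by get_bucket already contains the full path inside the bucket, so each key can be passed straight to s3read_using together with the bucket argument:

library(aws.s3)

# List objects under the prefix and keep only the .csv keys
tmp       <- get_bucket(bucket = "my_bucket", prefix = "folder/subfolder")
list_csv  <- data.frame(tmp)
csv_paths <- list_csv$Key[grep("\\.csv$", list_csv$Key)]

# Read each file; the Key already includes "folder/subfolder/...", so no paste() is needed
all_data <- vector("list", length(csv_paths))
for (i in seq_along(csv_paths)) {
  all_data[[i]] <- s3read_using(FUN = read.csv,
                                object = csv_paths[i],
                                bucket = "my_bucket",
                                sep = ",", stringsAsFactors = FALSE, header = TRUE)
  # ... further calculation on all_data[[i]] goes here ...
}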
Upvotes: 5