Makaroni

Reputation: 912

Read files one by one from S3 in R

I want to read CSV files in R that sit in an S3 directory. Each file is more than 6 GB in size, and every file is needed for further calculation in R. Imagine that I have 10 files in an S3 folder; I need to read each of them separately before a for loop. Firstly, I tried this, and it works when I know the name of the CSV file:

library(aws.s3)
Sys.setenv("AWS_ACCESS_KEY_ID" = "xyy",
           "AWS_SECRET_ACCESS_KEY" = "yyx")

data <- s3read_using(FUN = read.csv, object = "my_folder/file.csv",
                     sep = ",", stringsAsFactors = FALSE, header = TRUE)

However, how can I access multiple files without explicitly giving their names to the s3read_using function? This is necessary because I use partition() in Spark, which divides the original dataset into subparts with generic names (e.g. part1-0839709037fnfih.csv). If I could automatically list the CSV files in an S3 folder and use them before my calculation, that would be great.

get_ls_files <- .... # gives me a list of all CSV files in the S3 folder

for (i in seq_along(get_ls_files)) {

    filename = get_ls_files[i]

    # paste0 avoids the space that paste() would insert into the object key
    tmp = s3read_using(FUN = read.csv, object = paste0("my_folder/", filename),
                       sep = ",", stringsAsFactors = FALSE, header = TRUE)

    .....
}

Upvotes: 3

Views: 3190

Answers (1)

Makaroni

Reputation: 912

I found an answer in case anyone needs it, although the documentation is not good. To get a list of files in a particular S3 folder, you need to use get_bucket and define a prefix. After this, search the list for the extension .csv to get all the .csv files in that folder:

tmp = get_bucket(bucket = "my_bucket", prefix = "folder/subfolder")
list_csv = data.frame(tmp)
# anchor the pattern so only keys ending in ".csv" match ("." alone is a regex wildcard)
csv_paths = list_csv$Key[grep("\\.csv$", list_csv$Key)]
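
Putting both pieces together, a minimal sketch of the full loop (bucket and prefix names are placeholders; note that the Key returned by get_bucket is already the full object path within the bucket, so no paste is needed):

library(aws.s3)

# list everything under the prefix, then keep only the .csv keys
tmp <- get_bucket(bucket = "my_bucket", prefix = "folder/subfolder")
list_csv <- data.frame(tmp)
csv_paths <- list_csv$Key[grep("\\.csv$", list_csv$Key)]

for (filename in csv_paths) {
  # filename is the complete key, e.g. "folder/subfolder/part1-0839709037fnfih.csv"
  dat <- s3read_using(FUN = read.csv, bucket = "my_bucket", object = filename,
                      sep = ",", stringsAsFactors = FALSE, header = TRUE)
  # ... further calculation on dat ...
}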

Upvotes: 5
