wizkids121
wizkids121

Reputation: 656

Looping together extracted CSVs into one dataframe

I have some CSV files coming from an s3 bucket that I would like to merge into one dataframe. For the purposes of this exercise, let's use these five files stored in a character vector called script_results:

[1] "05162021-test-america-distributed-a-script_results-03-16-2021-404806813.csv"
[2] "07262021-test-america-distributed-a-script_results-04-26-2021-534691276.csv"
[3] "09202021-test-arizona-script_results-08-19-2020--08-31-2020-354731384.csv"
[4] "09222021-test-wisconsin-script_results-08-19-2020--08-20-2020-068472211.csv"
[5] "01062021-av-america-script_results-01-5-2021-038680844.csv"

The first step is extracting the data from each CSV, using the following (with the first .csv listed above as the example):

extract_data <- aws.s3::s3read_using(read.csv, bucket = "test-america", object = "03162021-test-america-distributed-a-script_results-03-16-2021-404806813.csv")

It would then continuously bind together each of these unpacked CSV's into one dataframe using the rbind.fill function from the plyr package.

So the end result of this would be all five of the data sets coming from the five CSVs listed under script_results merged together using the rbind.fill function.

I should also note that I will be receiving a new file on a daily basis, so I want to continuously adjust for those new files reaching my S3 bucket.

Is this possible to do?

EDIT: Using @r2evans comments

L <- lapply(script_results, function(fn) aws.s3::s3read_using(read.csv, bucket="bucket-name", object=fn))

This works. But the issue is when I try to convert it to a dataframe using:

t<- do.call(rbind.data.frame, L)

I'm given the following error:

Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = FALSE,  : 
  numbers of columns of arguments do not match

Upvotes: 0

Views: 60

Answers (1)

Skaqqs
Skaqqs

Reputation: 4140

I'm not familiar with the functions from aws.s3, but here's the general idea in base and maybe you can translate it to work for your case.

library(plyr)

# your old dataframe of bucket data from "yesterday" with a field that holds the
# name of each csv that a given record is associated with
df_old <- data.frame(otherdata = c(1,2,3), csv = "from_a_previous_bucket_dump.csv")

# get the names of files in the bucket "today"
scriptresults <- list.files()

# only read csvs that aren't already in df_old$csv
fs_new <- setdiff(scriptresults, df_old$csv)

# Read a csv and add it name as a new field
read_csv_filename <- function(filename){
  ret <- read.csv(filename)
  ret$csv <- filename
  return(ret)
}
fsl <- lapply(fs_new, read_csv_filename)

df_new <- plyr::rbind.fill(c(fsl, list(df_old)))

read_csv_filename() is from When importing CSV into R how to generate column with name of the CSV?

Upvotes: 1

Related Questions