Reputation: 395
Within AWS Glue, how do I deal with files from S3 whose names change every week?
Example: Week 1: “filename01072018.csv” Week 2: “filename01142018.csv”
These files are set up in the same format, but I need Glue to handle the changing name each week so it can load this data from S3 into Redshift. The Glue code uses native Python as the backend.
Upvotes: 1
Views: 305
Reputation: 225
If you have two different TYPES of files (with different internal formats), they must be in separate folder hierarchies. There is no way to tell a crawler to only look for redfile*.csv and ignore bluefile*.csv. Instead, use separate hierarchies like:
s3://my-bucket/redfiles/
redfile01072018.csv
redfile01142018.csv
...
s3://my-bucket/bluefiles/
bluefile01072018.csv
bluefile01142018.csv
...
Set up two crawlers, one crawling s3://my-bucket/redfiles/ and the other crawling s3://my-bucket/bluefiles/.
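As a minimal sketch of that setup with boto3 (the crawler names, IAM role, and catalog database below are assumptions, not values from the question):

import boto3

glue = boto3.client("glue")

# Hypothetical names -- replace with your own role, database, and paths.
crawler_configs = [
    ("redfiles-crawler", "s3://my-bucket/redfiles/"),
    ("bluefiles-crawler", "s3://my-bucket/bluefiles/"),
]

for name, path in crawler_configs:
    # Create one crawler per folder hierarchy so each file type
    # gets its own table in the Glue Data Catalog.
    glue.create_crawler(
        Name=name,
        Role="MyGlueServiceRole",
        DatabaseName="my_glue_database",
        Targets={"S3Targets": [{"Path": path}]},
    )
    glue.start_crawler(Name=name)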
Upvotes: 0
Reputation: 443
AWS Glue should be able to process all the files in a folder in a single job, irrespective of their names. If you don't want old files to be processed again, move them to another S3 location using the boto3 API after each run.
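A minimal sketch of that move step, assuming a hypothetical processed/ prefix in the same bucket (the bucket and key names are placeholders):

import boto3

s3 = boto3.client("s3")

bucket = "my-bucket"               # placeholder bucket name
key = "filename01072018.csv"       # the file that was just processed
archived_key = "processed/" + key  # hypothetical archive prefix

# S3 has no native "move": copy the object, then delete the original.
s3.copy_object(
    Bucket=bucket,
    Key=archived_key,
    CopySource={"Bucket": bucket, "Key": key},
)
s3.delete_object(Bucket=bucket, Key=key)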
Upvotes: 0
Reputation: 640
AWS Glue crawlers should be able to find your CSV files as they are named, without any special configuration on your part.
For instance, my Kinesis stream produces files that have paths and names that look like these:
my_events_folder/2018/02/13/20/my-prefix-3-2018-02-13-20-18-28-112ab3f0-5794-4f77-9a84-83efafeecabc
my_events_folder/2018/02/13/20/my-prefix-2-2018-02-13-20-12-00-7f2efb62-827b-46a6-83c4-b4c52dd87d60
...
AWS Glue just finds these files and classifies them automatically. Hope this helps.
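To tie this back to the Redshift load in the question, here is a hedged sketch of a Glue (PySpark) job that reads whatever files the crawler has catalogued, regardless of the date stamp in their names, and writes them to Redshift. The database, table, connection, and temp-dir names are assumptions:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read every file the crawler has added to the catalog table,
# regardless of the week stamp in its filename.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_glue_database",    # hypothetical catalog database
    table_name="my_events_folder",  # hypothetical crawled table
)

# Write to Redshift through a pre-configured Glue connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=frame,
    catalog_connection="my-redshift-connection",  # hypothetical connection name
    connection_options={"dbtable": "my_table", "database": "my_redshift_db"},
    redshift_tmp_dir="s3://my-bucket/temp/",      # staging area Glue needs for Redshift loads
)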
Upvotes: 1