Reputation: 395
Within AWS Glue, how do I deal with files from S3 whose names change every week?
Example: Week 1: “filename01072018.csv” Week 2: “filename01142018.csv”
These files are set up in the same format, but I need Glue to handle the changing name each week so it can load this data from S3 into Redshift. The Glue code uses native Python as the backend.
Upvotes: 1
Views: 305
Reputation: 225
If you have two different TYPES of files (with different internal formats), they must be in separate folder hierarchies. There is no way to tell a crawler to only look for redfile*.csv and ignore bluefile*.csv. Instead, use separate hierarchies like:
s3://my-bucket/redfiles/
redfile01072018.csv
redfile01142018.csv
...
s3://my-bucket/bluefiles/
bluefile01072018.csv
bluefile01142018.csv
...
Set up two crawlers, one crawling s3://my-bucket/redfiles/ and the other crawling s3://my-bucket/bluefiles/.
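As a minimal sketch of that setup with boto3 (the crawler names, IAM role, and catalog database below are assumptions, not values from the question):

import boto3

glue = boto3.client("glue")

# Hypothetical names -- replace with your own role, database, and paths.
crawler_configs = [
    ("redfiles-crawler", "s3://my-bucket/redfiles/"),
    ("bluefiles-crawler", "s3://my-bucket/bluefiles/"),
]

for name, path in crawler_configs:
    # Create one crawler per folder hierarchy so each file type
    # gets its own table in the Glue Data Catalog.
    glue.create_crawler(
        Name=name,
        Role="MyGlueServiceRole",
        DatabaseName="my_glue_database",
        Targets={"S3Targets": [{"Path": path}]},
    )
    glue.start_crawler(Name=name)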
Upvotes: 0
Reputation: 443
AWS Glue should be able to process all the files in a folder in a single job, irrespective of their names. If you don't want old files to be processed again, move them to another S3 location using the boto3 API after each run.
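A minimal sketch of that move step, assuming a hypothetical processed/ prefix in the same bucket (the bucket and key names are placeholders):

import boto3

s3 = boto3.client("s3")

bucket = "my-bucket"               # placeholder bucket name
key = "filename01072018.csv"       # the file that was just processed
archived_key = "processed/" + key  # hypothetical archive prefix

# S3 has no native "move": copy the object, then delete the original.
s3.copy_object(
    Bucket=bucket,
    Key=archived_key,
    CopySource={"Bucket": bucket, "Key": key},
)
s3.delete_object(Bucket=bucket, Key=key)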
Upvotes: 0
Reputation: 640
AWS Glue crawlers should be able to find your CSV files as they are named, without any special configuration on your part.
For instance, my Kinesis stream produces files that have paths and names that look like these:
my_events_folder/2018/02/13/20/my-prefix-3-2018-02-13-20-18-28-112ab3f0-5794-4f77-9a84-83efafeecabc
my_events_folder/2018/02/13/20/my-prefix-2-2018-02-13-20-12-00-7f2efb62-827b-46a6-83c4-b4c52dd87d60
...
AWS Glue just finds these files and classifies them automatically. Hope this helps.
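To tie this back to the Redshift load in the question, here is a hedged sketch of a Glue (PySpark) job that reads whatever files the crawler has catalogued, regardless of the date stamp in their names, and writes them to Redshift. The database, table, connection, and temp-dir names are assumptions:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read every file the crawler has added to the catalog table,
# regardless of the week stamp in its filename.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_glue_database",    # hypothetical catalog database
    table_name="my_events_folder",  # hypothetical crawled table
)

# Write to Redshift through a pre-configured Glue connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=frame,
    catalog_connection="my-redshift-connection",  # hypothetical connection name
    connection_options={"dbtable": "my_table", "database": "my_redshift_db"},
    redshift_tmp_dir="s3://my-bucket/temp/",      # staging area Glue needs for Redshift loads
)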
Upvotes: 1