Reputation: 309
I really hope someone can help me with this.
I want to read all JSON files under the path "s3://.../year=2019/month=11/day=06/". How do I do that with glueContext.create_dynamic_frame_from_options?
If I call glueContext.create_dynamic_frame_from_options("s3", format="json", connection_options = {"paths": ["s3://.../year=2019/month=11/day=06/"]}), it doesn't work.
I had to list every single subfolder; I feel there should be a better way. For example, I had to do this: df0 = glueContext.create_dynamic_frame_from_options("s3", format="json", connection_options = {"paths": ["s3://.../year=2019/month=11/day=06/hour=20/minute=12/", "s3://.../year=2019/month=11/day=06/hour=20/minute=13/", "s3://.../year=2019/month=11/day=06/hour=20/minute=14/", "s3://.../year=2019/month=11/day=06/hour=20/minute=15/", "s3://.../year=2019/month=11/day=06/hour=20/minute=16/", ....]})
I have thousands of subfolders to list, so I would really appreciate any guidance on how to make my life easier. Thank you!
Upvotes: 0
Views: 1490
Reputation: 309
I found the solution: use the "recurse" option when reading a large group of files.
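For reference, a minimal sketch of what the call looks like with that option (the bucket path is elided here just as in the question):
df0 = glueContext.create_dynamic_frame_from_options(
    "s3",
    format="json",
    connection_options={
        "paths": ["s3://.../year=2019/month=11/day=06/"],
        "recurse": True  # descend into all subfolders under the day prefix
    })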
Upvotes: 1
Reputation: 11269
You're going to want to use a Glue Crawler to create tables in the Glue Data Catalog. You can then use the tables via
glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="mytable")
This AWS blog post explains how to deal with partitioned data in Glue: https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/
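Assuming the crawler registers year/month/day as partition keys (as the path layout in the question suggests), a sketch of reading just that one day with a push-down predicate, reusing the placeholder names from the snippet above, could look like:
df0 = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="mytable",
    # only scan the partitions for the requested day
    push_down_predicate="year='2019' and month='11' and day='06'")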
Upvotes: 0