zhifff

Reputation: 309

How do I read in tons of JSON files using glueContext.create_dynamic_frame_from_options?

I really hope someone can help me with this.

I want to read in all of the JSON files under the path "s3://.../year=2019/month=11/day=06/". How do I do that with glueContext.create_dynamic_frame_from_options?

If I do

    glueContext.create_dynamic_frame_from_options(
        "s3",
        format="json",
        connection_options={"paths": ["s3://.../year=2019/month=11/day=06/"]})

it won't work.

I had to list every single sub-prefix, and I feel there should be a better way. For example, I had to do this:

    df0 = glueContext.create_dynamic_frame_from_options(
        "s3",
        format="json",
        connection_options={"paths": [
            "s3://.../year=2019/month=11/day=06/hour=20/minute=12/",
            "s3://.../year=2019/month=11/day=06/hour=20/minute=13/",
            "s3://.../year=2019/month=11/day=06/hour=20/minute=14/",
            "s3://.../year=2019/month=11/day=06/hour=20/minute=15/",
            "s3://.../year=2019/month=11/day=06/hour=20/minute=16/",
            ....
        ]})

I have thousands of sub-prefixes to list, so I would really appreciate any guidance on how to make my life easier. Thank you!

Upvotes: 0

Views: 1490

Answers (2)

zhifff

Reputation: 309

I found the solution: use the "recurse" option when reading a large group of files.
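As a minimal sketch of what that looks like, with the bucket path elided just as in the question: passing "recurse": True in connection_options makes Glue read files in all sub-prefixes under the given paths, so only the top-level day prefix needs to be listed.

    # "recurse": True tells Glue to descend into every hour=/minute=
    # sub-prefix under day=06/, so only the top path needs listing.
    df0 = glueContext.create_dynamic_frame_from_options(
        "s3",
        format="json",
        connection_options={
            "paths": ["s3://.../year=2019/month=11/day=06/"],
            "recurse": True,
        })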

Upvotes: 1

Ngenator

Reputation: 11269

You're going to want to use a Glue Crawler to create tables in the Glue Data Catalog. You can then use the tables via:

# Read from the table the crawler created in the Data Catalog
glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="mytable")

This AWS blog post explains how to work with partitioned data in Glue: https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/
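Since the question's data is already laid out by year/month/day, you can also push a partition predicate down to the catalog read so only the partitions you need are loaded. A minimal sketch, reusing the hypothetical "mydb"/"mytable" names above and assuming the crawler registered year, month, and day as partition keys:

    # Load only the 2019-11-06 partition; the partition key names
    # (year, month, day) are assumed from the S3 path layout.
    df = glueContext.create_dynamic_frame.from_catalog(
        database="mydb",
        table_name="mytable",
        push_down_predicate="year='2019' and month='11' and day='06'")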

Upvotes: 0
