Reputation: 13
I have an AWS Glue job written in PySpark that loads data from an S3/Glue catalog database into Snowflake. How can I pass table names as parameters and run the AWS Glue job in parallel?
Can this be done inside the Glue job itself, or do I need a Lambda function?
Please suggest and share any code/articles.
Thank you in advance.
Thanks, Jo
Upvotes: 0
Views: 2045
Reputation: 10144
AWS Glue lets you supply your own script, so it's very flexible. You can pass table names as job parameters:
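For example, a comma-separated list of tables can be read inside the job. In a real Glue job you would use `getResolvedOptions` from `awsglue.utils` (shown in the comments); the hand-rolled parser below mimics the same behavior so the sketch runs outside Glue. The parameter name `--table_names` is an assumption — use whatever key you pass when starting the job.

```python
import sys

# Inside an actual Glue job you would write:
#   from awsglue.utils import getResolvedOptions
#   args = getResolvedOptions(sys.argv, ["JOB_NAME", "table_names"])
#   table_names = [t.strip() for t in args["table_names"].split(",")]
# The function below parses the same argv style by hand so this
# sketch is runnable anywhere.
def parse_table_names(argv):
    """Extract the comma-separated value of --table_names from argv."""
    for i, token in enumerate(argv):
        if token == "--table_names" and i + 1 < len(argv):
            return [t.strip() for t in argv[i + 1].split(",")]
    return []

tables = parse_table_names(["glue_job.py", "--table_names", "orders, customers,payments"])
print(tables)  # ['orders', 'customers', 'payments']
```

When starting the job (console, CLI, or API), you would pass `--table_names orders,customers,payments` as a job argument.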
In this case, the Glue job can process these tables sequentially:
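A minimal sequential sketch: one job loops over the parsed table list. `copy_table` is a hypothetical helper; in a real job its body would read from the catalog and write through the Snowflake Spark connector, roughly as shown in the comments (database name and connector options are assumptions).

```python
# copy_table is a hypothetical helper. In a real Glue job its body
# would be something like:
#   dyf = glueContext.create_dynamic_frame.from_catalog(
#       database="my_db", table_name=table_name)
#   dyf.toDF().write.format("net.snowflake.spark.snowflake") \
#       .options(**sf_options).option("dbtable", table_name) \
#       .mode("overwrite").save()
def copy_table(table_name):
    print(f"copying {table_name} from the Glue catalog to Snowflake")
    return table_name

tables = ["orders", "customers", "payments"]
processed = [copy_table(t) for t in tables]
print(processed)  # ['orders', 'customers', 'payments']
```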
If you want to run a separate Glue job run for each table so they are processed in parallel, then pass only one table name to the job and start the same job multiple times, each time with a different table name.
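That fan-out can be done from a Lambda function (or any script) with `boto3`'s `start_job_run`, one call per table. The job name and the `--table_name` argument key are assumptions; note the job's `MaxConcurrentRuns` setting must allow that many parallel runs. A stub client is included so the sketch can be exercised without AWS credentials.

```python
def launch_parallel_runs(job_name, tables, client=None):
    """Start one Glue job run per table.

    Assumes the Glue job reads a --table_name parameter via
    getResolvedOptions, and that MaxConcurrentRuns on the job
    allows len(tables) simultaneous runs.
    """
    if client is None:
        import boto3  # only imported when talking to real AWS
        client = boto3.client("glue")
    run_ids = []
    for t in tables:
        # start_job_run is the real Glue API call; Arguments keys
        # must start with "--" to be visible inside the job script.
        resp = client.start_job_run(JobName=job_name,
                                    Arguments={"--table_name": t})
        run_ids.append(resp["JobRunId"])
    return run_ids

# Stub client so the sketch runs without AWS credentials.
class _StubGlue:
    def __init__(self):
        self.calls = []
    def start_job_run(self, JobName, Arguments):
        self.calls.append((JobName, Arguments))
        return {"JobRunId": f"jr_{len(self.calls)}"}

stub = _StubGlue()
run_ids = launch_parallel_runs("s3-to-snowflake-job", ["orders", "customers"], client=stub)
print(run_ids)  # ['jr_1', 'jr_2']
```

Calling `launch_parallel_runs` with no `client` argument would start real job runs against your account.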
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html
Glue launches an EMR cluster sized according to the "Number of workers" setting.
I do not know how many tables you will process or how often the Glue job will be called, but it may be better to process the tables sequentially with a bigger cluster to make full use of the resources.
Upvotes: 2