Reputation: 95
I have a PySpark script which I can run in AWS Glue, but every time I have to create the job from the UI and copy my code into it. Is there any way I can automatically create the job from my file in an S3 bucket? (I already have all the libraries and the Glue context that will be used while running.)
Upvotes: 3
Views: 8189
Reputation: 4788
I created an open source library called datajob to deploy and orchestrate Glue jobs. You can find it on GitHub https://github.com/vincentclaes/datajob and on PyPI:
pip install datajob
npm install -g aws-cdk
You create a file datajob_stack.py that describes your Glue jobs and how they are orchestrated:
from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow

with DataJobStack(stack_name="data-pipeline-simple") as datajob_stack:

    # here we define 3 glue jobs with a relative path to the source code.
    task1 = GlueJob(
        datajob_stack=datajob_stack,
        name="task1",
        job_path="data_pipeline_simple/task1.py",
    )
    task2 = GlueJob(
        datajob_stack=datajob_stack,
        name="task2",
        job_path="data_pipeline_simple/task2.py",
    )
    task3 = GlueJob(
        datajob_stack=datajob_stack,
        name="task3",
        job_path="data_pipeline_simple/task3.py",
    )

    # we instantiate a step functions workflow and add the sources
    # we want to orchestrate.
    with StepfunctionsWorkflow(
        datajob_stack=datajob_stack, name="data-pipeline-simple"
    ) as sfn:
        [task1, task2] >> task3
To deploy your code to Glue, execute:
export AWS_PROFILE=my-profile
datajob deploy --config datajob_stack.py
Any feedback is much appreciated!
Upvotes: 2
Reputation: 2144
You may write a shell script that does it for you.
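A minimal sketch of such a shell script, using the AWS CLI's `aws s3 cp` and `aws glue create-job` commands. The bucket, script, job, and role names here are placeholders you would replace with your own; the actual AWS calls are commented out since they require credentials:

```shell
set -eu

BUCKET="my-bucket"                                  # placeholder: your S3 bucket
SCRIPT_LOCATION="s3://${BUCKET}/scripts/my_job.py"  # where the script will live
COMMAND_JSON="{\"Name\": \"glueetl\", \"ScriptLocation\": \"${SCRIPT_LOCATION}\"}"

# Upload the script, then create the job pointing at it (needs AWS credentials):
# aws s3 cp my_job.py "${SCRIPT_LOCATION}"
# aws glue create-job \
#   --name my-glue-job \
#   --role MyGlueServiceRole \
#   --command "${COMMAND_JSON}" \
#   --default-arguments '{"--job-bookmark-option": "job-bookmark-enable"}'

echo "${COMMAND_JSON}"
```

Re-running `aws s3 cp` alone is enough to update the code of an existing job, since the job only stores the S3 location.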
Upvotes: 0
Reputation: 4750
Another alternative is to use AWS CloudFormation. You can define all the AWS resources you want to create (not only Glue jobs) in a template file, and then update the stack whenever you need, from the AWS Console or using the CLI.
A template for a Glue job would look like this:
MyJob:
  Type: AWS::Glue::Job
  Properties:
    Command:
      Name: glueetl
      ScriptLocation: "s3://aws-glue-scripts//your-script-file.py"
    DefaultArguments:
      "--job-bookmark-option": "job-bookmark-enable"
    ExecutionProperty:
      MaxConcurrentRuns: 2
    MaxRetries: 0
    Name: cf-job1
    Role: !Ref MyJobRole # reference to a Role resource which is not presented here
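Deploying from the CLI can be sketched like this, using `aws cloudformation deploy` (which creates the stack on the first run and updates it afterwards). The template and stack names are placeholders; the actual call is commented out since it requires AWS credentials:

```shell
set -eu

TEMPLATE="glue-jobs.yml"  # placeholder: file containing the template above
STACK="my-glue-jobs"      # placeholder: your stack name

# Create or update the stack (needs AWS credentials; CAPABILITY_NAMED_IAM is
# required because the template creates an IAM role):
# aws cloudformation deploy \
#   --template-file "${TEMPLATE}" \
#   --stack-name "${STACK}" \
#   --capabilities CAPABILITY_NAMED_IAM

echo "deploy ${TEMPLATE} as stack ${STACK}"
```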
Upvotes: 5
Reputation: 1939
Yes, it is possible. For instance, you can use the boto3 library for this purpose.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html
Upvotes: 0