Reputation: 6657
I am relatively new to AWS and this may be a bit less technical question, but at present AWS Glue notes a maximum of 25 jobs permitted to be created. We are loading in a series of tables that each have their own job that subsequently appends audit columns. Each job is very similar, but simply changes the connection string source and target.
Is there a way to parameterize these jobs to allow for reuse and simply pass the proper connection strings to them? Or even possibly loop through a set connection strings in a master job that would call a child job passing the varying connection strings through?
Any examples or documentation would be most appreciated
Upvotes: 28
Views: 69089
Reputation: 63
We can provide user defined parameters/args by using appropiate %%configure magic in your notebook like below, you can also check the parameters reflecting in the job details > advanced properties > parameters :
%%configure { "--input0": "s3://bucket1/", "--input1": "s3://bucket2/" }
https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-magics.html
Upvotes: 0
Reputation: 1939
In the below example I present how to use Glue job input parameters in the code. This code takes the input parameters and it writes them to the flat file.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
args = getResolvedOptions(sys.argv, ['JOB_NAME','VAL1','VAL2','VAL3','DEST_FOLDER'])
job.init(args['JOB_NAME'], args)
v_list=[{"VAL1":args['VAL1'],"VAL2":args['VAL2'],"VAL3":args['VAL3']}]
df=sc.parallelize(v_list).toDF()
df.repartition(1).write.mode('overwrite').format('csv').options(header=True, delimiter = ';').save("s3://"+ args['DEST_FOLDER'] +"/")
job.commit()
import boto3
def lambda_handler(event, context):
glue = boto3.client('glue')
myJob = glue.create_job(Name='example_job2', Role='AWSGlueServiceDefaultRole',
Command={'Name': 'glueetl','ScriptLocation': 's3://aws-glue-scripts/example_job'},
DefaultArguments={"VAL1":"value1","VAL2":"value2","VAL3":"value3"}
)
glue.start_job_run(JobName=myJob['Name'], Arguments={"VAL1":"value11","VAL2":"value22","VAL3":"value33"})
Useful links:
Upvotes: 50