Swapnil Kashid

Reputation: 73

How to auto-generate the script for an AWS Glue job with the AWS Java SDK

I am creating a Glue job with the Java SDK. It has only two required parameters, Command and Glue version. But I need to create the job with auto script generation. In the console, we can choose "A proposed script generated by AWS Glue" and then add the data source, transform type, data target, schema, and so on. How can I add these parameters to a Glue job using the Java SDK, or even the AWS Glue API?

    CreateJobRequest req = new CreateJobRequest();
    req.setName("TestJob2");
    req.setRole("GlueS3Role");
    req.setGlueVersion("1.0");

    JobCommand command = new JobCommand();
    command.setName("glueetl");
    command.setPythonVersion("3");
    // The S3 location should not be needed, since the script is auto-generated by AWS Glue
    command.setScriptLocation(S3ScriptLocation);
    req.setCommand(command);

    AWSGlue glueClient = AWSGlueClientBuilder.standard()
            .withRegion(Regions.US_EAST_1)
            .withCredentials(new AWSStaticCredentialsProvider(creds))
            .build();

    glueClient.createJob(req);

Upvotes: 5

Views: 1117

Answers (2)

gkizior

Reputation: 28

What you are looking for is createScript(CreateScriptRequest request), a method of the AWSGlueClient class in the AWS Java SDK.

Unfortunately, the current version of the AWS Glue SDK does not include a simple way to generate ETL scripts. The AWS Glue console performs several operations behind the scenes when it generates an ETL script in the Create Job feature (you can see this in your browser's Network tab).

You can mimic this by building a "DAG".

You will need to build a Collection of CodeGenNode and a Collection of CodeGenEdge and add them to your CreateScriptRequest with

.withDagNodes(Collection<CodeGenNode> collection)

and

.withDagEdges(Collection<CodeGenEdge> collection)

I suggest you first generate an ETL script in the AWS console and cross-reference the result with the information in the "Generate Scala Code" example (linked here to help you better understand the "DAG").
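Because the DAG is just a list of nodes plus a list of edges that refer to node ids, it may help to catch a misspelled id locally before sending the request. The following is a minimal plain-Java sketch and not part of the AWS SDK: `DagCheck`, `Edge`, and `danglingEdges` are hypothetical names, `Edge` only stands in for the source/target pair of a CodeGenEdge, and the node ids are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Plain-Java stand-in for a Glue DAG; none of these names come from the AWS SDK. */
final class DagCheck {

    /** Mirrors the source/target pair carried by a CodeGenEdge. */
    static final class Edge {
        final String source;
        final String target;

        Edge(String source, String target) {
            this.source = source;
            this.target = target;
        }

        @Override
        public String toString() {
            return source + " -> " + target;
        }
    }

    /** Returns every edge whose source or target is not a declared node id. */
    static List<Edge> danglingEdges(Set<String> nodeIds, List<Edge> edges) {
        List<Edge> dangling = new ArrayList<>();
        for (Edge edge : edges) {
            if (!nodeIds.contains(edge.source) || !nodeIds.contains(edge.target)) {
                dangling.add(edge);
            }
        }
        return dangling;
    }

    public static void main(String[] args) {
        // Hypothetical node ids for a datasource -> applymapping -> datasink pipeline.
        Set<String> nodeIds = Set.of("datasource0", "applymapping1", "datasink2");
        List<Edge> edges = List.of(
                new Edge("datasource0", "applymapping1"),
                new Edge("applymapping1", "datasink3")); // misspelled target, on purpose
        System.out.println("Dangling edges: " + danglingEdges(nodeIds, edges));
    }
}
```

With the real SDK types, the same check could be fed from the ids passed to withId and the values passed to withSource and withTarget.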

I ended up explicitly building out this DAG structure. Here is a snippet of my solution:

    var dagNodes = new ArrayList<CodeGenNode>();
    var dagEdges = new ArrayList<CodeGenEdge>();

    //datasource
    dagEdges.add(new CodeGenEdge().withSource(dataSourceName).withTarget(applyMappingName));
    ArrayList<CodeGenNodeArg> dataSourceArgs = new ArrayList<CodeGenNodeArg>();
    dataSourceArgs.add(new CodeGenNodeArg().withName("database").withValue(String.format("\"%s\"", databaseName)));
    dataSourceArgs.add(new CodeGenNodeArg().withName("table_name").withValue(String.format("\"%s\"", tableName)));
    dataSourceArgs.add(new CodeGenNodeArg().withName("transformation_ctx").withValue(String.format("\"%s\"", dataSourceName)));
    dagNodes.add(new CodeGenNode().withId(dataSourceName).withNodeType("DataSource").withArgs(dataSourceArgs));

    // ... many more 'operations' can be built out the same way: datasource, applymapping, selectfields, resolvechoice, datasink

    var createScriptRequest = new CreateScriptRequest()
        .withDagEdges(dagEdges)
        .withDagNodes(dagNodes)
        .withLanguage(Language.PYTHON);

    var createScriptResult = awsGlueClient.createScript(createScriptRequest);

Then upload the generated script (the Python script returned in the createScript result) to S3 using the AmazonS3 client, and use that S3 path for setScriptLocation:

PutObjectResult putObject(String bucketName, String key, String content)
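The value given to setScriptLocation is an S3 path, so the bucket and key used for putObject have to be joined into the `s3://bucket/key` form. Here is a minimal sketch of that step; `ScriptLocation` and `s3Uri` are hypothetical helper names, and the bucket and key in `main` are placeholders.

```java
/** Hypothetical helper; not part of the AWS SDK. */
final class ScriptLocation {

    /** Joins a bucket name and an object key into the s3://bucket/key form. */
    static String s3Uri(String bucketName, String key) {
        if (bucketName == null || bucketName.isEmpty()) {
            throw new IllegalArgumentException("bucketName must not be empty");
        }
        if (key == null || key.isEmpty()) {
            throw new IllegalArgumentException("key must not be empty");
        }
        return "s3://" + bucketName + "/" + key;
    }

    public static void main(String[] args) {
        // Placeholder values; use the same bucket and key you pass to putObject.
        System.out.println(s3Uri("my-glue-scripts", "jobs/TestJob2.py"));
    }
}
```

After the upload, something like `command.setScriptLocation(ScriptLocation.s3Uri(bucketName, key))` would point the job at the generated script.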

Upvotes: 0

Taras Hnativ

Reputation: 11

I hope this implementation of an AWS Glue client, together with the logic to trigger a job, helps you implement automatic creation of a Glue job in the same way.

Glue Client:

public GlueClient createClient() {
        return GlueClient.builder()
                .region(Region.of(regionName))
                .credentialsProvider(ProfileCredentialsProvider.create(profileName))
                .build();
    }

Glue job runner:

public static String runGlueJob(GlueClient glueClient, String jobName, Map<String, String> glueArguments) {

        StartJobRunResponse response = glueClient.startJobRun(StartJobRunRequest.builder().jobName(jobName).arguments(glueArguments).build());
        String jobId = response.jobRunId();
        logger.info("JobId: " + jobId);

        return jobId;

    }
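Glue job argument names are conventionally written with a leading `--` (for example `--input_path`), which is easy to forget when building the map passed to runGlueJob. Below is a minimal sketch of a helper that adds the prefix where it is missing; `GlueArgs`, `withPrefix`, and the argument names and values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; not part of the AWS SDK. */
final class GlueArgs {

    /** Returns a copy of the map in which every key starts with "--". */
    static Map<String, String> withPrefix(Map<String, String> plainArgs) {
        Map<String, String> prefixed = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : plainArgs.entrySet()) {
            String name = entry.getKey();
            prefixed.put(name.startsWith("--") ? name : "--" + name, entry.getValue());
        }
        return prefixed;
    }

    public static void main(String[] args) {
        Map<String, String> plain = new LinkedHashMap<>();
        plain.put("input_path", "s3://my-bucket/input/");     // placeholder value
        plain.put("--output_path", "s3://my-bucket/output/"); // already prefixed, left as-is
        System.out.println(withPrefix(plain));
    }
}
```

The returned map could then be passed as the glueArguments parameter of runGlueJob.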

To create a new AWS Glue job definition, we can do the following:

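// The variables passed below are placeholders; supply your own values.
CreateJobResponse jobResponse = glueClient.createJob(CreateJobRequest.builder()
                .command(JobCommand.builder().pythonVersion("").scriptLocation("").name("").build())
                .defaultArguments(defaultArguments) // Map<String, String>
                .description(description)
                .glueVersion(glueVersion)
                .logUri(logUri)
                .name(jobName)
                .numberOfWorkers(numberOfWorkers)   // Integer
                .role(role)
                .tags(tags)                         // Map<String, String>
                .build());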

and then trigger the job with runGlueJob.

https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/glue/AWSGlueClient.html

Upvotes: 1
