Gyan Kumar Mishra

Reputation: 103

Can we use java for ETL in AWS Glue?

Can we use Java for ETL in AWS Glue? It seems there are only two options for Glue ETL programming: Python and Scala.

Upvotes: 3

Views: 7938

Answers (2)

the.1337.house

Reputation: 170

A bit late to answer, but hopefully this helps someone in the future.

You actually can run Java code, but not by supplying the source code in the "Script" section as you can with Python or Scala.

The Glue environment contains a JRE (currently fixed at version 1.8) because Scala is a JVM-based language and needs one to run. To run Java, you need to ship your code as a .jar and find a way to invoke it.
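Since the JRE version is fixed, it can be worth confirming what the job actually sees before launching a .jar built for a particular Java release. The following is only a minimal sketch: `parse_java_major` and `detect_java_major` are hypothetical helper names, and it assumes `java` is on the PATH of the Glue instance.

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Extract the major Java version from the banner printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError(f"Unrecognised `java -version` output: {version_output!r}")
    major, minor = match.group(1), match.group(2)
    # Legacy numbering: "1.8.0_xxx" means Java 8.
    if major == "1" and minor is not None:
        return int(minor)
    return int(major)


def detect_java_major() -> int:
    """Run `java -version` and return the major version of the JRE on the PATH."""
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return parse_java_major(result.stderr)
```

Calling `detect_java_major()` at the top of the job script and logging the result makes a class-version mismatch easier to diagnose than a stack trace from the child JVM.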

In our case we use Python to trigger a sub-process like:

import sys
import json
import boto3
import subprocess

...

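# Launch the .jar in a child JVM; arguments after the jar path go to the application
result = subprocess.run([
    'java', '-jar', '<your_jar>.jar', '--foo=bar'
])

print('Response code: ' + str(result.returncode))

# Fail the Glue job when the Java process exits with a non-zero code
if result.returncode != 0:
    raise Exception(f"Glue job failed with exit code: {result.returncode}")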

Now you'll ask: how do I get access to my .jar? One answer is S3. In the job's Advanced properties section (expand it), there is a Libraries section.

There, add the fully qualified S3 path of your .jar to the Dependent JARs path field. The runtime path for a Glue instance (3.0 and 4.0, at the time of writing) is /tmp, and that is where the .jar is copied while the instance initializes. That's why you can execute it with a relative path, implicitly pointing to ./.
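If you would rather not rely on the working directory, you can resolve the .jar explicitly and fail early when it is missing. This is a rough sketch, not something from our job: `find_jar` is a hypothetical helper, and the `/tmp` default simply reflects the behaviour described above, which may differ in other Glue versions.

```python
from pathlib import Path


def find_jar(jar_name: str, search_dir: str = "/tmp") -> Path:
    """Return the path of `jar_name` inside `search_dir`, or raise if it is absent."""
    jar_path = Path(search_dir) / jar_name
    if not jar_path.is_file():
        # List the jars that are present to make a typo or wrong S3 path obvious.
        available = sorted(p.name for p in Path(search_dir).glob("*.jar"))
        raise FileNotFoundError(
            f"{jar_name} not found in {search_dir}; jars present: {available}"
        )
    return jar_path
```

The returned path can be passed as a string to `subprocess.run` in place of the relative file name.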

From experience, performance is not bad at all (no different from an EC2 instance running the same .jar), but you may need to tune the spawned JVM to get better results. We're using Spring Boot in headless mode with command-line modules, and it works great.
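As an illustration of tuning the spawned JVM, options can be passed on the command line ahead of `-jar`. The sketch below is an assumption about how one might structure it: `build_java_command` and `run_jar` are hypothetical helpers, and any option values you pass (for example a heap size via `-Xmx`) depend on your worker type and workload.

```python
import subprocess
from typing import List, Sequence


def build_java_command(
    jar_path: str,
    app_args: Sequence[str] = (),
    jvm_opts: Sequence[str] = (),
) -> List[str]:
    """Assemble the argv list for `java [jvm_opts] -jar <jar> [app_args]`."""
    # JVM options must come before -jar; anything after the jar path is
    # handed to the application's main method instead.
    return ["java", *jvm_opts, "-jar", jar_path, *app_args]


def run_jar(
    jar_path: str,
    app_args: Sequence[str] = (),
    jvm_opts: Sequence[str] = (),
) -> None:
    """Run the jar in a child JVM and raise if it exits with a non-zero code."""
    result = subprocess.run(build_java_command(jar_path, app_args, jvm_opts))
    if result.returncode != 0:
        raise RuntimeError(f"Java process failed with exit code: {result.returncode}")
```

For example, `run_jar('<your_jar>.jar', ['--foo=bar'], ['-Xmx4g'])` would start the same application as in the snippet above with a 4 GB maximum heap; the `4g` figure is purely illustrative.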

Edit: We initially decided to wrap the call in Python because we weren't sure how to trigger the Spring Boot application from Scala. Digging into the internals of the standalone Spring Boot jar, we found that the entry point is the JarLauncher class, as defined in the Manifest file. To trigger a standalone Spring Boot jar, add it to the classpath as explained above, switch the Glue job's language to Scala 2.0, and include this Scala snippet:

import org.springframework.boot.loader.JarLauncher

object DemoApp {
  def main(args: Array[String]): Unit = {
    JarLauncher.main(args)
  }
}

Note: our Spring Boot dependencies are pinned to 2.5.9, which is Java 8-based.

Upvotes: 0

Geek Logbook

Reputation: 606

No.

Q: What programming language can I use to write my ETL code for AWS Glue?

You can use either Scala or Python.

Resource: AWS Glue FAQ

Upvotes: 4
