Owais Ajaz

Reputation: 264

AWS Glue with Athena

We are in a phase where we are migrating all of our Spark jobs, written in Scala, to AWS Glue.

Current Flow: Apache Hive -> Spark (Processing/Transformation) -> Apache Hive -> BI

Required Flow: AWS S3 (Athena) -> AWS Glue (Spark Scala: Processing/Transformation) -> AWS S3 -> Athena -> BI

TBH, I got this task yesterday and am still doing R&D on it. My questions are:

  1. Can we run the same code in AWS Glue? Glue has DynamicFrames, which can be converted to DataFrames, but that requires code changes (see the sketch after this list).
  2. Can we read data from AWS Athena using the Spark SQL API in AWS Glue, like we normally do in Spark?
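
For reference, here is a minimal sketch from my R&D of what the DynamicFrame/DataFrame round trip looks like in Glue's Scala API (the database and table names are placeholders, and I haven't verified this end to end):

import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

val sc = new SparkContext()
val glueContext = new GlueContext(sc)

// Read a Glue Data Catalog table as a DynamicFrame
val dyf = glueContext.getCatalogSource(database = "default", tableName = "testhive")
  .getDynamicFrame()

// Convert to a plain Spark DataFrame so existing transformation code can run on it...
val df = dyf.toDF()

// ...and wrap the result back into a DynamicFrame for Glue sinks if needed
val dyfOut = DynamicFrame(df, glueContext)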

Upvotes: 0

Views: 1847

Answers (2)

Shubham Jain

Reputation: 5536

AWS Glue extends the capabilities of Apache Spark, so you can generally use your code as it is.

The only changes you need to make are to the creation of the session variables and the parsing of the provided arguments. You can run plain old PySpark code without even creating DynamicFrames.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

def createSession():
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    return sc, glueContext, spark, job

# To handle the arguments
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'arg1', 'arg2'])
arg1 = args['arg1'].split(',')
arg2 = args['arg2'].strip()

# To initialize the job
sc, glueContext, spark, job = createSession()
job.init(args['JOB_NAME'], args)
# your code here
job.commit()

And it also supports Spark SQL over the Glue Data Catalog.

Hope it helps

Upvotes: 1

Owais Ajaz

Reputation: 264

I am able to run my current code with minor changes. I built a SparkSession and used that session to query a Glue (Hive-enabled) Data Catalog table. Note that we need to add the --enable-glue-datacatalog parameter to the job:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SPARK-DEVELOPMENT").getOrCreate()
val sqlContext = spark.sqlContext
sqlContext.sql("use default")
sqlContext.sql("select * from testhive").show()

Upvotes: 0
