Reputation: 264
We are in a phase where we are migrating all of our Spark jobs written in Scala to AWS Glue.
Current Flow: Apache Hive -> Spark (Processing/Transformation) -> Apache Hive -> BI
Required Flow: AWS S3 (Athena) -> AWS Glue (Spark Scala: Processing/Transformation) -> AWS S3 -> Athena -> BI
TBH I got this task yesterday and I am doing R&D on it. My questions are:
Upvotes: 0
Views: 1847
Reputation: 5536
AWS Glue extends the capabilities of Apache Spark, so you can largely use your code as it is.
The only changes you need to make are how the session variables are created and how the job arguments are parsed. You can run plain old PySpark code without even creating DynamicFrames.
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

def createSession():
    # Create the SparkContext, GlueContext, Spark session and Glue job handle
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    return sc, glueContext, spark, job

# To handle the arguments passed to the job
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'arg1', 'arg2'])
arg1 = args['arg1'].split(',')
arg2 = args['arg2'].strip()

# To initialize the job
sc, glueContext, spark, job = createSession()
job.init(args['JOB_NAME'], args)

# your code here

job.commit()
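Since your existing jobs are in Scala, the equivalent boilerplate for a Glue Scala job looks roughly like the sketch below. It is based on my reading of the AWS Glue Scala API (GlueContext, GlueArgParser, Job), so treat it as a starting point rather than a drop-in; arg1 here just mirrors the Python example above.

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    // Glue provides the Spark runtime; you only create the contexts
    val sc = new SparkContext()
    val glueContext = new GlueContext(sc)
    val spark = glueContext.getSparkSession

    // Resolve the arguments passed to the job run
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME", "arg1").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // your existing Spark/Scala transformation code here

    Job.commit()
  }
}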
It also supports Spark SQL over the Glue Data Catalog.
Hope it helps
Upvotes: 1
Reputation: 264
I am able to run my current code with minor changes. I built a SparkSession and used that session to query a Glue Data Catalog (Hive-enabled) table. We need to add the --enable-glue-datacatalog parameter to the job.
import org.apache.spark.sql.SparkSession

// Requires the job parameter --enable-glue-datacatalog
val a = SparkSession.builder().appName("SPARK-DEVELOPMENT").getOrCreate()
val sqlContext = a.sqlContext
sqlContext.sql("use default")
sqlContext.sql("select * from testhive").show()
Upvotes: 0