Antonio Paladini

Reputation: 83

What changes do I have to make to migrate an application from Spark 1.6 to Spark 2.1?

I have to migrate an application written in Scala 2.10.4 using Spark 1.6 to Spark 2.1.

The application processes text files of around 7 GB and contains several RDD transformations.

I was told to try recompiling it with Scala 2.11, which should be enough to make it work with Spark 2.1. This sounds strange to me, as I know Spark 2 introduces some significant changes.

I managed to recompile the application on Spark 2 with Scala 2.11 with only minor changes related to Kryo serializer registration. I still have some runtime errors that I am trying to solve, and I am trying to figure out what comes next.

My question is about which changes are "necessary" to make the application work as before, which changes are "recommended" in terms of performance optimization (I need to keep at least the same level of performance), and anything else you think could be useful for a Spark newbie :).

Thanks in advance!

Upvotes: 0

Views: 418

Answers (1)

Raphael Roth

Reputation: 27373

I did the same migration a year ago. There are not many changes you need to make; here is what comes to mind:

  • if your code is cluttered with references to sparkContext/sqlContext, just extract these variables from the SparkSession instance at the beginning of your code (first sketch after this list)
  • df.map switched to the RDD API in Spark 1.6; in Spark 2.x you stay in the DataFrame/Dataset API, which now has its own map method. To get the same functionality as before, replace df.map with df.rdd.map. The same is true for df.foreach, df.mapPartitions, etc. (second sketch below)
  • unionAll in Spark 1.6 is just union in Spark 2.x (also in the second sketch)
  • The Databricks spark-csv library is now included in Spark (third sketch below)
  • When you insert into a partitioned Hive table, the partition columns must now come last in the schema; in Spark 1.6 they had to come first (last sketch below)
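
A minimal sketch of the SparkSession extraction; the app name is just a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Spark 2.x has a single entry point that replaces the old
// SparkContext/SQLContext/HiveContext setup
val spark = SparkSession.builder()
  .appName("MyMigratedApp")   // placeholder
  .enableHiveSupport()        // only if you used HiveContext before
  .getOrCreate()

// extract the old entry points once, so the rest of the code stays untouched
val sc = spark.sparkContext
val sqlContext = spark.sqlContext   // kept for backwards compatibility
```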
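
For the map and unionAll points, a sketch with made-up DataFrames, reusing the spark session from above:

```scala
import spark.implicits._   // provides the Encoders needed by the typed map

val df = Seq("alpha", "beta").toDF("word")
val otherDf = Seq("gamma").toDF("word")

// Spark 1.6: df.map(...) switched to RDD[T]
// Spark 2.x: map stays in the Dataset API and needs an Encoder
val lengths = df.map(row => row.getString(0).length)         // Dataset[Int]

// to get exactly the old behaviour, go through the RDD explicitly
val lengthsRdd = df.rdd.map(row => row.getString(0).length)  // RDD[Int]

// unionAll was renamed
val combined = df.union(otherDf)   // Spark 1.6: df.unionAll(otherDf)
```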
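
Reading CSV without the external package (the path is a placeholder):

```scala
// Spark 1.6 needed the external package:
//   sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("data.csv")
// Spark 2.x ships a csv source out of the box:
val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data.csv")
```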
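
And the partition ordering, sketched against a hypothetical table events partitioned by dt:

```scala
import org.apache.spark.sql.functions.col

// eventsDf is assumed to have the columns (value, dt); the target table
// was created like: CREATE TABLE events (value STRING) PARTITIONED BY (dt STRING)
eventsDf
  .select(col("value"), col("dt"))   // partition column dt goes last now
  .write
  .mode("append")
  .insertInto("events")              // insertInto matches columns by position
```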

What you should also consider (though it would require more work):

  • migrate RDD code into Dataset code (first sketch below)
  • enable CBO (the cost-based optimizer; second sketch below)
  • collect_list can now be used with structs; in Spark 1.6 it could only be used with primitives. This can simplify some things (third sketch below)
  • the Datasource API was improved/unified
  • a leftanti join type was introduced (last sketch below)
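
A rough sketch of RDD code turned into Dataset code; the Person class and the data are made up:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// case classes used in Datasets should be defined at top level
case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("DatasetDemo").getOrCreate()
import spark.implicits._

// instead of sc.parallelize(...).filter(...).map(...) on an RDD,
// the same lambdas run on a Dataset and benefit from Tungsten's encoders
val people: Dataset[Person] = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()
val adultNames = people.filter(_.age >= 21).map(_.name)   // Dataset[String]
```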
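
Enabling CBO, sketched against a hypothetical sales table. As far as I remember, CBO only arrived in Spark 2.2, so this applies once you move past 2.1:

```scala
// CBO relies on table and column statistics being collected first
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")

spark.conf.set("spark.sql.cbo.enabled", "true")   // off by default
```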
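
collect_list over structs, reusing the session and implicits from the sketch above; the data is made up:

```scala
import org.apache.spark.sql.functions.{collect_list, struct}

val orders = Seq((1, "o1", 10.0), (1, "o2", 5.0), (2, "o3", 7.5))
  .toDF("customer_id", "order_id", "amount")

// Spark 1.6 only allowed primitives inside collect_list; with structs,
// whole rows can now be collapsed into one array column per group
val grouped = orders
  .groupBy("customer_id")
  .agg(collect_list(struct($"order_id", $"amount")).as("orders"))
```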
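
And the leftanti join, reusing the orders DataFrame from the previous sketch:

```scala
val customers = Seq((1, "Ann"), (2, "Bob"), (3, "Cat")).toDF("customer_id", "name")

// keeps only customers with no matching row in orders; before Spark 2 this
// took a left outer join followed by a filter on the null side
val withoutOrders = customers.join(orders, Seq("customer_id"), "leftanti")
```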

Upvotes: 2
