Reputation: 83
I have to migrate an application written in Scala 2.10.4 on Spark 1.6 to Spark 2.1.
The application processes text files of around 7 GB and contains several RDD transformations.
I was told that recompiling it with Scala 2.11 should be enough to make it work with Spark 2.1. This sounds strange to me, as I know Spark 2 introduces some relevant changes.
I managed to recompile the application against Spark 2 with Scala 2.11, with only minor changes due to Kryo serializer registration. I still have some runtime errors that I am trying to solve, and I am trying to figure out what will come next.
My question regards what changes are "necessary" in order to make the application work as before, what changes are "recommended" in terms of performance optimization (I need to keep at least the same level of performance), and whatever else you think could be useful for a newbie in Spark :).
Thanks in advance!
Upvotes: 0
Views: 418
Reputation: 27373
I did the same about a year ago; there are not many changes you need to make. What comes to mind:

- `spark`/`sqlContext`: the entry point in Spark 2 is `SparkSession`; if your code still needs `sqlContext`, just extract this variable from the `SparkSession` instance at the beginning of your code (see the sketch after this list).
- `df.map` switched to the RDD API in Spark 1.6; in Spark 2.+ you stay in the DataFrame API (which now has a `map` method). To get the same functionality as before, replace `df.map` with `df.rdd.map`. The same is true for `df.foreach`, `df.mapPartitions`, etc.
- `unionAll` in Spark 1.6 is just `union` in Spark 2.+.
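A minimal sketch of those three changes; the input paths and DataFrames (`df`, `other`) are made up, purely to illustrate the API:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object Spark2MigrationSketch {
  def main(args: Array[String]): Unit = {
    // Spark 2.x entry point; SparkContext and SQLContext are reachable from it
    val spark = SparkSession.builder()
      .appName("spark2-migration-sketch")
      .getOrCreate()
    val sc = spark.sparkContext        // for code that still needs the SparkContext
    val sqlContext = spark.sqlContext  // only for legacy code paths

    // hypothetical inputs, just to show the shape of the calls
    val df: DataFrame = spark.read.text("/path/to/input.txt")
    val other: DataFrame = spark.read.text("/path/to/other.txt")

    // Spark 1.6: df.map(...) dropped to the RDD API.
    // Spark 2.x: df.map(...) stays a Dataset; go through .rdd for the old behaviour.
    val lineLengths = df.rdd.map(row => row.getString(0).length)

    // Spark 1.6's unionAll is simply union in Spark 2.x (still no deduplication)
    val combined = df.union(other)

    println(lineLengths.take(5).mkString(", "))
    combined.show(5)

    spark.stop()
  }
}
```

Note that hopping through `.rdd` keeps the old semantics but loses the Catalyst/Tungsten optimizations, so where you can, staying in the DataFrame/Dataset API is usually the better choice for performance.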
What you should consider (but would require more work):
- `collect_list` can now be used with structs; in Spark 1.6 it could only be used with primitives. This can simplify some things (see the sketch after this list).
- The `leftanti` join type was introduced.
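A sketch of those two features, using made-up example data (`orders`, `blacklist`) just to show the API shape:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

object Spark2NewFeaturesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark2-new-features-sketch")
      .getOrCreate()
    import spark.implicits._

    // hypothetical data, purely for illustration
    val orders = Seq(
      ("alice", "book", 2),
      ("alice", "pen", 5),
      ("bob", "book", 1)
    ).toDF("user", "item", "qty")

    val blacklist = Seq("bob").toDF("user")

    // Spark 2.x: collect_list accepts a struct, so whole (item, qty) pairs
    // can be aggregated per key instead of only primitive columns
    val perUser = orders
      .groupBy($"user")
      .agg(collect_list(struct($"item", $"qty")).as("purchases"))

    // leftanti join: keep only the users that do NOT appear in blacklist
    val kept = perUser.join(blacklist, Seq("user"), "leftanti")

    kept.show(truncate = false)
    spark.stop()
  }
}
```

A `leftanti` join can replace the old `left outer join` + `filter(col.isNull)` pattern, which is one of the easier cleanups during the migration.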
Upvotes: 2