Reputation: 772
I am running a Spark application that has a series of Spark SQL statements that are executed one after the other. The SQL queries are quite complex and the application is working (generating output). These days, I am working towards improving the performance of processing within Spark.
Please suggest whether Tungsten encoding has to be enabled separately or it kicks in automatically while running Spark SQL?
I am using Cloudera 5.13 for my cluster (2 node).
Upvotes: 2
Views: 1909
Reputation: 635
Tungsten became the default in Spark 1.5 and can be enabled in an earlier version by setting the spark.sql.tungsten.enabled = true. Even without Tungsten, SparkSQL uses a columnar storage format with Kyro serialization to minimize storage cost.
To make sure your code benefits as much as possible from Tungsten optimizations try to use the default Dataset API with Scala (instead of RDD).
Dataset brings the best of both worlds with a mix of relational (DataFrame) and functional (RDD) transformations. DataSet APIs are the most up to date and adds type-safety along with better error handling and far more readable unit tests.
Upvotes: 0
Reputation: 63201
It is enabled by default in spark 2.X (and maybe 1.6: but i'm not sure on that).
In any case you can do this
spark.sql.tungsten.enabled=true
That can be enabled on the spark-submit as follows:
spark-submit --conf spark.sql.tungsten.enabled=true
Tungsten
should be enabled if you see a *
next to the plan:
Also see: How to enable Tungsten optimization in Spark 2?
Upvotes: 3