Doug T

Reputation: 21

Spark program twice as slow in Spark 2.2 as in Spark 1.6

We're migrating our Scala Spark programs from 1.6.3 to 2.2.0. The program in question has four parts; let's call them Sections A, B, C and D. Section A parses the input (parquet files), caches the resulting DataFrame and registers it as a table. Sections B, C and D then run different kinds of processing on that DataFrame, one after another.
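In rough outline the driver looks something like the sketch below (a simplified Spark 2.2-style skeleton with placeholder paths, names and section stubs, not our actual code):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object ExampleJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ExampleJob").getOrCreate()

        // Section A: parse the parquet input, cache the DataFrame and register a table
        val input: DataFrame = spark.read.parquet("/path/to/input") // placeholder path
        input.cache()
        input.createOrReplaceTempView("input_table")

        // Sections B, C and D each run their own processing against the cached DataFrame
        sectionB(spark, input)
        sectionC(spark, input)
        sectionD(spark, input)

        spark.stop()
      }

      // Stubs standing in for the real processing sections
      private def sectionB(spark: SparkSession, df: DataFrame): Unit = ()
      private def sectionC(spark: SparkSession, df: DataFrame): Unit = ()
      private def sectionD(spark: SparkSession, df: DataFrame): Unit = ()
    }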

This Spark job runs about 50 times an hour with a varying number of input files, depending on what files are currently available. The number of executors, cores per executor, executor memory and number of aggregation partitions are calculated from the number of input files and their sizes, roughly along these lines:
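(The thresholds and constants in this sketch are made-up placeholders standing in for our real values; it only shows the shape of the sizing heuristic.)

    // Illustrative sizing heuristic -- constants below are placeholders, not our production values.
    case class JobSizing(numExecutors: Int, executorCores: Int, executorMemoryGb: Int, aggPartitions: Int)

    def sizeJob(numInputFiles: Int, totalInputBytes: Long): JobSizing = {
      val totalInputMb = totalInputBytes / (1024L * 1024L)

      // Scale executors with total input volume, capped to a fixed band
      val numExecutors = math.max(2, math.min(40, (totalInputMb / 256).toInt + 1))

      // Cores and memory per executor grow with the per-executor share of the input
      val executorCores    = if (totalInputMb / numExecutors > 512) 4 else 2
      val executorMemoryGb = if (totalInputMb / numExecutors > 512) 8 else 4

      // Aggregation partitions roughly track total parallelism
      val aggPartitions = math.max(numInputFiles, numExecutors * executorCores * 2)

      JobSizing(numExecutors, executorCores, executorMemoryGb, aggPartitions)
    }

    // e.g. sizeJob(50, 50L * 11 * 1024 * 1024) for 50 files of ~11 MB each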

In our unit test environment we currently have 12 data nodes and the average input file size is 11 MB / 630K records. So some examples would be:

Now the issue: analyzing 1095 runs on Spark 2.2, Section C took 41.2 seconds on average and Section D took 47.73 seconds. On Spark 1.6.3, over 425 runs, Section C averaged 23 seconds and Section D 64.12 seconds. Sections A and B have nearly identical average runtimes between the two versions. That's a great improvement for Section D but a big problem for Section C, and the regression has not scaled well at all on our production cluster (314 data nodes).

Some details about Section C:

Is there anything off the bat that would explain the discrepancy between Spark 1.6 and 2.2?

I'm fairly new to Spark tuning, so please let me know if I can provide any more information.

Upvotes: 0

Views: 184

Answers (0)
