Reputation: 21
We're migrating our Scala Spark programs from 1.6.3 to 2.2.0. The program in question has four parts: let's call them sections A, B, C and D. Section A parses the input (parquet files) and then caches the DF and creates a table. Then sections B, C and D do different kinds of processing on the DF one after another.
This spark job is run about 50 times an hour with varying number of input files depending on what files are currently available. The number of executors, cores, executor memory and number of agg partitions are calculated based on the number of input files and their sizes. These are calculated as such:
In our unit test environment we currently have 12 data nodes and the average file input size is 11MB / 630K records. So some examples would be:
Now the issue: analyzing 1095 runs with Spark 2.2 we see that Section C took on average 41.2 seconds and Section D took 47.73 seconds. In Spark 1.6.3 over 425 runs Section C took on average 23 seconds and Section D took 64.12 seconds. Sections A and B have nearly identical average runtimes between the two versions. Great improvement for Section D but big problem for Section C. This has not scaled well at all on our production cluster (314 datanodes).
Some details about Section C:
the DF from section A is joined on a smaller table (~1000 rows) four times and these are unioned together, basically like:
SELECT *
FROM t1 JOIN t2
WHERE t1.a = t2.x AND t2.z = 'a'
UNION ALL
SELECT *
FROM t1 JOIN t2
WHERE t1.b = t2.x AND t2.z = 'b'
UNION ALL
SELECT *
FROM t1 JOIN t2
WHERE t1.c = t2.x1 AND t1.d = t2.x2 AND t2.z = 'c'
UNION ALL
SELECT *
FROM t1 JOIN t2
WHERE t1.e = t2.y1 AND t1.f = t2.y2 AND t2.z = 'd'
This query is performed once more with slightly different parameters.
The result of each of those queries is cached because they are filtered a few more times which are then written out to parquet.
Is there anything off of the bat that would explain the discrepancy between spark 1.6 and 2.2?
I'm kind of new to spark tuning so please let me know if I can provide any more information.
Upvotes: 0
Views: 184