Reputation: 12264
I am investigating a Spark SQL job (Spark 1.6.0) that is performing poorly because the data is badly skewed across the 200 partitions; most of the data ends up in a single partition.
What I'm wondering is: is there anything in the Spark UI that can help me find out more about how the data is partitioned? From looking at it I can't tell which columns the DataFrame is partitioned on. How can I find that out (other than by reading the code; I'm wondering whether there is anything in the logs and/or the UI that could help)?
Additional details: this uses Spark's DataFrame API, Spark version 1.6. The underlying data is stored in Parquet format.
Upvotes: 1
Views: 3495
Reputation: 478
The Spark UI and logs will not be terribly helpful for this. Spark uses a simple hash partitioning algorithm as the default for almost everything. As you can see in the Spark source, this basically recycles the Java hashCode method.
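For reference, the default behaviour boils down to mapping each key's hashCode into the range [0, numPartitions). A minimal Scala sketch of that idea (not Spark's actual source, just an illustration; the key value and the 200-partition count are assumptions based on the question):

```scala
// Rough sketch of default hash partitioning: the partition index is the
// key's hashCode, folded into [0, numPartitions) so negatives wrap around.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

val numPartitions = 200                    // spark.sql.shuffle.partitions defaults to 200
val key: Any = "some-join-or-groupBy-key"  // placeholder key value
val partitionIndex = nonNegativeMod(key.hashCode, numPartitions)
```

Any set of keys whose hashCodes collide modulo the partition count will pile into the same partition, which is what produces the skew you are seeing.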
I would suggest the following:
First, compute the hashCode of the data (the key columns) using Spark and then take the modulus to see what the collision is.
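A rough Scala sketch of that check, assuming df is the skewed DataFrame and the shuffle is keyed on a single column, here called key_col (both names are placeholders, substitute your own):

```scala
import org.apache.spark.sql.functions._

val keyCol = "key_col"        // placeholder: the column your join/groupBy is keyed on
val numPartitions = 200       // matches the 200 shuffle partitions in the question

// Which distinct keys account for the most rows?
df.groupBy(keyCol).count().orderBy(desc("count")).show(20)

// Which of the 200 hash buckets does each row land in?
// This mirrors hashCode-modulo partitioning (nulls go to bucket 0).
val bucketCounts = df
  .select(keyCol)
  .rdd
  .map { row =>
    val h = Option(row.get(0)).map(_.hashCode).getOrElse(0)
    ((h % numPartitions) + numPartitions) % numPartitions
  }
  .countByValue()

bucketCounts.toSeq.sortBy(-_._2).take(10).foreach(println)
```

If one bucket dwarfs the others, the keys that hash into it are the source of the skew.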
Once you find the source of the collision you can try a few techniques to remove it: for example, use a better hashCode function for the key (the default one in Java isn't that great).

Upvotes: 2