jamiet

Reputation: 12264

Identifying why data is skewed in Spark

I am investigating a Spark SQL job (Spark 1.6.0) that is performing poorly due to badly skewed data across the 200 partitions; most of the data ends up in one partition, as shown in the Spark UI task table (Index, ID, Attempt, Status, Locality Level, Executor ID/Host, Launch Time, Duration, GC Time, Shuffle Read Size/Records). What I'm wondering is: is there anything in the Spark UI to help me find out more about how the data is partitioned? From looking at this I can't tell which columns the dataframe is partitioned on. How can I find that out (other than looking at the code - I'm wondering if there's anything in the logs and/or UI that could help me)?

Additional details: this is using Spark's dataframe API, Spark version 1.6. The underlying data is stored in Parquet format.
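
For context, a minimal sketch (Scala, Spark 1.6 API) of one way to confirm the per-partition row counts; `df` here is just a placeholder for the DataFrame feeding the slow stage, not a name from the actual job:

    // Count rows per partition of the DataFrame feeding the slow stage.
    // `df` is a placeholder (Scala, Spark 1.6).
    val counts = df.rdd
      .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()

    // Print the ten largest partitions to confirm the skew.
    counts.sortBy(-_._2).take(10).foreach { case (idx, n) =>
      println(s"partition $idx -> $n rows")
    }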

Upvotes: 1

Views: 3495

Answers (1)

Ed Kohlwey

Reputation: 478

The Spark UI and logs will not be terribly helpful for this. Spark uses a simple hash partitioning algorithm as the default for almost everything; as you can see in the HashPartitioner source, this basically recycles the Java hashCode method.
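
A minimal sketch of that arithmetic (not Spark's actual source, just the same idea: a non-negative hashCode modulo the partition count):

    // Simplified sketch of how the default hash partitioning picks a partition.
    def partitionFor(key: Any, numPartitions: Int): Int = {
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
    }

    // e.g. with the default 200 shuffle partitions, every key whose hashCode
    // is congruent modulo 200 lands in the same partition.
    println(partitionFor("some_key", 200))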

I would suggest the following:

  • Try to debug by sampling and printing the contents of the RDD or data frame. See if there are obvious issues with the distribution of the key (i.e. low variance or low cardinality).
  • If that's ineffective, you can work back from the logs and UI to figure out how many partitions there are. You can then compute the hashCode of the keys using Spark and take the modulus against the partition count to see where the collisions are (see the sketch after this list).
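
A rough sketch of that second bullet (Scala, Spark 1.6); `df`, the key column name "myKey", and the partition count 200 (the default for spark.sql.shuffle.partitions) are assumptions, not details from the question:

    val numPartitions = 200

    // Hash the key yourself and take the modulus to see how rows spread
    // over the shuffle partitions.
    val rowsPerPartition = df.rdd
      .map { row =>
        val key = row.getAs[Any]("myKey")
        val hash = if (key == null) 0 else key.hashCode
        val p = ((hash % numPartitions) + numPartitions) % numPartitions
        (p, 1L)
      }
      .reduceByKey(_ + _)
      .collect()
      .sortBy(-_._2)

    rowsPerPartition.take(10).foreach { case (p, n) =>
      println(s"partition $p -> $n rows")
    }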

Once you find the source of the collision you can try a few techniques to remove it:

  • See if there's a better key you can use
  • See if you can improve the hashCode function of the key (the default one in Java isn't that great)
  • See if you can process the data in two steps by doing an initial scatter/gather step to force some parallelism and reduce the processing overhead for that one partition (a sketch follows this list). This is probably the trickiest of the optimizations mentioned here to get right. Basically, partition the data once using a random number generator to force some initial parallel combining of the data, then push it through again with the natural partitioner to get the final result. This requires that the operation you're applying be associative and commutative. This technique hits the network twice and is therefore very expensive unless the data really is that highly skewed.
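
A minimal sketch of that two-step (salted) aggregation for a skewed sum, using the Scala RDD API; it assumes a SparkContext `sc` as in spark-shell, and the keys, values and salt count are illustrative only:

    import scala.util.Random

    val saltBuckets = 16

    // Hypothetical skewed input: almost every row shares the key "hot".
    val skewed = sc.parallelize(Seq.fill(100000)(("hot", 1L)) ++ Seq(("cold", 1L)))

    // Step 1 (scatter): add a random salt to the key so the hot key is spread
    // over several reducers, then pre-combine in parallel.
    val partial = skewed
      .map { case (k, v) => ((k, Random.nextInt(saltBuckets)), v) }
      .reduceByKey(_ + _)

    // Step 2 (gather): drop the salt and combine the partial sums.
    // Only valid because summation is associative and commutative.
    val result = partial
      .map { case ((k, _), v) => (k, v) }
      .reduceByKey(_ + _)

    result.collect().foreach(println)  // expect ("hot", 100000) and ("cold", 1)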

Upvotes: 2
