bmcristi

Reputation: 103

Spark window operation with data skew

I am trying to make sense of some Spark metrics. Even though my data is relatively well distributed, one task takes considerably longer to finish than the others, apparently because of shuffle read time.

Is this still caused by skew? It shows up during a window operation (note: column names were changed for privacy, but kept suggestive). The job processes 122B records with 240 executors (10 cores, 60GB, 14GB overhead; I know that is quite big, but I found it works better than smaller executors).

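    from pyspark.sql import Window, functions as f

    # Rank rows within each (student, address, thesis) group and keep rank 1.
    # Note: rank() keeps ties, so several rows can share rank 1.
    df = (
        df.withColumn(
            "duplicate_rank",
            f.rank().over(
                Window.partitionBy(
                    "student_id",
                    "student_address_id",
                    "student_thesis_name",
                ).orderBy(
                    "thesis_chronology_rank",
                    "thesis_start_year",
                    f.col("thesis_end_year").desc(),
                )
            ),
        )
        .filter(f.col("duplicate_rank") == 1)
    )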


Any suggestions for a better approach are also appreciated. I tried a random salt (20 buckets) with two window operations (one with the salt, one without), but with little success, because Spark spends a ridiculously long time shuffling data around. I guess it also has to sort, and all of the keys are strings.
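For reference, the logic of the salted two-stage approach can be sketched in plain Python, without Spark. This is only an illustration of why the two passes return the same rows as a single `rank() == 1` pass; the helper names (`rank_one`, `salted_rank_one`) and the `salt` field are hypothetical, and the sketch says nothing about shuffle cost, which is the real problem here since each stage is its own shuffle.

```python
import random
from collections import defaultdict

# Partition columns taken from the window in the question.
KEY = ("student_id", "student_address_id", "student_thesis_name")


def order_key(row):
    # Mirrors orderBy(thesis_chronology_rank, thesis_start_year,
    # thesis_end_year DESC); negating the year gives descending order.
    return (
        row["thesis_chronology_rank"],
        row["thesis_start_year"],
        -row["thesis_end_year"],
    )


def rank_one(rows, group_fn):
    """Keep the rows tied for first place in each group (what rank() == 1 keeps)."""
    groups = defaultdict(list)
    for row in rows:
        groups[group_fn(row)].append(row)
    kept = []
    for members in groups.values():
        best = min(order_key(r) for r in members)
        kept.extend(r for r in members if order_key(r) == best)
    return kept


def salted_rank_one(rows, buckets=20, seed=0):
    """Two-stage variant: rank within (key, salt), then within key."""
    rng = random.Random(seed)
    salted = [dict(r, salt=rng.randrange(buckets)) for r in rows]
    # Stage 1: a hot key is split across up to `buckets` smaller groups.
    survivors = rank_one(
        salted, lambda r: tuple(r[k] for k in KEY) + (r["salt"],)
    )
    # Stage 2: only the per-bucket winners are ranked on the real key.
    final = rank_one(survivors, lambda r: tuple(r[k] for k in KEY))
    return [{k: v for k, v in r.items() if k != "salt"} for r in final]
```

The two stages agree with the single pass because any row that is first for the whole key is also first inside its own salt bucket, so it always survives stage 1.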

Later edit: I ran some tests, and repartitioning on a single column from the dataframe, "student_thesis_name", improves performance considerably. In the physical plan I only see a window operation after the exchange. Does Spark still shuffle afterwards, since the partitionBy needs 2 more columns? If so, why doesn't that shuffle appear in the physical plan?
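A possible explanation, stated as an assumption rather than something confirmed by the plan: Spark can treat hash partitioning on a subset of the partitionBy columns as already satisfying the window's distribution requirement, so it adds no second exchange and only sorts within each partition. The plain-Python sketch below illustrates the underlying property. `partition_of` is a hypothetical stand-in for a hash partitioner (Spark uses its own Murmur3-based hash, not Python's `hash`).

```python
from collections import defaultdict


def partition_of(value, num_partitions):
    # Hypothetical stand-in for a hash partitioner; not Spark's actual hash.
    return hash(value) % num_partitions


def partitions_per_full_key(rows, num_partitions):
    """Map each full 3-column key to the set of partitions its rows land in
    when rows are partitioned on thesis_name alone."""
    seen = defaultdict(set)
    for student_id, address_id, thesis_name in rows:
        full_key = (student_id, address_id, thesis_name)
        seen[full_key].add(partition_of(thesis_name, num_partitions))
    return seen


rows = [
    (1, 10, "a"),
    (1, 10, "a"),
    (2, 10, "a"),
    (1, 11, "b"),
    (1, 10, "a"),
]
placement = partitions_per_full_key(rows, num_partitions=8)
```

Every full key ends up in exactly one partition, because rows that agree on all three columns necessarily agree on `student_thesis_name`. The trade-off is that partitioning on fewer columns can put more distinct keys into the same partition.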

Upvotes: 0

Views: 104

Answers (1)

bmcristi

Reputation: 103

After digging deeper into this, I identified the problem as disk I/O (EBS throttling). It could also have been the network, but that was not the case for me.

Upvotes: 1
