cshin9

Reputation: 1490

How to set partition for Window function for PySpark?

I'm running a PySpark job, and I'm getting the following message:

WARN org.apache.spark.sql.execution.Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

What does the message indicate, and how do I define a partition for a Window operation?

EDIT:

I'm trying to rank on an entire column.

My data is organized as:

A
B
A
C
D

And I want:

A,1
B,3
A,1
C,4
D,5

I don't think there should be a .partitionBy() for this, only .orderBy(). The trouble is that this appears to cause performance degradation. Is there another way to achieve this without a Window function? (A minimal sketch of the approach is below, after the examples.)

If I partition by the first column, the result would be:

A,1
B,1
A,1
C,1
D,1

Which I do not want.
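
A minimal sketch of what I'm describing (the column name letter is just a placeholder for my single column):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# toy data matching the example above
df = spark.createDataFrame([("A",), ("B",), ("A",), ("C",), ("D",)], ["letter"])

# orderBy-only window: gives the ranks I want (A=1, A=1, B=3, C=4, D=5)
# but triggers the "No Partition Defined" warning
w = Window.orderBy("letter")
df.withColumn("rank", F.rank().over(w)).show()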

Upvotes: 7

Views: 13226

Answers (1)

eliasah

Reputation: 40360

Given the information in the question, the best I can provide is a skeleton showing how partitions should be defined for Window functions:

from pyspark.sql.window import Window

windowSpec = (
    Window
    .partitionBy(...)  # here is where you define the partitioning
    .orderBy(...)
)

This is equivalent to the following SQL:

OVER (PARTITION BY ... ORDER BY ...)
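
As a sketch of how the spec is applied (the DataFrame df and the columns category and value here are hypothetical), you pass it to a window function via .over():

from pyspark.sql import functions as F

windowSpec = Window.partitionBy("category").orderBy("value")

# rank rows within each "category" by "value"
df.withColumn("rank", F.rank().over(windowSpec)).show()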

Concerning the partitioning specification:

It controls which rows end up in the same partition as a given row. All rows having the same value for the partition column(s) are collected on the same machine before the ordering and frame calculation take place.
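
To illustrate with the question's data (assuming a SparkSession spark and the imports above): partitioning by the ranked column itself gives every distinct value its own partition, so rank() returns 1 for each row, which is exactly the output you said you don't want:

df = spark.createDataFrame([("A",), ("B",), ("A",), ("C",), ("D",)], ["letter"])

# each partition holds a single distinct value, so every row ranks first
perValue = Window.partitionBy("letter").orderBy("letter")
df.withColumn("rank", F.rank().over(perValue)).show()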

If you don't give any partitioning specification, then all the data must be collected to a single machine, hence the following warning message:

WARN org.apache.spark.sql.execution.Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

Upvotes: 7
