cshin9

Reputation: 1490

How to set partition for Window function for PySpark?

I'm running a PySpark job, and I'm getting the following message:

WARN org.apache.spark.sql.execution.Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

What does the message indicate, and how do I define a partition for a Window operation?

EDIT:

I'm trying to rank on an entire column.

My data is organized as:

A
B
A
C
D

And I want:

A,1
B,3
A,1
C,4
D,5

I don't think there should be a .partitionBy() for this, only .orderBy(). The trouble is that this appears to cause performance degradation. Is there another way to achieve this without a Window function? (A minimal sketch of the approach is below, after the examples.)

If I partition by the first column, the result would be:

A,1
B,1
A,1
C,1
D,1

Which I do not want.
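
A minimal sketch of what I'm describing (the column name letter is just a placeholder for my single column):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# toy data matching the example above
df = spark.createDataFrame([("A",), ("B",), ("A",), ("C",), ("D",)], ["letter"])

# orderBy-only window: gives the ranks I want (A=1, A=1, B=3, C=4, D=5)
# but triggers the "No Partition Defined" warning
w = Window.orderBy("letter")
df.withColumn("rank", F.rank().over(w)).show()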

Upvotes: 7

Views: 13226

Answers (1)

eliasah

Reputation: 40360

Given the information in the question, the best I can provide is a skeleton showing how partitions should be defined for Window functions:

from pyspark.sql.window import Window

windowSpec = (
    Window
    .partitionBy(...)  # here is where you define the partitioning
    .orderBy(...)
)

This is equivalent to the following SQL:

OVER (PARTITION BY ... ORDER BY ...)
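
As a sketch of how the spec is applied (the DataFrame df and the columns category and value here are hypothetical), you pass it to a window function via .over():

from pyspark.sql import functions as F

windowSpec = Window.partitionBy("category").orderBy("value")

# rank rows within each "category" by "value"
df.withColumn("rank", F.rank().over(windowSpec)).show()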

Concerning the partitioning specification:

It controls which rows end up in the same partition as a given row. All rows having the same value for the partition column(s) are collected on the same machine before the ordering and frame calculation take place.
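
To illustrate with the question's data (assuming a SparkSession spark and the imports above): partitioning by the ranked column itself gives every distinct value its own partition, so rank() returns 1 for each row, which is exactly the output you said you don't want:

df = spark.createDataFrame([("A",), ("B",), ("A",), ("C",), ("D",)], ["letter"])

# each partition holds a single distinct value, so every row ranks first
perValue = Window.partitionBy("letter").orderBy("letter")
df.withColumn("rank", F.rank().over(perValue)).show()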

If you don't give any partitioning specification, then all the data must be collected to a single machine, hence the following warning message:

WARN org.apache.spark.sql.execution.Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

Upvotes: 7
