Jakub Bares

Reputation: 159

PySpark: PartitionBy leaves the same value in column by which partitioned multiple times

I need to partitionBy in order to get distinct values in the time and match_instatid columns, but it only produces distinct values about half the time.

window_match_time_priority = Window.partitionBy(col("match_instatid"), col("time")) \
    .orderBy(col("match_instatid"), col("time"), priority_udf(col("type")).desc())

with_owner = match.select('match_instatid', "time", "type",
        F.last(col("team_instatid")).over(window_match_time_priority).alias('last_team'),
        F.last(col("type")).over(window_match_time_priority).alias('last_action')) \
    .withColumn("owner", owner_assignment_udf(col("last_team"), col("last_action")))

You can see that the last_action column is duplicated for only some of the rows with the same time, but it should be for all of them. There should be only one value for owner and last_action per unique time value.

[Picture of the partitioned DataFrame]

Upvotes: 2

Views: 1230

Answers (1)

murtihash

Reputation: 8410

Try this as the window. For F.last to work, the frame must be unbounded: with the default frame of an ordered window (unboundedPreceding to currentRow), F.last simply returns the current row's own value. F.first works without an unbounded frame because the default frame always starts at the first row of the partition.

window_match_time_priority = Window.partitionBy(col("match_instatid"),col("time")).orderBy(col("match_instatid"),col("time"), priority_udf(col("type")).desc())\
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
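To see why the frame matters, here is a plain-Python sketch of the window-frame semantics (not Spark itself; the function and sample values are illustrative). With the default frame, `last` over an ordered window is just the current row; with an unbounded frame, every row in the partition sees the true last value:

```python
def last_over(values, unbounded=False):
    """Emulate F.last over one ordered window partition.

    Default frame (unboundedPreceding..currentRow): 'last' in the frame
    is the current row itself. Unbounded frame (unboundedPreceding..
    unboundedFollowing): every row sees the partition's final value.
    """
    out = []
    for i, _ in enumerate(values):
        frame = values if unbounded else values[: i + 1]  # rows visible to row i
        out.append(frame[-1])  # F.last = last row of the frame
    return out

# One hypothetical (match_instatid, time) partition, ordered by priority
partition = ["pass", "shot", "goal"]

print(last_over(partition))                  # default frame: each row sees itself
print(last_over(partition, unbounded=True))  # unbounded frame: all rows see "goal"
```

This is exactly why the question's code produced one consistent value only "about half the time": rows that happened to sort last in their partition already held the final value, while earlier rows saw only their own.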

Upvotes: 1
