MLam

Reputation: 171

sort pyspark dataframe within groups

I would like to sort column "time" within each "id" group.

The data looks like:

id  time  name
132  12   Lucy
132  10   John
132  15   Sam
78   11   Kate
78   7    Julia
78   2    Vivien
245  22   Tom

I would like to get this:

id  time  name
132  10   John
132  12   Lucy
132  15   Sam
78   2    Vivien
78   7    Julia
78   11   Kate
245  22   Tom

I tried

df.orderBy(['id', 'time'])

But I don't need to sort "id".

I have two questions:

  1. Can I sort "time" within each "id" group, without sorting "id"? If so, how?
  2. Would it be more efficient to sort only "time" than to use orderBy() on both columns?

Upvotes: 8

Views: 8550

Answers (1)

penguin

Reputation: 459

This is exactly what windowing is for. You can create a window partitioned by the "id" column and ordered by the "time" column, and then apply any function over that window.
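
For reference, here's a minimal sketch that reproduces the sample data from the question (it assumes an active SparkSession; the inferred column types are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the example dataframe from the question
df = spark.createDataFrame(
    [(132, 12, "Lucy"), (132, 10, "John"), (132, 15, "Sam"),
     (78, 11, "Kate"), (78, 7, "Julia"), (78, 2, "Vivien"),
     (245, 22, "Tom")],
    ["id", "time", "name"],
)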

# Create a Window
from pyspark.sql.window import Window
w = Window.partitionBy(df.id).orderBy(df.time)

Now you can apply a function over this window. For example, say you want a column holding the time delta between consecutive rows within the same group:

import pyspark.sql.functions as f

# lag(df.time, 1) fetches the previous row's "time" within the window
df = df.withColumn("timeDelta", df.time - f.lag(df.time, 1).over(w))
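
With the sample data above, the result would look something like this (the first row of each group has a null delta, since there is no previous row; the relative order of the groups themselves is not guaranteed):

+---+----+------+---------+
| id|time|  name|timeDelta|
+---+----+------+---------+
| 78|   2|Vivien|     null|
| 78|   7| Julia|        5|
| 78|  11|  Kate|        4|
|132|  10|  John|     null|
|132|  12|  Lucy|        2|
|132|  15|   Sam|        3|
|245|  22|   Tom|     null|
+---+----+------+---------+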

I hope this gives you the idea: effectively, the window sorts your dataframe within each group, and you can now apply any function over it.

If you just want to view the sorted result, you can compute each row's number over the window and sort by that as well:

df.withColumn("order", f.row_number().over(w)).sort("order").show()
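
If you don't want the helper column in the final output, you can drop it after sorting (a small variation on the line above):

df.withColumn("order", f.row_number().over(w)).sort("order").drop("order").show()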

Upvotes: 8
