Reputation: 590
I have a small piece of code in SparkR and I would like to translate it to PySpark. I am not familiar with windowPartitionBy or repartition. Could you please help me understand what this code is doing?
ws <- orderBy(windowPartitionBy('A'), 'B')
df1 <- df %>% select(df$A, df$B, df$D, SparkR::over(lead(df$C, 1), ws))
df2 <- repartition(df1, col = df1$D)
Upvotes: 0
Views: 71
Reputation: 42392
In PySpark it would be equivalent to:
from pyspark.sql import functions as F, Window
ws = Window.partitionBy('A').orderBy('B')                 # rows grouped by A, ordered by B within each group
df1 = df.select('A', 'B', 'D', F.lead('C', 1).over(ws))   # C taken from the next row of the window
df2 = df1.repartition('D')                                # redistribute rows by hashing column D
The code selects, from df, columns A, B and D, plus column C of the next row within the window ws (that is what lead does), into df1.
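To see what lead over that window produces, here is a minimal, self-contained sketch with made-up data (the column values and the next_C alias are my own, purely for illustration):

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# toy frame: two groups in A, ordered by B within each group
df = spark.createDataFrame(
    [('x', 1, 10, 'p'), ('x', 2, 20, 'p'), ('x', 3, 30, 'q'), ('y', 1, 40, 'q')],
    ['A', 'B', 'C', 'D'])

ws = Window.partitionBy('A').orderBy('B')
df.select('A', 'B', 'D', F.lead('C', 1).over(ws).alias('next_C')).show()
# In the A='x' group, next_C is 20, 30, then null (there is no next row);
# the single A='y' row also gets null.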
Then it repartitions df1 using column D into df2. Partitioning determines how your dataframe is distributed across memory/storage, and it has direct implications for how the data is processed in parallel. If you want to know more about repartitioning dataframes, you can go to https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.repartition
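If you want to see the repartitioning at work, here is a quick sketch, assuming the df1 from above (note that repartition with only a column argument uses spark.sql.shuffle.partitions, 200 by default, as the target partition count):

from pyspark.sql import functions as F

df2 = df1.repartition('D')  # hash-partition rows by column D

# compare partition counts before and after the shuffle
print(df1.rdd.getNumPartitions(), df2.rdd.getNumPartitions())

# spark_partition_id() reveals which partition each row landed in;
# rows sharing a D value end up in the same partition
df2.withColumn('pid', F.spark_partition_id()).show()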
Upvotes: 2