Stasko Staskov

Reputation: 13

How to translate a pandas group by without aggregation to pyspark?

I am trying to convert the following pandas line into pyspark:

df = df.groupby('ID', as_index=False).head(1)

Now, I am familiar with the pyspark df.groupby("col1", "col2") method in pyspark, as well as the following to get whatever the first element is within a group:

df = df.withColumn("row_num", row_number().over(Window.partitionBy("ID").orderBy("SOME_DATE_COLUMN"))).where(col("row_num") < 2)

However, without an orderBy argument, this grouping and fetching of the first element in each group doesn't work (and I am literally trying to replicate whatever the pandas line does, as-is):

An error occurred while calling o2547.withColumn. : org.apache.spark.sql.AnalysisException: Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table

Looking at the pandas groupby documentation, I cannot grasp what groupby does without a subsequent sort/aggregation applied to the groups; i.e., what is the default order within a group from which .head(1) fetches the first element?
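For reference, a minimal check (with made-up data) of what the pandas line actually returns: groupby only sorts the group keys, not the rows within each group, so head(1) keeps the first row of each group in the DataFrame's original order.

```python
import pandas as pd

# Hypothetical data: "B" appears before "A" on purpose, to show that
# head(1) follows the original row order, not a sorted order.
df = pd.DataFrame({"ID": ["B", "A", "B", "A"], "value": [1, 2, 3, 4]})

out = df.groupby("ID", as_index=False).head(1)
# Keeps the rows at positions 0 ("B", 1) and 1 ("A", 2),
# in their original order.
```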

Upvotes: 1

Views: 384

Answers (1)

cronoik

Reputation: 19495

It depends on the order of your pandas dataframe before the groupby. From the pandas groupby documentation:

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

Converting the pandas behaviour exactly to pyspark is impossible, as pyspark dataframes aren't ordered. But if your data source can provide a row number or something like that, it is possible.

Upvotes: 0
