Amardeep Flora

Reputation: 1380

Pyspark dataframe SQL

I want to convert the following SQL statement into a DataFrame select statement:

Select 
 YY,
 PP,
 YYYY,
 PPPP,
 Min(ID) as MinId, 
 Max(ID) as MaxID 
from LoadTable

I have tried the following but it doesn't seem to be working:

df.select(df.ID,df.YY, df.PP,df.YYYY,df.PPPPP).agg({"ID": "max", "ID": "min"}).toPandas().to_csv(outputFile, sep="|", header=True, index=False)

Upvotes: 0

Views: 302

Answers (1)

Denny Lee

Reputation: 3254

As you are performing an aggregate function, what you may be missing here is the GROUP BY statement. If so, your SQL statement would be:

SELECT YY, PP, YYYY, PPPP, Min(ID) as MinId, Max(ID) as MaxID 
  FROM LoadTable 
 GROUP BY YY, PP, YYYY, PPPP

The corresponding PySpark DataFrame statement would then be:

from pyspark.sql import functions as F
df.groupBy(df.YY, df.PP, df.YYYY, df.PPPP).agg(F.min(df.ID), F.max(df.ID))

HTH!

Upvotes: 2
