Reputation: 1380
I want to convert the following SQL statement into a DataFrame select statement:
SELECT
    YY,
    PP,
    YYYY,
    PPPP,
    Min(ID) as MinId,
    Max(ID) as MaxID
FROM LoadTable
I have tried the following, but it doesn't seem to be working:
df.select(df.ID, df.YY, df.PP, df.YYYY, df.PPPPP).agg({"ID": "max", "ID": "min"}).toPandas().to_csv(outputFile, sep="|", header=True, index=False)
Upvotes: 0
Views: 302
Reputation: 3254
As you are applying aggregate functions, what you may be missing here is the GROUP BY clause. If so, your SQL statement would be:
SELECT YY, PP, YYYY, PPPP, Min(ID) as MinId, Max(ID) as MaxID
FROM LoadTable
GROUP BY YY, PP, YYYY, PPPP
The corresponding PySpark DataFrame statement would then be:
from pyspark.sql import functions as F
df.groupBy(df.YY, df.PP, df.YYYY, df.PPPP).agg(F.min(df.ID), F.max(df.ID))
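As a side note, the dictionary form agg({"ID": "max", "ID": "min"}) cannot express both aggregates, because a Python dict cannot hold the key "ID" twice; the second entry silently overwrites the first. If you also want the column names from your SQL and the pipe-delimited CSV export from your attempt, a sketch (assuming outputFile is the path variable from your snippet) could look like this:

from pyspark.sql import functions as F

result = df.groupBy(df.YY, df.PP, df.YYYY, df.PPPP).agg(
    F.min(df.ID).alias("MinId"),  # matches Min(ID) as MinId
    F.max(df.ID).alias("MaxID"),  # matches Max(ID) as MaxID
)

# Collect to pandas and write a pipe-delimited CSV, as in the original attempt
result.toPandas().to_csv(outputFile, sep="|", header=True, index=False)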
HTH!
Upvotes: 2