Reputation: 406
I am really new to PySpark. All I want to do is to find the maximum of the "date" column and add a new column in the dataframe that has this maximum date for all the rows (repeating) such that:
A   B                            C
a   timestamp1                   timestamp3
b   timestamp2  ---------------> timestamp3
c   timestamp3                   timestamp3
I use the following line of code:
df.withColumn('dummy_column',f.lit((f.max('date'))).cast('timestamp')).show(9)
But I get the error:
> AnalysisException: grouping expressions sequence is empty, and
> '`part`' is not an aggregate function. Wrap '(CAST(max(`date`) AS
> TIMESTAMP) AS `new_column`)' in windowing function(s) or wrap '`part`'
> in first() (or first_value) if you don't care which value you get.;;
Can anyone help me understand why I get this error and how I can work around it?
Upvotes: 4
Views: 4807
Reputation: 2003
There is one more alternative to get the max timestamp: aggregate the existing dataframe with max() to get the maximum value, collect it into a variable, and use it as per your requirement:
import pyspark.sql.functions as F

df = spark.createDataFrame([(1, "2020-10-13"), (2, "2020-10-14"), (3, "2020-10-15")], ["id", "ts"])
df.show()

df_max = df.agg(F.max("ts").alias("max_ts"))
df_max_var = df_max.collect()[0]['max_ts']  # take the max into a variable for future use

df = df.withColumn("dummy_col", F.lit(df_max_var))
df.show()
+---+----------+
| id| ts|
+---+----------+
| 1|2020-10-13|
| 2|2020-10-14|
| 3|2020-10-15|
+---+----------+
+---+----------+----------+
| id| ts| dummy_col|
+---+----------+----------+
| 1|2020-10-13|2020-10-15|
| 2|2020-10-14|2020-10-15|
| 3|2020-10-15|2020-10-15|
+---+----------+----------+
Upvotes: 1
Reputation: 42342
You're probably looking for:
import pyspark.sql.functions as f
from pyspark.sql.window import Window
w = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('dummy_column',f.max('date').over(w).cast('timestamp')).show(9)
Aggregate functions like max only work inside a grouping operation or over a window. They cannot stand alone in withColumn because, on their own, they don't specify the set of rows over which the aggregate operates. Here the window w spans the whole dataframe, so every row receives the global maximum.
Upvotes: 2