Trying to create a column with the maximum timestamp in a PySpark DataFrame

I am really new to PySpark. All I want to do is find the maximum of the "date" column and add a new column to the dataframe that holds this maximum date for all rows (repeated), such that:

A      B                                        C
a  timestamp1                              timestamp3
b  timestamp2    -------------------->     timestamp3
c  timestamp3                              timestamp3

I use the following line of code:

df.withColumn('dummy_column',f.lit((f.max('date'))).cast('timestamp')).show(9)

But I get the error:

> AnalysisException: grouping expressions sequence is empty, and
> '`part`' is not an aggregate function. Wrap '(CAST(max(`date`) AS
> TIMESTAMP) AS `new_column`)' in windowing function(s) or wrap '`part`'
> in first() (or first_value) if you don't care which value you get.;;

Can anyone help me understand why I get this error and how I can work around it?

Upvotes: 4

Views: 4807

Answers (2)

dsk

Reputation: 2003

There is one more alternative to get the max timestamp: aggregate the existing dataframe and use max() in order to get the max timestamp, take that into a variable, and use it as per your requirement.

Create the DataFrame here:

import pyspark.sql.functions as F

df = spark.createDataFrame([(1, "2020-10-13"), (2, "2020-10-14"), (3, "2020-10-15")], ["id", "ts"])
df.show()

# Global aggregation: no grouping key, just the overall max of "ts"
df_max = df.agg(F.max("ts").alias("max_ts"))
# Taking the value into a variable for future use
df_max_var = df_max.collect()[0]['max_ts']

df = df.withColumn("dummy_col", F.lit(df_max_var))
df.show()

Input

+---+----------+
| id|        ts|
+---+----------+
|  1|2020-10-13|
|  2|2020-10-14|
|  3|2020-10-15|
+---+----------+

Output

+---+----------+----------+
| id|        ts| dummy_col|
+---+----------+----------+
|  1|2020-10-13|2020-10-15|
|  2|2020-10-14|2020-10-15|
|  3|2020-10-15|2020-10-15|
+---+----------+----------+
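
Once the max is sitting in a Python variable, you can reuse it anywhere a literal or filter value fits. A hypothetical follow-up (reusing df and df_max_var from the code above) that keeps only the rows matching the max:

# Keep only the rows whose timestamp equals the global max
df.filter(F.col("ts") == df_max_var).show()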

Upvotes: 1

mck

Reputation: 42342

You're probably looking for:

import pyspark.sql.functions as f
from pyspark.sql.window import Window

# a window spanning all rows of the dataframe
w = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('dummy_column', f.max('date').over(w).cast('timestamp')).show(9)

Aggregate functions like max work with a window or a grouping operation. They cannot work on their own because you did not specify the range of rows over which the aggregate function operates.
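
For completeness, the grouping route can look like this: aggregate the whole dataframe down to a single row holding the max, then cross join that row back onto every row. A minimal sketch, assuming the column is named 'date' as in the question:

import pyspark.sql.functions as f

# One-row dataframe holding the global max of 'date'
max_df = df.agg(f.max('date').cast('timestamp').alias('dummy_column'))

# The cross join attaches that single row to every row of df
df.crossJoin(max_df).show(9)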

Upvotes: 2
