WorkBench

Reputation: 93

How to add a constant column with the maximum value to a PySpark dataframe without grouping by

Suppose we have a PySpark dataframe with two columns, ID (which is unique) and VALUE.

I need to add a third column that always contains the same value, namely the maximum of the column VALUE. Note that in this case it makes no sense to group by ID, because I need a global maximum.
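For example, with made-up numbers, starting from something like

df = spark.createDataFrame([(1, 10), (2, 40), (3, 25)], ['ID', 'VALUE'])

the result should look like this (MAX_VALUE is just a placeholder name for the new column):

+---+-----+---------+
| ID|VALUE|MAX_VALUE|
+---+-----+---------+
|  1|   10|       40|
|  2|   40|       40|
|  3|   25|       40|
+---+-----+---------+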

It sounds very simple, and it probably is, but I have only found solutions involving group by, which do not fit my case. I tried a lot of things but nothing worked.

I need a solution in PySpark/Python code only. Thanks a lot!

Upvotes: 0

Views: 1835

Answers (2)

Karthik

Reputation: 1171

In your case you can use a window function; I presume your VALUE column is numeric. Since ID is unique, partitioning by it would not help, so use a window that spans the whole dataframe:

from pyspark.sql.functions import max
from pyspark.sql.window import Window

# no partition columns: the window covers the whole dataframe, so the max is global
spec = Window.partitionBy()
newDF = df.withColumn('maxValue', max('VALUE').over(spec))
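For example, with a toy dataframe (made-up data), every row gets the global maximum; the output below is roughly what show() prints:

df = spark.createDataFrame([(1, 10), (2, 40), (3, 25)], ['ID', 'VALUE'])
df.withColumn('maxValue', max('VALUE').over(spec)).show()
# +---+-----+--------+
# | ID|VALUE|maxValue|
# +---+-----+--------+
# |  1|   10|      40|
# |  2|   40|      40|
# |  3|   25|      40|
# +---+-----+--------+

Note that Spark will warn that no partition is defined and move all rows to a single partition, which is fine for small data but can be slow at scale.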

Upvotes: 0

fmarm

Reputation: 4284

You can do this:

from pyspark.sql.functions import max, lit
# compute the max of the VALUE column
max_df = df.select(max(df['VALUE'])).collect()
# collect() returns a list with a single Row; extract the scalar value
max_val = max_df[0][0]
# create the new column in df; you need lit because you are adding a constant value
df = df.withColumn('newcol', lit(max_val))
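Collecting a single scalar to the driver is cheap, so this works fine. If you prefer to keep everything lazy inside Spark, a cross join with the one-row aggregate is a possible alternative (just a sketch, not required for the question):

from pyspark.sql.functions import max
# build a 1-row dataframe holding the global max, then attach it to every row
max_row = df.agg(max('VALUE').alias('newcol'))
df = df.crossJoin(max_row)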

Upvotes: 2
