Stefano
Stefano

Reputation: 367

How to replace all values of the same group with the minimum in PySpark

I am practising with PySpark and I need to obtain something like this. I have a dataset that I have grouped by title, let's say, and I need for each title to replace a (new) third column with the minimum of another column

title var new_col
item1 3 0
item1 4 0
item1 5 0
item2 2 0
item2 10 0

To this table

title var min
item1 3 3
item1 4 3
item1 5 3
item2 2 2
item2 10 2

For now I succeeded in grouping the first table, but clearly I only end up with two rows, the first being item1 with the min of '3' and item2 with '2'. Any suggestion on how I can proceed now?

Upvotes: 1

Views: 1557

Answers (3)

mck
mck

Reputation: 42422

You can use window functions for this purpose.

from pyspark.sql import functions as F, Window

df2 = df.select(
    'title',
    'var',
    F.min('var').over(Window.partitionBy('title')).alias('min')
)

Or, for a simpler syntax:

df2 = df.selectExpr('title', 'var', 'min(var) over(partition by title) min')

Upvotes: 2

nobody
nobody

Reputation: 11090

PartitionBy title orderBy var and get first row for reach partition.

import org.apache.spark.sql.expressions.Window   
import org.apache.spark.sql.functions.col

w = Window.partitionBy("title").orderBy(col("var"))
df.withColumn("min",row_number.over(w)).where($"min" === 1)

Or

from pyspark.sql import functions as f

df.groupBy("title").agg(f.min("var").alias("min"))

Upvotes: 0

Brijesh
Brijesh

Reputation: 41

Define a window function which partition data over "title" column. Then within each partition , get minimum of all rows in that partition. This will give you minimum value along with other columns.

from pyspark.sql.window import Window
from pyspark.sql import functions as F
windowSpec = Window.partitionBy(df['title']) .orderBy(df['title'].desc()).rangeBetween(-sys.maxsize, sys.maxsize)
min_var =  (F.min(df['var']).over(windowSpec))
min_df =df.select("title", "var", min_var.alias("min"))
min_df.printSchema()
min_df.show()

Upvotes: 0

Related Questions