Reputation: 696
I have a PySpark DataFrame like the one below:
+---+----+----+----+
| ID|colA|colB|colC|
+---+----+----+----+
|ID1|   3|   5|   6|
|ID2|   4|  12|   7|
|ID3|   2|  20|  22|
+---+----+----+----+
I want to select row ID3, find the maximum value across the three columns, and display the name of the column that holds that maximum. So for row ID3 it should return the output below:
+----+
|colC|
+----+
|22  |
+----+
So my question is: how do I select a single row and then get the column name based on the maximum column value of that row in a PySpark DataFrame?
Upvotes: 0
Views: 647
Reputation: 32660
You can filter on ID = 'ID3'
and use the greatest
function on a list of structs to get the max column name and its value, then pivot to get it as a column:
from pyspark.sql import functions as F

df1 = df.filter("ID = 'ID3'").withColumn(
    "max_value",
    F.greatest(
        *[F.struct(F.col(c).alias("v"), F.lit(c).alias("name")) for c in df.columns[1:]]
    )
).select("max_value.*").groupBy().pivot("name").agg(F.first("v"))
df1.show(truncate=False)
#+----+
#|colC|
#+----+
#|22  |
#+----+
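If you don't need the pivoted single-column layout, a variant of the same trick keeps the winning column name and its value as two ordinary columns. A minimal self-contained sketch, assuming a local SparkSession and the sample data from the question (not part of the original answer):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Sample data matching the question
df = spark.createDataFrame(
    [("ID1", 3, 5, 6), ("ID2", 4, 12, 7), ("ID3", 2, 20, 22)],
    ["ID", "colA", "colB", "colC"],
)

# Structs compare field by field, so putting the value first makes
# greatest() return the struct holding the largest value; the column
# name rides along in the second field.
df2 = df.filter("ID = 'ID3'").select(
    F.greatest(
        *[F.struct(F.col(c).alias("v"), F.lit(c).alias("name")) for c in df.columns[1:]]
    ).alias("max_value")
).select("max_value.name", "max_value.v")

df2.show()
#+----+--+
#|name| v|
#+----+--+
#|colC|22|
#+----+--+

Compared with the pivot, this skips the implicit global aggregation, so it still works per row if the filter matches more than one ID.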
Upvotes: 1