user3046211

Reputation: 696

Select a row and display the column name based on max value in pyspark

I have a Pyspark data frame as

+---+----+----+----+
| ID|colA|colB|colC|
+---+----+----+----+
|ID1|   3|   5|   6|
|ID2|   4|  12|   7|
|ID3|   2|  20|  22|
+---+----+----+----+

I want to select row ID3, take the maximum value across the three columns, and display the name of the column that holds that maximum. For row ID3 the result should be:

+----+
|colC|
+----+
|22  |
+----+

So my question is: how do I select a single row and return the name of the column holding the maximum value in that row of a PySpark DataFrame?
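For clarity, here is the intent in plain Python terms, using a hypothetical dict standing in for row ID3 (not the Spark API):

```python
# Hypothetical representation of row ID3 as a plain dict
row = {"colA": 2, "colB": 20, "colC": 22}

# Pick the column whose value is largest
max_col = max(row, key=row.get)
print(max_col, row[max_col])  # colC 22
```

The question is how to express this per-row column lookup with PySpark DataFrame operations.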

Upvotes: 0

Views: 647

Answers (1)

blackbishop

Reputation: 32660

You can filter on ID = 'ID3' and apply the greatest function to a list of (value, column name) structs to get the max value along with its column name. Then pivot to turn that name into a column:

from pyspark.sql import functions as F

df1 = df.filter("ID = 'ID3'").withColumn(
    "max_value",
    # greatest compares structs field by field, so these (value, name)
    # pairs are ordered by value first; the winning struct carries the
    # name of the column that held the maximum
    F.greatest(
        *[F.struct(F.col(c).alias("v"), F.lit(c).alias("name")) for c in df.columns[1:]]
    )
).select("max_value.*").groupBy().pivot("name").agg(F.first("v"))

df1.show(truncate=False)

#+----+
#|colC|
#+----+
#|22  |
#+----+
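The struct trick works because Spark orders structs field by field, exactly like Python tuple comparison. A rough plain-Python sketch of what greatest does over the (value, name) structs:

```python
# Plain-Python analogue of F.greatest over (value, name) structs:
# tuples compare field by field, just like Spark structs
pairs = [(2, "colA"), (20, "colB"), (22, "colC")]
print(max(pairs))  # (22, 'colC')

# On a tie in value, the second field (the column name) breaks the tie,
# so the alphabetically later column name wins
tie = [(22, "colA"), (22, "colC")]
print(max(tie))  # (22, 'colC')
```

This is also why the value field must come first in the struct: putting the name first would rank columns alphabetically instead of by value.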

Upvotes: 1
