pyspark dataframe get second lowest value for each row

Question

Does anyone have an idea how to get the second-lowest value in each row of a DataFrame in PySpark?

For example:

Input Dataframe:

Col1  Col2  Col3  Col4 
83    32    14    62   
63    32    74    55   
13    88     6    46

Expected output:

Col1  Col2  Col3  Col4 Res
83    32    14    62   32   
63    32    74    55   55   
13    88     6    46   13

notNull · Accepted Answer

We can use the concat_ws function to concatenate all columns of the row, then use split to turn the result into an array.

Then use the array_sort function to sort within the array and extract its second element (index [1], since array indexing is zero-based).
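One thing to watch with this approach: split produces strings, and strings sort lexicographically rather than numerically. The per-row logic can be sketched in plain Python (no Spark involved; the function names here are illustrative only) to show the difference:

```python
def second_lowest_as_strings(row):
    # Mimics concat_ws + split: every value ends up as a string.
    parts = ",".join(str(v) for v in row).split(",")
    return sorted(parts)[1]  # lexicographic order

def second_lowest_as_ints(row):
    # Same pipeline, but each piece is converted to int before sorting.
    parts = ",".join(str(v) for v in row).split(",")
    return sorted(int(p) for p in parts)[1]  # numeric order

rows = [(83, 32, 14, 62), (63, 32, 74, 55), (13, 88, 6, 46)]
string_results = [second_lowest_as_strings(r) for r in rows]
int_results = [second_lowest_as_ints(r) for r in rows]

print(string_results)  # ['32', '55', '46']
print(int_results)     # [32, 55, 13]
```

The third row is where the two diverge: as strings, '6' sorts after '46', so the string version returns '46' instead of 13.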

Example:

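from pyspark.sql import SparkSession

# reuse the active session (e.g. in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()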

df=spark.createDataFrame([('83','32','14','62'),('63','32','74','55'),('13','88','6','46')],['Col1','Col2','Col3','Col4'])

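# split() returns strings, so cast to array<int> to sort numerically
# (as strings, '6' would sort after '46')
df.selectExpr("array_sort(cast(split(concat_ws(',',Col1,Col2,Col3,Col4),',') as array<int>))[1] Res").show()

#+---+
#|Res|
#+---+
#| 32|
#| 55|
#| 13|
#+---+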

More Dynamic Way:

# * expands to all columns; the cast keeps the sort numeric
df.selectExpr("array_sort(cast(split(concat_ws(',',*),',') as array<int>))[1] Res").show()

#+---+
#|Res|
#+---+
#| 32|
#| 55|
#| 13|
#+---+

EDIT:

#adding Res column to the dataframe
#(cast to array<int> so the sort is numeric rather than lexicographic)
df1=df.selectExpr("*","array_sort(cast(split(concat_ws(',',*),',') as array<int>))[1] Res")
df1.show()

#+----+----+----+----+---+
#|Col1|Col2|Col3|Col4|Res|
#+----+----+----+----+---+
#|  83|  32|  14|  62| 32|
#|  63|  32|  74|  55| 55|
#|  13|  88|   6|  46| 13|
#+----+----+----+----+---+
