Anujsharma
Anujsharma

Reputation: 37

Need to get max value for each specified column after converting table header into a column

I need a pointer/clue for below problem statement

Problem statement: I need to convert all table header into column(col_name) and get the maximum value of all those column, I am trying below logic but got stuck any suggestion/idea will be much helpful.

    **from pyspark.sql import Row
    from pyspark.sql.types import *
    from pyspark.sql.functions import col,lit,max
    df = sc.parallelize([ \
        Row(name='Alice', age=5, height=80), \
        Row(name='Mujkesh', age=10, height=90), \
        Row(name='Ganesh', age=15, height=100)]).toDF().createOrReplaceTempView("Test")
    df3 = spark.sql("Describe Test" )
    df4= df3.withColumn("Max_val",max(col(age))).show()
    given input :
    +---+------+-------+
    |age|height|   name|
    +---+------+-------+
    |  5|    80|  Alice|
    | 10|    90|Mujkesh|
    | 15|   100| Ganesh|
    +---+------+-------+
    expected output:
    +--------+---------+-------+-------+
    |col_name|data_type|comment|Max_val|
    +--------+---------+-------+-------+
    |     age|   bigint|   null|     15|
    |  height|   bigint|   null|    100|
    |    name|   string|   null|   null|
    +--------+---------+-------+-------+**

Upvotes: 0

Views: 56

Answers (1)

notNull
notNull

Reputation: 31510

Try with stack function and then group by to get the max value for the group.

  • Then join with the desc dataframe.

Example:

from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import *
df = sc.parallelize([ \
    Row(name='Alice', age=5, height=80), \
    Row(name='Mujkesh', age=10, height=90), \
    Row(name='Ganesh', age=15, height=100)]).toDF()
df.createOrReplaceTempView("Test")
df3 = spark.sql("desc Test" )
df4=df.selectExpr("stack(3,'name',bigint(name),'age',age,'height',height) as (col_name,data)").groupBy(col("col_name")).agg(max(col("data")).alias("Max_val"))
df5=df3.join(df4,['col_name'],'inner').orderBy("col_name")
df5.show()
#+--------+---------+-------+-------+
#|col_name|data_type|comment|Max_val|
#+--------+---------+-------+-------+
#|     age|   bigint|   null|     15|
#|  height|   bigint|   null|    100|
#|    name|   string|   null|   null|
#+--------+---------+-------+-------+

Upvotes: 2

Related Questions