Ken Jiiii

Reputation: 730

Spark Dataframe maximum on Several Columns of a Group

How can I get the per-group maximum of several columns of different (string and numeric) types in a Spark DataFrame using Scala?

Let's say this is my data:

+----+------+------+------+
|name|value1|value2|string|
+----+------+------+------+
|   A|     7|     9|   "a"|
|   A|     1|    10|  null|
|   B|     4|     4|   "b"|
|   B|     3|     6|  null|
+----+------+------+------+

and the desired outcome is:

+----+------+------+------+
|name|value1|value2|string|
+----+------+------+------+
|   A|     7|    10|   "a"|
|   B|     4|     6|   "b"|
+----+------+------+------+

Is there a function like pandas' apply(max, axis=0), or do I have to write a UDF?

What I can do is df.groupBy("name").max("value1"), but I cannot chain two max calls, and passing a sequence of columns to max() does not work either.

Any ideas for solving this quickly?

Upvotes: 1

Views: 3738

Answers (1)

Tawkir

Reputation: 1186

Use agg, which accepts one max expression per column:

import org.apache.spark.sql.functions.max

df.groupBy("name").agg(max("value1"), max("value2"))
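
For completeness, here is a minimal self-contained sketch (the column names and sample data are taken from the question; the local SparkSession setup is an assumption for testing). Since max ignores nulls and compares strings lexicographically, adding max("string") reproduces the desired output, including the string column:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Sample data from the question; None becomes a null string cell.
val df = Seq(
  ("A", 7, 9, Some("a")),
  ("A", 1, 10, None),
  ("B", 4, 4, Some("b")),
  ("B", 3, 6, None)
).toDF("name", "value1", "value2", "string")

// One max per column; nulls are ignored and strings compare lexicographically.
df.groupBy("name")
  .agg(max("value1"), max("value2"), max("string"))
  .show()

If you don't want to list the columns by hand, you can build the aggregate expressions from df.columns:

// Aggregate every column except the grouping key.
val aggs = df.columns.filterNot(_ == "name").map(c => max(c).as(c))
df.groupBy("name").agg(aggs.head, aggs.tail: _*)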

Upvotes: 2
