Reputation: 720
I have the following DataFrame:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| 6|null|null|
| B|null| 5|null|
| C|null|null| 7|
| B|null|null| 4|
| B|null| 2|null|
| B|null| 1|null|
| A| 4|null|null|
+----+----+----+----+
What I would like to do in Spark is return all entries in col1 that hold the maximum value in one of the columns col2, col3, or col4. For the sample data that means A (max of col2 is 6), B (max of col3 is 5), and C (max of col4 is 7).
This snippet won't do what I want:
df.groupBy("col1").max("col2","col3","col4").show()
And this one gives the max for only a single column (1):
df.groupBy("col1").max("col2").show()
I even tried to merge the single outputs like this:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.StructType

// merge rows
val rows = test1.rdd.zip(test2.rdd).map {
  case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)
}
// merge schemas
val schema = StructType(test1.schema.fields ++ test2.schema.fields)
// create the new DataFrame
val test3: DataFrame = sqlContext.createDataFrame(rows, schema)
where test1 and test2 are DataFrames produced by queries like (1).
So how do I achieve this nicely? The expected output is:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| 6|null|null|
| B|null| 5|null|
| C|null|null| 7|
+----+----+----+----+
Or even only the distinct values:
+----+
|col1|
+----+
| A|
| B|
| C|
+----+
Thanks in advance! Best
Upvotes: 2
Views: 1807
Reputation: 533
You can use something like the query below:
sqlcontext.sql("select x.* from table_name x ,
(select max(col2) as a,max(col3) as b, max(col4) as c from table_name ) temp
where a=x.col2 or b= x.col3 or c=x.col4")
This will give the desired result.
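For reference, the same logic can also be written with the DataFrame API. This is a minimal sketch, assuming Spark 2.1+ (for crossJoin) and that df is the original DataFrame:

import org.apache.spark.sql.functions.{col, max}

// one-row DataFrame holding the global maxima of col2, col3 and col4
val maxima = df.agg(max("col2").as("a"), max("col3").as("b"), max("col4").as("c"))

// attach the maxima to every row and keep the rows that hold one of them;
// comparisons against null are never true, so null columns cannot match
val result = df.crossJoin(maxima)
  .where(col("col2") === col("a") || col("col3") === col("b") || col("col4") === col("c"))
  .select("col1", "col2", "col3", "col4")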
Upvotes: 2
Reputation: 564
It can be solved like this:
df.registerTempTable("temp")
spark.sql("SELECT max(col2) AS max2, max(col3) AS max3, max(col4) AS max4 FROM temp").registerTempTable("max_temp")
spark.sql("SELECT col1 FROM temp, max_temp WHERE col2 = max2 OR col3 = max3 OR col4 = max4").show
Upvotes: 1