Reputation: 720
I have the following DataFrame:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| 6|null|null|
| B|null| 5|null|
| C|null|null| 7|
| B|null|null| 4|
| B|null| 2|null|
| B|null| 1|null|
| A| 4|null|null|
+----+----+----+----+
What I would like to do in Spark is return all entries in col1 that hold the maximum value in one of the columns col2, col3, or col4. For the sample data that means A (max of col2 is 6), B (max of col3 is 5), and C (max of col4 is 7).
This snippet won't do what I want:
df.groupBy("col1").max("col2","col3","col4").show()
And this one gives the max for only a single column (1):
df.groupBy("col1").max("col2").show()
I even tried to merge the single outputs like this:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.StructType

// merge rows
val rows = test1.rdd.zip(test2.rdd).map {
  case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)
}
// merge schemas
val schema = StructType(test1.schema.fields ++ test2.schema.fields)
// create the new DataFrame
val test3: DataFrame = sqlContext.createDataFrame(rows, schema)
where test1 and test2 are DataFrames produced by queries like (1).
So how do I achieve this nicely? The expected output is:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| 6|null|null|
| B|null| 5|null|
| C|null|null| 7|
+----+----+----+----+
Or even only the distinct values:
+----+
|col1|
+----+
| A|
| B|
| C|
+----+
Thanks in advance! Best
Upvotes: 2
Views: 1807
Reputation: 533
You can use something like the query below:
sqlcontext.sql("select x.* from table_name x ,
(select max(col2) as a,max(col3) as b, max(col4) as c from table_name ) temp
where a=x.col2 or b= x.col3 or c=x.col4")
This will give the desired result.
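For reference, the same logic can also be written with the DataFrame API. This is a minimal sketch, assuming Spark 2.1+ (for crossJoin) and that df is the original DataFrame:

import org.apache.spark.sql.functions.{col, max}

// one-row DataFrame holding the global maxima of col2, col3 and col4
val maxima = df.agg(max("col2").as("a"), max("col3").as("b"), max("col4").as("c"))

// attach the maxima to every row and keep the rows that hold one of them;
// comparisons against null are never true, so null columns cannot match
val result = df.crossJoin(maxima)
  .where(col("col2") === col("a") || col("col3") === col("b") || col("col4") === col("c"))
  .select("col1", "col2", "col3", "col4")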
Upvotes: 2
Reputation: 564
It can be solved like this:
df.registerTempTable("temp")
spark.sql("SELECT max(col2) AS max2, max(col3) AS max3, max(col4) AS max4 FROM temp").registerTempTable("max_temp")
spark.sql("SELECT col1 FROM temp, max_temp WHERE col2 = max2 OR col3 = max3 OR col4 = max4").show
Upvotes: 1