Jessi joseph

Reputation: 191

How to select all columns in spark sql query in aggregation function

Hi, I am new to Spark SQL.

I have a query like this:

val highvalueresult = averageDF.select($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID", $"RSSI_Weight_avg").groupBy("tagShortID", "Timestamp").agg(max($"RSSI_Weight_avg").alias("maxAvgValue"))

This prints only 3 columns:

tagShortID,Timestamp,maxAvgValue

But I want to display all the columns along with this one. Any help or suggestion would be appreciated.

Upvotes: 3

Views: 5783

Answers (2)

Daniel de Paula

Reputation: 17872

One alternative, usually a good fit for your specific case, is to use window functions, because it avoids the need to join with the original data:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window.partitionBy("tagShortID", "Timestamp")

val result = averageDF.withColumn("maxAvgValue", max($"RSSI_Weight_avg").over(windowSpec))

You can find good articles online explaining the window functions functionality in Spark.

Please note that it requires either Spark 2+ or a HiveContext in Spark versions 1.4 to 1.6.
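If you want to see what the window max does without a Spark shell, its semantics can be sketched with plain Scala collections (the case class and field names below are hypothetical stand-ins for the DataFrame columns): every row keeps all of its columns and simply gains the partition-wide maximum.

```scala
// Plain-Scala sketch of Window.partitionBy("tagShortID", "Timestamp")
// with max("RSSI_Weight_avg").over(windowSpec): each row is kept intact
// and annotated with the max of its partition.
case class Reading(tagShortID: Int, timestamp: Int, rssiWeightAvg: Double)

val rows = Seq(
  Reading(1, 1, 1.0),
  Reading(2, 2, 2.0),
  Reading(2, 2, 1.5)
)

// Group rows by the partition key, compute the max per group,
// then attach it to every row in that group (like withColumn + over).
val withMax: Seq[(Reading, Double)] =
  rows.groupBy(r => (r.tagShortID, r.timestamp)).values.flatMap { group =>
    val maxAvg = group.map(_.rssiWeightAvg).max
    group.map(r => (r, maxAvg))
  }.toSeq
```

Note that, unlike the groupBy/agg version, no columns are dropped and no join is needed afterwards.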

Upvotes: 1

koiralo

Reputation: 23119

Here is a simple example with the column names you have.

This is your averageDF DataFrame with dummy data:

+----------+---------+---------------+---------+--------+---------------+
|tagShortID|Timestamp|ListenerShortID|rootOrgID|subOrgID|RSSI_Weight_avg|
+----------+---------+---------------+---------+--------+---------------+
|         2|        2|              2|        2|       2|              2|
|         2|        2|              2|        2|       2|              2|
|         2|        2|              2|        2|       2|              2|
|         1|        1|              1|        1|       1|              1|
|         1|        1|              1|        1|       1|              1|
+----------+---------+---------------+---------+--------+---------------+

After you have a groupBy and aggregation:

val highvalueresult = averageDF.select($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID", $"RSSI_Weight_avg").groupBy("tagShortID", "Timestamp").agg(max($"RSSI_Weight_avg").alias("maxAvgValue"))

This does not return all the columns you selected, because after a groupBy and aggregation only the grouping and result columns are returned, as below:

+----------+---------+-----------+
|tagShortID|Timestamp|maxAvgValue|
+----------+---------+-----------+
|         2|        2|          2|
|         1|        1|          1|
+----------+---------+-----------+

To get all the columns, you need to join these two DataFrames:

averageDF.join(highvalueresult, Seq("tagShortID", "Timestamp"))
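The aggregate-then-join pattern can also be sketched with plain Scala collections, without a Spark session (the case class and field names are hypothetical stand-ins for the columns): the aggregation keeps only the grouping keys plus the max, and the join restores the remaining columns from the original rows.

```scala
// Plain-Scala sketch of groupBy/agg(max) followed by a join back
// on the grouping keys to recover the other columns.
case class Avg(tagShortID: Int, timestamp: Int,
               listenerShortID: Int, rssiWeightAvg: Double)

val averageRows = Seq(
  Avg(1, 1, 1, 1.0),
  Avg(2, 2, 2, 2.0),
  Avg(2, 2, 3, 1.5)
)

// Step 1: groupBy + max -> only the key columns and maxAvgValue survive,
// like highvalueresult above.
val highValue: Map[(Int, Int), Double] =
  averageRows.groupBy(r => (r.tagShortID, r.timestamp))
    .map { case (key, group) => key -> group.map(_.rssiWeightAvg).max }

// Step 2: join on the key columns to bring back all original columns.
val joined: Seq[(Avg, Double)] =
  averageRows.map(r => (r, highValue((r.tagShortID, r.timestamp))))
```

Each original row reappears with its full set of fields plus the per-key maximum, which is exactly what the `averageDF.join(highvalueresult, ...)` line produces.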

and the final result will be:

+----------+---------+---------------+---------+--------+---------------+-----------+
|tagShortID|Timestamp|ListenerShortID|rootOrgID|subOrgID|RSSI_Weight_avg|maxAvgValue|
+----------+---------+---------------+---------+--------+---------------+-----------+
|         2|        2|              2|        2|       2|              2|          2|
|         2|        2|              2|        2|       2|              2|          2|
|         2|        2|              2|        2|       2|              2|          2|
|         1|        1|              1|        1|       1|              1|          1|
|         1|        1|              1|        1|       1|              1|          1|
+----------+---------+---------------+---------+--------+---------------+-----------+

I hope this clears your confusion.

Upvotes: 1
