Pyd
Pyd

Reputation: 6159

Sql query without using inner join in it

I have a spark Dataset inputDS Dataset<Row> like below

  +---------------+---------------+----------------+-------+--------------+--------+
  |  time         | thingId       |     controller | module| variableName |  value |
  +---------------+---------------+----------------+-------+--------------+--------+
  |1554188276412  |  0002019000000|        0       | 0     |Voltage       |    9   |
  |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
  +---------------+---------------+----------------+-------+--------------+--------+

Generated by STEP 1

Dataset<Row> inputDS = readInput.groupby("thingId","controller","module","variableName").agg(max(struct("time","value")).as("time_value_struct")).select("thingId","controller","module","variableName","time_value_struct.*");

Expected output

 +---------------+---------------+----------------+-------+--------------+--------+
 |  time         | thingId       |     controller | module| variableName |  value |
 +---------------+---------------+----------------+-------+--------------+--------+
 |1554188639406  |  0002019000000|        0       | 0     |Voltage       |    9   |
 |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
 +---------------+---------------+----------------+-------+--------------+--------+

Max(time) column for that thingId,controller,module and variableName

End goal is to get last updatedvalue for every thingId, controller,module and variableName based on MAX(time) column.

Code

inputDS.createOrReplaceTempView("intermediate");

Dataset<Row> outputDS = spark.sql("select B.time,A.thingId,A.controller,A.module,A.variableName,A.value from intermediate A 
inner join (select thingId,controller,module,MAX(time)time from intermediate group by thingId,controller,module) B 
on A.thingId=B.thingId and A.controller=B.controller and A.module=B.module");

SQL query works as expected but using inner join doesn't look efficient

1) Is there is any other efficient way we could get the expected output without inner join or equivalent where condition.

2) it would be great if we could able to get the expected output from STEP 1

 Dataset<Row> intermediate = inputDS.groupby("thingId","controller","module","variableName").agg(max(struct("time","value")).as("time_value_struct")).select("thingId","controller","module","variableName","time_value_struct.*");

Upvotes: 2

Views: 84

Answers (1)

Tim Biegeleisen
Tim Biegeleisen

Reputation: 522762

Here is a variation of your current join query, which relies on ROW_NUMBER:

SELECT time, thingId, controller, module, variableName, "value"
FROM
(
    SELECT t.*, ROW_NUMBER() OVER (PARTITION BY thingId, controller, module
                                   ORDER BY time DESC) rn
    FROM intermediate
) t
WHERE rn = 1;

Analytic functions usually can beat out an older school approach such as a join.

Upvotes: 2

Related Questions