Sql query without using inner join in it

Question

I have a spark Dataset inputDS Dataset like below

  +---------------+---------------+----------------+-------+--------------+--------+
  |  time         | thingId       |     controller | module| variableName |  value |
  +---------------+---------------+----------------+-------+--------------+--------+
  |1554188276412  |  0002019000000|        0       | 0     |Voltage       |    9   |
  |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
  +---------------+---------------+----------------+-------+--------------+--------+

Generated by STEP 1

Dataset inputDS = readInput.groupby("thingId","controller","module","variableName").agg(max(struct("time","value")).as("time_value_struct")).select("thingId","controller","module","variableName","time_value_struct.*");

Expected output

 +---------------+---------------+----------------+-------+--------------+--------+
 |  time         | thingId       |     controller | module| variableName |  value |
 +---------------+---------------+----------------+-------+--------------+--------+
 |1554188639406  |  0002019000000|        0       | 0     |Voltage       |    9   |
 |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
 +---------------+---------------+----------------+-------+--------------+--------+

Max(time) column for that thingId,controller,module and variableName

End goal is to get last updatedvalue for every thingId, controller,module and variableName based on MAX(time) column.

Code

inputDS.createOrReplaceTempView("intermediate");

Dataset outputDS = spark.sql("select B.time,A.thingId,A.controller,A.module,A.variableName,A.value from intermediate A 
inner join (select thingId,controller,module,MAX(time)time from intermediate group by thingId,controller,module) B 
on A.thingId=B.thingId and A.controller=B.controller and A.module=B.module");

SQL query works as expected but using inner join doesn't look efficient

1) Is there is any other efficient way we could get the expected output without inner join or equivalent where condition.

2) it would be great if we could able to get the expected output from STEP 1

 Dataset intermediate = inputDS.groupby("thingId","controller","module","variableName").agg(max(struct("time","value")).as("time_value_struct")).select("thingId","controller","module","variableName","time_value_struct.*");

Tim Biegeleisen · Accepted Answer

Here is a variation of your current join query, which relies on ROW_NUMBER:

SELECT time, thingId, controller, module, variableName, "value"
FROM
(
    SELECT t.*, ROW_NUMBER() OVER (PARTITION BY thingId, controller, module
                                   ORDER BY time DESC) rn
    FROM intermediate
) t
WHERE rn = 1;

Analytic functions usually can beat out an older school approach such as a join.

Sql query without using inner join in it

Answers (1)

Related Questions