melissa maya
melissa maya

Reputation: 119

Join two dataframes with limiting the rows of one dataframe

I have two dataframes :

df1:
+--------------+---------------------+
|id_device     |tracking_time        |
+--------------+---------------------+
|20            |2020-02-19 02:37:45  |
|5             |2020-02-17 17:15:45  |
+--------------+---------------------+



df2
+--------------+----------------------+
|id_device     |tracking_time         |
+--------------+----------------------+
|20            | 2019-02-19 02:41:45  |
|20            |2020-01-17 17:15:45   |
+--------------+----------------------+

I want to get the following output :

+--------------+---------------------+------------------+
|id_device     |tracking_time        | df2.tracking_time |
+--------------+---------------------+------------------+
|20            |2020-02-19 02:37:45  |2019-02-19 02:41:45|
|5             |2020-02-17 17:15:45  |null               |
+--------------+---------------------+-------------------+

I tried the following code :

df1.registerTempTable("data");
    df2.createOrReplaceTempView("tdays");     
Dataset<Row> d_f = sparkSession.sql("select a.* , b.*  from data as a  LEFT JOIN (select  * from tdays ) as b  on b.id_device == a.id_device and b.tracking_time < a.tracking_time ");

I get the following output :

+----------------------+---------------------+--------------------+------------------ -+
|id_device             |tracking_time        | b.id_device        |b.tracking_time     |
+----------------------+---------------------+--------------------+--------------------+
|20                     |2020-02-19 02:37:45 |20                  | 2019-02-19 02:41:45|
|20                     |2020-02-19 02:37:45 |20                  | 2020-01-17 17:15:45|
|5                      |2020-02-17 17:15:45 |null                |null                |
+-----------------------+--------------------+--------------------+--------------------+

What I want is to join the first dataframe with result of left join ordered by df2.tracking_time desc limit 1

I need your help

Upvotes: 1

Views: 47

Answers (1)

werner
werner

Reputation: 14845

Before the join, you can reduce df2 to the minimum dates for each id_device:

val df1 = ...
val df2 = ...
val df2min = df2.groupBy("id_device").agg(min("tracking_time")).as("df2.tracking_time")
val result = df1.join(df2min, Seq("id_device"), "left")

df2min contains only a single row with the minimum date from df2 per id. Therefore the left join will return the expected result.

Upvotes: 1

Related Questions