Jack

Reputation: 197

Incrementally load data in parquet file using Apache Spark & Java

I have the below-mentioned dataset saved in parquet format. I want to load new data as it arrives and update this same file. For example, if a new id "3" comes in, I can add that new ID using UNION, but if the same ID surfaces again with a later timestamp in the last_updated column, I want to keep only the latest record. How can I achieve this using Apache Spark & Java?

+-------+------------+--------------------+---------+
|     id|display_name|        last_updated|is_active|
+-------+------------+--------------------+---------+
|      1|        John|2018-07-23 08:32:...|     true|
|      2|        Tony|2018-07-22 20:32:...|     true|
+-------+------------+--------------------+---------+

Upvotes: 0

Views: 1356

Answers (1)

lvnt

Reputation: 497

You can get the latest row per id by the last_updated column with a "group by". For example, after the union, you have a dataset like:

+-------+------------+--------------------+---------+
|     id|display_name|        last_updated|is_active|
+-------+------------+--------------------+---------+
|      1|        John|2018-07-23 08:32:...|     true|
|      2|        Tony|2018-07-22 20:32:...|     true|
|      2|        Tony|2018-07-22 21:45:...|     true|
+-------+------------+--------------------+---------+

First, you must load this dataset into a DataFrame and register it as a temporary view. Then the SQL you should write is:

select
  t1.id, t1.display_name, t1.last_updated, t1.is_active
from
  your_temp_view as t1
  inner join (
    select
      id, max(last_updated) as max_last_updated
    from
      your_temp_view
    group by id
  ) as t2 on t1.id = t2.id and t1.last_updated = t2.max_last_updated
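
For completeness, here is a minimal Java sketch of the whole flow: read the existing parquet file, union the incoming rows, register the temp view, run the query above, and write the deduplicated result. The file paths and the your_temp_view name are placeholders for illustration, not fixed names.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class IncrementalParquetLoad {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("IncrementalParquetLoad")
                .getOrCreate();

        // Existing records already saved in parquet (placeholder path)
        Dataset<Row> existing = spark.read().parquet("/data/users");

        // Newly arrived records with the same schema (placeholder path)
        Dataset<Row> incoming = spark.read().parquet("/data/users_incoming");

        // Combine old and new rows, then expose them to Spark SQL
        Dataset<Row> combined = existing.union(incoming);
        combined.createOrReplaceTempView("your_temp_view");

        // Keep only the row with the latest last_updated per id,
        // using the group-by / self-join query from the answer above
        Dataset<Row> latest = spark.sql(
            "select t1.id, t1.display_name, t1.last_updated, t1.is_active " +
            "from your_temp_view as t1 " +
            "inner join (" +
            "  select id, max(last_updated) as max_last_updated " +
            "  from your_temp_view group by id" +
            ") as t2 on t1.id = t2.id and t1.last_updated = t2.max_last_updated");

        // Write the deduplicated result out as parquet
        latest.write().mode(SaveMode.Overwrite).parquet("/data/users_deduped");

        spark.stop();
    }
}

Note that Spark cannot overwrite a parquet file while it is still reading from the same path in one job, so the result is written to a separate location here.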

Upvotes: 1
