Reputation: 197
I have the dataset below saved in Parquet format. I want to load new data as it arrives and update this same file. For example, when a new id such as "3" comes in, I can add it using UNION, but if the same id surfaces again with a later timestamp in the last_updated column, I want to keep only the latest record. How can I achieve this using Apache Spark & Java?
+-------+------------+--------------------+---------+
| id|display_name| last_updated|is_active|
+-------+------------+--------------------+---------+
| 1| John|2018-07-23 08:32:...| true|
| 2| Tony|2018-07-22 20:32:...| true|
+-------+------------+--------------------+---------+
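For illustration, a minimal sketch of the load-and-union step I have in mind (file paths and session setup here are assumptions, not my actual code):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class MergeExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("merge-users")
                    .getOrCreate();

            // Existing data; "users.parquet" is a placeholder path.
            Dataset<Row> existing = spark.read().parquet("users.parquet");

            // Newly arrived records with the same schema.
            Dataset<Row> incoming = spark.read().parquet("users_incoming.parquet");

            // UNION keeps every row, so an id that re-surfaces with a
            // newer last_updated value ends up duplicated -- this is
            // the part I need to solve.
            Dataset<Row> combined = existing.union(incoming);
        }
    }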
Upvotes: 0
Views: 1356
Reputation: 497
You can get the latest row per id by the last_updated column with a group by. For example, after the union you have a dataset like:
+-------+------------+--------------------+---------+
| id|display_name| last_updated|is_active|
+-------+------------+--------------------+---------+
| 1| John|2018-07-23 08:32:...| true|
| 2| Tony|2018-07-22 20:32:...| true|
| 2| Tony|2018-07-22 21:45:...| true|
+-------+------------+--------------------+---------+
First, load this dataset into a DataFrame and register it as a temporary view (your_temp_view below is a placeholder for that view's name). Then the SQL to write is:
select
    t1.id, t1.display_name, t1.last_updated, t1.is_active
from
    your_temp_view as t1
inner join (
    select
        id, max(last_updated) as max_last_updated
    from
        your_temp_view
    group by id
) as t2 on t1.id = t2.id and t1.last_updated = t2.max_last_updated
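Putting it together in Java, a sketch along these lines should work (paths, the view name, and the output location are assumptions you would adapt):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class KeepLatestRecord {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("keep-latest")
                    .getOrCreate();

            // Union the existing Parquet data with the newly arrived rows.
            Dataset<Row> existing = spark.read().parquet("users.parquet");
            Dataset<Row> incoming = spark.read().parquet("users_incoming.parquet");
            Dataset<Row> combined = existing.union(incoming);

            // Register the combined dataset under a temporary view name
            // so the SQL above can reference it.
            combined.createOrReplaceTempView("your_temp_view");

            // Self-join on the per-id maximum of last_updated, keeping
            // only the latest row for each id.
            Dataset<Row> latest = spark.sql(
                "select t1.id, t1.display_name, t1.last_updated, t1.is_active "
              + "from your_temp_view as t1 "
              + "inner join ("
              + "  select id, max(last_updated) as max_last_updated "
              + "  from your_temp_view "
              + "  group by id "
              + ") as t2 on t1.id = t2.id and t1.last_updated = t2.max_last_updated");

            // Write the deduplicated result out. Writing to a different
            // location than the input avoids reading and overwriting the
            // same Parquet path in one job.
            latest.write().mode(SaveMode.Overwrite).parquet("users_deduped.parquet");
        }
    }

Note that Spark cannot safely overwrite a Parquet path while it is still reading from it, so the sketch writes to a separate output path; you could swap the directories afterwards if you need the original location updated in place.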
Upvotes: 1