Aakash Basu

Reputation: 1767

Spark SQL Data Warehousing

I have two datasets, as shown below:

Historical:

+-------+-------+-----------+
|prod_id|prod_cd|  prod_desc|
+-------+-------+-----------+
|  42546|Firefox|Firefox 4.0|
|  12432| Chrome| Chrome 1.0|
+-------+-------+-----------+

Current:

+-------+-------+----------+
|prod_id|prod_cd| prod_desc|
+-------+-------+----------+
|  53243|     IE|    IE 2.0|
|  12432| Chrome|Chrome 1.5|
+-------+-------+----------+

I want to run a (Spark) SQL query on the two tables (DataFrames) and get all the historical and new entries. Where a row is common to both, the version from the latest (Current) table should be kept and the one from the old table dropped, much like updating those rows.

So, my output table should look like this:

+-------+-------+-----------+
|prod_id|prod_cd|  prod_desc|
+-------+-------+-----------+
|  42546|Firefox|Firefox 4.0|
|  53243|     IE|     IE 2.0|
|  12432| Chrome| Chrome 1.5|
+-------+-------+-----------+

I used to do this for reconciliation jobs during incremental/delta loads, after the initial load was done on day 0, but I have been out of touch with data warehousing for a long time and have lost that clarity.

Do I need to use a rank function after a union? Or is there a better/faster approach? Any help would be much appreciated.

Upvotes: 0

Views: 137

Answers (2)

Siddhartha Sanyal

Reputation: 11

For the question as posted, using rank does not make sense, because there is no column (for example, a timestamp) that indicates which row is newer.

So for the case you describe, you can use merge logic.

You know which table has the newest data and which has the historical data, so your logic should be to merge the two tables.

Below is one way to do it in Spark SQL:

-- take each column from current when a matching row exists, else from historical
select
    case when (c.prod_id is null) then h.prod_id else c.prod_id end as prod_id,
    case when (c.prod_id is null) then h.prod_cd else c.prod_cd end as prod_cd,
    case when (c.prod_id is null) then h.prod_desc else c.prod_desc end as prod_desc
from historical h
full outer join current c
    on h.prod_id = c.prod_id

If a row exists in current, it is taken from current; otherwise it is taken from historical. The tables are joined on the primary key "prod_id".
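To make the merge rule concrete, here is a minimal plain-Python sketch of the same idea, using the sample rows from the question. It is not Spark code: the function name `merge_on_key` is made up for illustration, and it assumes prod_id is unique within each table.

```python
# Plain-Python illustration of the merge rule (not Spark code).
# Sample rows copied from the question: (prod_id, prod_cd, prod_desc).
historical = [
    (42546, "Firefox", "Firefox 4.0"),
    (12432, "Chrome", "Chrome 1.0"),
]
current = [
    (53243, "IE", "IE 2.0"),
    (12432, "Chrome", "Chrome 1.5"),
]


def merge_on_key(historical_rows, current_rows):
    """Return one row per prod_id, preferring the current row on a match.

    Hypothetical helper; assumes prod_id (index 0) is unique per table.
    """
    # Start from the historical rows, keyed by prod_id.
    merged = {row[0]: row for row in historical_rows}
    # Current rows overwrite matching keys and add new ones.
    merged.update({row[0]: row for row in current_rows})
    return list(merged.values())


merged_rows = merge_on_key(historical, current)
for row in merged_rows:
    print(row)
```

This behaves like the full outer join above: keys found only in historical survive, keys found only in current are added, and keys found in both take the current values.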

Hope this helps!!

Upvotes: 0

Yogesh Sharma

Reputation: 50163

You can use not exists with union all:

select prod_id, prod_cd, prod_desc
from Current c
union all
select prod_id, prod_cd, prod_desc
from Historical h
where not exists (select 1 from Current c1 where c1.prod_id = h.prod_id);

A short explanation:

The first select returns all of the Current data. The second returns only the Historical rows that are not present in Current, with the help of a correlated subquery.

So this produces two result sets, which are combined with the set operator UNION ALL to produce the desired result.
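If you want to check the query logic without a Spark session, a rough sketch like the one below runs the same statement against an in-memory SQLite database from Python's standard library. SQLite is only a stand-in here, not Spark SQL, and the table names `historical` and `current_data` are my own choice for the example (renamed to steer clear of keyword clashes).

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for Spark SQL.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE historical (prod_id INTEGER, prod_cd TEXT, prod_desc TEXT);
    CREATE TABLE current_data (prod_id INTEGER, prod_cd TEXT, prod_desc TEXT);

    INSERT INTO historical VALUES (42546, 'Firefox', 'Firefox 4.0');
    INSERT INTO historical VALUES (12432, 'Chrome', 'Chrome 1.0');

    INSERT INTO current_data VALUES (53243, 'IE', 'IE 2.0');
    INSERT INTO current_data VALUES (12432, 'Chrome', 'Chrome 1.5');
    """
)

# Same shape as the query above: every current row, plus the historical
# rows whose prod_id has no match in the current table.
query = """
SELECT prod_id, prod_cd, prod_desc
FROM current_data c
UNION ALL
SELECT prod_id, prod_cd, prod_desc
FROM historical h
WHERE NOT EXISTS (
    SELECT 1 FROM current_data c1 WHERE c1.prod_id = h.prod_id
)
"""

result_rows = conn.execute(query).fetchall()
for row in result_rows:
    print(row)

conn.close()
```

The result has three rows: Firefox 4.0 from the historical table, and IE 2.0 and Chrome 1.5 from the current one. Row order is not guaranteed without an ORDER BY.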

Upvotes: 2
