Kashyap
Kashyap

Reputation: 17466

Optimal query/join for joining scd-type-2 dimension with facts for reporting/aggregation

I have a fact table and an scd-type-2 dimension table. I want to produce sales report by region and year.

I have working solution with a query that joins them for reporting purposes. When I run the query in spark/databricks, it gives me a little warning at the bottom: Use range join optimization: This query has a join condition that can benefit from range join optimization. To improve performance, consider adding a range join hint. and points to this link.

Question: Is there a more optimal way to query when I'm joining using a between condition (instead of = condition)?

fact table:

create table sales
(name             string
,sale_date        date
,sold_amt         long);

insert into sales values
('John','2022-02-02',100),
('John','2022-03-03',100),
('John','2023-02-02',200),
('John','2023-03-03',200),
('Rick','2022-02-02',300),
('Rick','2023-02-02',400);

dimension table (scd-type-2)

create table employee_scd2
(name        string
,region      string
,start_date  date,
,end_date    date,
,is_current  boolean); -- unused, kept for completeness

insert into employee_scd2 values
('John','NAM',  '2010-01-01', '2022-12-31', false),
-- John transferred from NAM to APAC starting 2023
('John','APAC', '2023-01-01', '9999-01-01', true),
('Rick','NAM',  '2020-01-01', '9999-12-31', true);

sales report by region and year

select e.region,
       year(s.sale_date) as sale_year,
       SUM(s.sold_amt)   as sale_amt
from       sales s
left join  employee_scd2 e
  on     e.name       = s.name
  and    s.sale_date between e.start_date and e.end_date
group by e.region, year(s.sale_date);

I've read following and some more:

Upvotes: 1

Views: 1201

Answers (0)

Related Questions