Reputation: 77
I need to optimize a query in Hive where I am joining tables using partitions but using two different partitions:
select *
from k1_core_dim_dl_tables.dl_item_dim a inner
join k1_repl_dl_tables.repl_managed_sku b on b.mds_fam_id = a.mds_fam_id
join k1_core_dim_dl_tables.dl_store_club_dim c on c.store_nbr = b.store_nbr inner
join k1_core_dim_dl_tables.dl_calendar_dim d on current_date = d.cal_dt
group by c.state_prov_cd, a.upc_nbr, a.item_nbr, a.signing_desc, a.dept_nbr, b.store_nbr, c.store_nm, d.cal_wk_nbr
limit 10;
the partitions would be:
for k1_core_dim_dl_tables.dl_item_dim, k1_repl_dl_tables.repl_managed_sku and k1_core_dim_dl_tables.dl_store_club_dim, partition op_cmpny_cd=WMT-K1
for k1_core_dim_dl_tables.dl_calendar_dim, partition geo_region_cd=K1
Upvotes: 0
Views: 914
Reputation: 9768
In Hive you can partition a table by a column (or group of columns), so you might partition a transaction table by date (or even by date and hour if it was very large). If you then write a query that filters by the partitioning columns it will only scan the in-scope partitions rather than the full table. So if you queried your transaction table where transaction date = 2020-12-04 it would only scan a single partition.
So partitioning can help with query performance but it has limitations, mainly that if you filter a query by anything other than partitioning column then you still need to do a full table scan e.g. query your transaction table where customer ='ABC' and amount > 1000 would do a full table scan
BTW - you can only have one partitioning structure per table
Hope this helps?
Upvotes: 1