Hive joining tables with partitions

Question

I need to optimize a query in Hive where I am joining tables using partitions but using two different partitions:

select *
from k1_core_dim_dl_tables.dl_item_dim a inner
join k1_repl_dl_tables.repl_managed_sku b on b.mds_fam_id = a.mds_fam_id
join k1_core_dim_dl_tables.dl_store_club_dim c on c.store_nbr = b.store_nbr inner
join k1_core_dim_dl_tables.dl_calendar_dim d on current_date = d.cal_dt
group by c.state_prov_cd, a.upc_nbr, a.item_nbr, a.signing_desc, a.dept_nbr, b.store_nbr, c.store_nm, d.cal_wk_nbr
limit 10;

the partitions would be:

for k1_core_dim_dl_tables.dl_item_dim, k1_repl_dl_tables.repl_managed_sku and k1_core_dim_dl_tables.dl_store_club_dim, partition op_cmpny_cd=WMT-K1

for k1_core_dim_dl_tables.dl_calendar_dim, partition geo_region_cd=K1

NickW · Accepted Answer

In Hive you can partition a table by a column (or group of columns), so you might partition a transaction table by date (or even by date and hour if it was very large). If you then write a query that filters by the partitioning columns it will only scan the in-scope partitions rather than the full table. So if you queried your transaction table where transaction date = 2020-12-04 it would only scan a single partition.

So partitioning can help with query performance but it has limitations, mainly that if you filter a query by anything other than partitioning column then you still need to do a full table scan e.g. query your transaction table where customer ='ABC' and amount > 1000 would do a full table scan

BTW - you can only have one partitioning structure per table

Hope this helps?

Hive joining tables with partitions

Answers (1)

Related Questions