We have a simple query running on a table/view that is approximately 5 TB in size. We are performing ETL and finally adding the data to the core table by adding a partition.
But as the data we are processing is huge, the query spawns 4000+ mappers and 1000+ reducers, and runs for 40+ minutes.
How can I reduce resource utilization?
Query:
insert overwrite table dss.prblm_mtrc partition (LOAD_DT) select *, '2019-01-02' as LOAD_DT from dss.v_prblm_mtrc_stg_etl
Use a static partition. If there are already many partitions in the target table, Hive will scan them faster before the final load. See also: HIVE Dynamic Partitioning tips
insert overwrite table dss.prblm_mtrc partition (LOAD_DT='2019-01-02') select * from dss.v_prblm_mtrc_stg_etl
If your source table has too many small files, try to reduce them: use CONCATENATE for ORC files, or enable merging of small files when loading the source table.
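As a sketch of both options (the staging table name below is a placeholder, since the source here is a view; merge thresholds are illustrative starting points, not recommendations):

```sql
-- Option 1: merge small files of an existing ORC partition in place.
-- your_stg_table and the partition value are hypothetical.
ALTER TABLE your_stg_table PARTITION (load_dt='2019-01-01') CONCATENATE;

-- Option 2: merge small files at the end of the job that loads the source table.
set hive.merge.mapfiles=true;                -- merge outputs of map-only jobs
set hive.merge.mapredfiles=true;             -- merge outputs of map-reduce jobs
set hive.merge.smallfiles.avgsize=16000000;  -- merge if avg output file is below ~16 MB
set hive.merge.size.per.task=256000000;      -- target size of merged files (~256 MB)
```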
Enable vectorized execution, and tune mapper and reducer parallelism: https://stackoverflow.com/a/48487306/2700344
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
All vectorization settings are here.
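To tune mapper and reducer parallelism directly, the split and per-reducer sizes can be raised so fewer tasks are spawned; the byte values below are examples to experiment with, not recommendations:

```sql
-- Larger splits -> fewer mappers (MapReduce; values in bytes, illustrative)
set mapreduce.input.fileinputformat.split.minsize=256000000;
set mapreduce.input.fileinputformat.split.maxsize=512000000;
-- More bytes per reducer -> fewer reducers
set hive.exec.reducers.bytes.per.reducer=1073741824;
-- Tez equivalent for grouping splits, if running on the Tez engine
set tez.grouping.max-size=512000000;
```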
And if your target table already contains many loaded partitions, try switching off automatic statistics gathering; this may speed up loading:
set hive.stats.autogather=false;
Also, if your source table is ORC and split calculation takes too long, try this setting:
set hive.exec.orc.split.strategy=BI;
More about ORC strategy: ORC split strategy