Vectorization in Hive is a feature (available since Hive 0.13.0) that, when enabled, processes a block of 1024 rows at a time instead of one row at a time. This improves CPU efficiency for operations such as scans, filters, joins, and aggregations.
It's only available if the data is stored in ORC format, so let's leave aside anything other than ORC, like LazySimple ...
OK, in most cases enabling it with set hive.vectorized.execution.enabled = true; works well. But I have heard that in some particular cases disabling it is better for both performance and cluster memory. Can anyone list those cases here? Thank you.
In our case, disabling hive.vectorized.execution.enabled made our Hive query 5 times faster and reduced memory usage to 3-5% of what it had been with vectorization enabled. This is the query:
SET hive.vectorized.execution.enabled=false;
INSERT OVERWRITE TABLE target_table
SELECT
    source_table.field1["key1"],
    source_table.field2,
    unix_timestamp()
FROM
    source_table
WHERE
    source_table.field3 = "valueX"
    AND source_table.field4 = "valueY"
    AND source_table.field5 IS NULL
    AND source_table.field1["key1"] IS NOT NULL
GROUP BY
    source_table.field1["key1"],
    source_table.field2
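
For anyone trying to reproduce this, one way to see how much of a query like the one above is actually vectorized is to inspect its plan. This is only a sketch: it assumes Hive 2.3.0 or later, where EXPLAIN VECTORIZATION is available (I have not confirmed it applies to the version we run), and it uses only the SELECT part of the statement.

```sql
-- Sketch only: assumes Hive 2.3.0+ (EXPLAIN VECTORIZATION is not available earlier).
-- Turn vectorization on so the plan reports what would be vectorized.
SET hive.vectorized.execution.enabled=true;

-- DETAIL asks for per-operator and per-expression vectorization information.
EXPLAIN VECTORIZATION DETAIL
SELECT
    source_table.field1["key1"],
    source_table.field2,
    unix_timestamp()
FROM
    source_table
WHERE
    source_table.field3 = "valueX"
    AND source_table.field4 = "valueY"
    AND source_table.field5 IS NULL
    AND source_table.field1["key1"] IS NOT NULL
GROUP BY
    source_table.field1["key1"],
    source_table.field2;
```

If parts of the plan are reported as not vectorized, the reason given there may help narrow down which expression or operator is involved. Comparing run time and memory with the setting on and off, as we did, is still the real test.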