The Anh Nguyen

Reputation: 861

In which scenarios is disabling Hive vectorized execution better than enabling it?

Vectorization in Hive is a feature (available from Hive 0.13.0) that, when enabled, processes a block of 1024 rows at a time instead of one row at a time. This reduces CPU usage for operations such as scans, filters, joins, and aggregations.
It's only available if the data is stored in ORC format, so this question is only about ORC, not other formats like LazySimple ...

In most cases, enabling it (set hive.vectorized.execution.enabled = true;) works well. But I have heard that in some particular cases, disabling it is better for both performance and cluster memory. Can anyone list those cases here? Thank you.
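For reference, a minimal sketch of how the setting can be inspected and toggled for a single session. It assumes a Hive CLI or Beeline session. The reduce-side property hive.vectorized.execution.reduce.enabled is an assumption on my part: it exists only in Hive versions newer than 0.13.0, and it was not part of the test described here.

```sql
-- Print the current value of the property for this session.
SET hive.vectorized.execution.enabled;

-- Turn vectorized execution on for this session only.
SET hive.vectorized.execution.enabled=true;

-- Assumption: a separate reduce-side switch available in newer Hive versions.
SET hive.vectorized.execution.reduce.enabled=true;

-- Turn vectorized execution off again to compare the same query both ways.
SET hive.vectorized.execution.enabled=false;
```

Because SET only affects the current session, the same query can be run twice back to back, once with each value, without touching the cluster-wide configuration.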

Our case

In our case, disabling hive.vectorized.execution.enabled made our Hive query 5 times faster and reduced its memory usage to 3-5% of what it used with vectorization enabled.

SET hive.vectorized.execution.enabled=false;

INSERT OVERWRITE TABLE target_table
SELECT
    source_table.field1["key1"],
    source_table.field2,
    unix_timestamp()
FROM
    source_table
WHERE
    source_table.field3 = "valueX"
    AND source_table.field4 = "valueY"
    AND source_table.field5 IS NULL
    AND source_table.field1["key1"] IS NOT NULL
GROUP BY
    source_table.field1["key1"],
    source_table.field2;
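
One way to see which parts of this query actually run vectorized is to look at the plan. This is only a sketch, and it assumes Hive 2.3 or later, where EXPLAIN VECTORIZATION is available; on older versions this statement will not parse. The table and field names are the same placeholders as in the query above.

```sql
-- Assumption: Hive 2.3+; EXPLAIN VECTORIZATION is not available before that.
SET hive.vectorized.execution.enabled=true;

-- DETAIL reports, per operator, whether it was vectorized and,
-- if not, the reason Hive gives for falling back to row mode.
EXPLAIN VECTORIZATION DETAIL
SELECT
    source_table.field1["key1"],
    source_table.field2,
    unix_timestamp()
FROM
    source_table
WHERE
    source_table.field3 = "valueX"
    AND source_table.field4 = "valueY"
    AND source_table.field5 IS NULL
    AND source_table.field1["key1"] IS NOT NULL
GROUP BY
    source_table.field1["key1"],
    source_table.field2;
```

Comparing this output with the plan produced when the setting is false may help narrow down which operator or expression is responsible for the difference.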
