Hive Partition Performance

Question

I would like to ask about Hive partitioning performance. How many records (in term of rows) would I need in order to see the partitioning performance?

Currently, I have 2 million+ records and I have split the table into 2 partitions. My partitioning condition is as follow:

ADD PARTITION (year_month=’2017_07’)
ADD PARTITION (year_month=’2017_08’)
INSERT OVERWRITE TABLE T PARTITION (year_month='2017_07') SELECT * FROM T WHERE st_time < '2017_08_01 00:00:00.0';
INSERT OVERWRITE TABLE T PARTITION (year_month='2017_08') SELECT * FROM T WHERE st_time >= '2017_08_01 00:00:00.0';

When I tried to do some performance testing between the partitioned table and non-partitioned table, the performance of both tables is relatively the same. My performance testing query is as follows:

SELECT * FROM T WHERE st_time < '2017_08_01 00:00:00.0';
SELECT * FROM non_part_table WHERE st_time < '2017_08_01 00:00:00.0';

Am I missing out on some important points?

Also, does partition work on derived column/function? E.g. date_format(st_time, yyyy-MM)

Any help or advice is greatly appreciated.

OneCricketeer · Accepted Answer

the performance of both tables is relatively the same

You forgot the WHERE clause against the partition. Partitions only improve performance when you are selecting data within them.

SELECT * FROM T 
WHERE year_month = '2017_07'
-- AND st_time < '2017_08_01 00:00:00.0'
;

Without this, you're still scanning the whole table for the st_time values.

You can add an EXPLAIN to the queries to see the difference

You'll get additional performance improvements by converting data into Parquet or ORC

Hive Partition Performance

Answers (1)

Related Questions