Reputation: 57
I would like to ask about Hive partitioning performance. How many records (in term of rows) would I need in order to see the partitioning performance?
Currently, I have 2 million+ records and I have split the table into 2 partitions. My partitioning condition is as follow:
ADD PARTITION (year_month=’2017_07’)
ADD PARTITION (year_month=’2017_08’)
INSERT OVERWRITE TABLE T PARTITION (year_month='2017_07')
SELECT * FROM T WHERE st_time < '2017_08_01 00:00:00.0';
INSERT OVERWRITE TABLE T PARTITION (year_month='2017_08')
SELECT * FROM T WHERE st_time >= '2017_08_01 00:00:00.0';
When I tried to do some performance testing between the partitioned table and non-partitioned table, the performance of both tables is relatively the same. My performance testing query is as follows:
SELECT * FROM T WHERE st_time < '2017_08_01 00:00:00.0';
SELECT * FROM non_part_table WHERE st_time < '2017_08_01 00:00:00.0';
Am I missing out on some important points?
Also, does partition work on derived column/function? E.g. date_format(st_time, yyyy-MM)
Any help or advice is greatly appreciated.
Upvotes: 2
Views: 673
Reputation: 191953
the performance of both tables is relatively the same
You forgot the WHERE clause against the partition. Partitions only improve performance when you are selecting data within them.
SELECT * FROM T
WHERE year_month = '2017_07'
-- AND st_time < '2017_08_01 00:00:00.0'
;
Without this, you're still scanning the whole table for the st_time
values.
You can add an EXPLAIN
to the queries to see the difference
You'll get additional performance improvements by converting data into Parquet or ORC
Upvotes: 2