James
James

Reputation: 57

Hive Partition Performance

I would like to ask about Hive partitioning performance. How many records (in term of rows) would I need in order to see the partitioning performance?

Currently, I have 2 million+ records and I have split the table into 2 partitions. My partitioning condition is as follow:

When I tried to do some performance testing between the partitioned table and non-partitioned table, the performance of both tables is relatively the same. My performance testing query is as follows:

Am I missing out on some important points?

Also, does partition work on derived column/function? E.g. date_format(st_time, yyyy-MM)

Any help or advice is greatly appreciated.

Upvotes: 2

Views: 673

Answers (1)

OneCricketeer
OneCricketeer

Reputation: 191953

the performance of both tables is relatively the same

You forgot the WHERE clause against the partition. Partitions only improve performance when you are selecting data within them.

SELECT * FROM T 
WHERE year_month = '2017_07'
-- AND st_time < '2017_08_01 00:00:00.0'
;

Without this, you're still scanning the whole table for the st_time values.

You can add an EXPLAIN to the queries to see the difference

You'll get additional performance improvements by converting data into Parquet or ORC

Upvotes: 2

Related Questions