underwood
underwood

Reputation: 943

Performance of Group By on Partition Column in Hive

I have a table with 4 columns with col4 as the partition column in Hive. This is a huge table with ~9M rows inserted every 5 hours. I have a restriction that I cannot change the design of this table as it is used for other reports as well.

CREATE EXTERNAL TABLE testdb.table1(col1 string, col2 int, col3 int)
PARTITIONED BY (col4 string)
ROW FORMAT DELIMITED
STORED AS TEXTFILE
LOCATION '/path/to/input/';

For one of the use cases, I'm trying to create a lookup table to identify some data points for each value in col4 like

select col1, col4 from testdb.table1 group by col1, col4;

Questions:

  1. Will there be any performance hit by doing a Group By on partitioned column? Col4 in this case.

  2. Will it still be a full table scan when I do a Group By on col4?

  3. What is the best way to create a lookup on a partitioned column with minimal full table scan?

I came across this post but it was for Impala. I'm not sure if Impala and Hive use the same MR Engine internally for data processing. So posting this as a new question.

Upvotes: 1

Views: 2136

Answers (1)

leftjoin
leftjoin

Reputation: 38335

  1. Switch on mapper-side aggregation for the best performance:

set hive.map.aggr=true;

and compare performance with partitioned and non-partitioned column. In case of partitioned column the data is already partially grouped (files belong to single partition) and map-side aggregation will be performed a bit faster because mappers will create less groups each and less data will be passed to reducers. In other words, all the records needed to perform aggregation is inside the single partition and each file contains only one group (few groups if group by non-partition column also). But over-partitioning may result in too many files and performance degradation.

  1. Why shouldn't it be a full scan if your query requires full scan? Yes it will be full scan for sure because you have no WHERE clause and selecting not only partition column.

  2. Full table scan means no partition pruning. Add WHERE condition on partition column to restrict partitions. If you use only partition columns in the query, technically it can use metadata only, but this is rare case and not like your query.

You can benefit from partitioning best if you are using partition in the WHERE clause.

Using partitioning and DISTRIBUTE+SORT while loading compressed and ORC table you can significantly reduce the size of compressed files (2x or even more), but I have never noticed SIGNIFICANT performance gain from it.

Upvotes: 1

Related Questions