Sanchit Grover
Sanchit Grover

Reputation: 1008

Hive table with multiple partitions

I have a table (data_table) with multiple partition columns year/month/monthkey.

Directories look something like year=2017/month=08/monthkey=2017-08/files.parquet

Which of the below queries would be faster?

select count(*) from data_table where monthkey='2017-08'

or

select count(*) from data_table where monthkey='2017-08' and year = '2017' and month = '08'

I think the initial time taken by hadoop take to find the required directories in the first case would be more. But want to confirm

Upvotes: 2

Views: 12678

Answers (2)

David דודו Markovitz
David דודו Markovitz

Reputation: 44981

Finding the relevant partitions is a metastore operation and not a file system operation.
It is done by querying the metasore and not by scanning the directories.
The metasore query of the first use-case will most likely be faster than the second use-case but in any case we are talking here on fractions of a second.

Demo

create external table t100k(i int)
partitioned by (x int,y int,xy string)
;

explain dependency select count(*) from t100k where xy='100-1000';

The query that was issued against the metastore:

select "PARTITIONS"."PART_ID" 
from "PARTITIONS"  
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID"     and "TBLS"."TBL_NAME" = 't100k'   
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID"      and "DBS"."NAME" = 'local_db' 
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2 
where (("FILTER2"."PART_KEY_VAL" = '100-1000'))
 

explain dependency select count(*) from t100k where x=100 and y=1000 and xy='100-1000';

The query that was issued against the metastore:

select "PARTITIONS"."PART_ID" 
from "PARTITIONS"  
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID"     and "TBLS"."TBL_NAME" = 't100k'   
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID"      and "DBS"."NAME" = 'local_db' 
inner join "PARTITION_KEY_VALS" "FILTER0" on "FILTER0"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER0"."INTEGER_IDX" = 0 
inner join "PARTITION_KEY_VALS" "FILTER1" on "FILTER1"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER1"."INTEGER_IDX" = 1 
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2 
where ( ( (((case when "FILTER0"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER0"."PART_KEY_VAL" as decimal(21,0)) else null end) = 100) 
and ((case when "FILTER1"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER1"."PART_KEY_VAL" as decimal(21,0)) else null end) = 1000))  
and ("FILTER2"."PART_KEY_VAL" = '100-1000')) )

Upvotes: 5

ForeverLearner
ForeverLearner

Reputation: 2113

Since comment will change the formatting, hence posting here. Kindly accept @Dudu's reply. Please execute the below on metastore DB (mysql in my case):

mysql> select part_id, location, tbl_id, part_name from PARTITIONS as P inner join SDS as S on P.SD_ID = S.SD_ID where P.TBL_ID = 472;
+---------+-------------------------------------------------------------------------+--------+--------------------------------------+
| part_id | location                                                                | tbl_id | part_name                            |
+---------+-------------------------------------------------------------------------+--------+--------------------------------------+
|       7 | hdfs://hostname:8020/tmp/multi_part/2011/01/2011-01 |    472 | year=2011/month=1/year_month=2011-01 |
|       9 | hdfs://hostname:8020/tmp/multi_part/2012/01/2012-01 |    472 | year=2012/month=1/year_month=2012-01 |
+---------+-------------------------------------------------------------------------+--------+--------------------------------------+
2 rows in set (0.00 sec)

The location from both the queries will pull data from same hdfs directory. The only difference in speed will be from the metastore DB query that is already explained in Dudu's answer.

Upvotes: 0

Related Questions