Dhiraj
Dhiraj

Reputation: 3696

How to make a Hive query take advantage of statistics stored in Metastore

I am using Hive version 1.2.1. If I run select count(*) from mytable, I see that it launches a Tez job. So obviously it is not using any table statistics information because ideally number of rows should be fetched from table statistics stored in the Hive Metastore. Also I explicitly checked all the tables in the Hive Metastore, and I didn't find any table name there, which would sort of suggest that it stores table statistics. The only next best relevant table that I could see was TAB_COL_STATS, but this table only stores column level statistics and that too it only had very handful number of rows, the table had 10 rows. This poses two questions.

  1. Does this version of Hive (1.2.1) not support table statistics?
  2. If this Metastore table i.e. TAB_COL_STATS stores everything, why any column like num_rows is not part of this table schema? I see max,min,avg,num_distinct etc type of columns.
  3. When I query a table for some statistics (e.g. number of rows), do I have to switch on some option so it will take advantage of internally stored statistics instead of running a Tez job?

Upvotes: 2

Views: 1472

Answers (1)

damientseng
damientseng

Reputation: 553

  1. It supports table level statistics.
  2. Hive metastore tables are highly normalized. You can find info about num of rows in TABLE_PARAMS or PARTITION_PARAMS.
  3. You should set hive.compute.query.using.stats to true to make use of metadata for queries like select count(*)....

    But beforehand, make sure those statistics actually exists.

    If not, run analyze table mytable compute statistics to gather it first.

    Or you can set hive.stats.autogather to true to enforce the gathering when data is inserted into tables.

Upvotes: 2

Related Questions