Reputation: 17
I am creating an external Hive table on a parquet file on S3. The commands look like
CREATE EXTERNAL TABLE userinfo(
user_id string,
last_name string,
first_name string
)
PARTITIONED BY (
yr string,
mo string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://mybucket/basedir/'
TBLPROPERTIES (
'serialization.null.format'='');
alter table userinfo add IF NOT EXISTS partition (yr='2021', mo='07');
At this point, if I run "select count(*) from userinfo", I get 0 as the result. But if I then run
ANALYZE TABLE userinfo PARTITION(yr='2021', mo='07') COMPUTE STATISTICS;
and rerun the "select count(*)..." I get the expected row count.
This isn't a show-stopper, but it makes me think I'm doing something/failing to do something that's causing this strange behavior. Any insights are welcome.
Upvotes: 1
Views: 1028
Reputation: 7387
You are not telling optimizer the statistics of the table so it shows 0 because it doesnt know anything about the data in it.
Ideally you should gather statistics whenever you load data into a table/partition. Stale statistics or no statistics will keep optimizer thinking/guessing and thus making queries run longer/consume more memory and sometime return wrong values.
Now, you can setup this property in hive for auto stats gathering -
hive.stats.autogather=true
Or you can gather them manually right after loading data into it.
ANALYZE TABLE tab COMPUTE STATISTICS;
Upvotes: 1