Reputation: 331
I have a hive table which is getting inserted few 1000s of record every hour. But when I execute select * from <table>
, it is taking so much time to execute. What is the reason behind this?
Upvotes: 0
Views: 2230
Reputation: 58
It is best practice to partition your data to speed up the queries. Partitioning will make hive run the query on the subset of the data instead of the entire dataset. Creating partitions may be done as follows:
The folder structure should look something like this:
path/to/directory/partition=partition_name
Then on the table itself (assuming it's on an external table) you're create table statement should be something like:
CREATE EXTERNAL TABLE table_name (
...
fields
...
)
PARTITIONED BY (partition)
LOCATION '/path/to/directory'
You can then query the table and treat the partition as another column.
Upvotes: 1
Reputation: 895
I agree with @cricket_007 of using execution engine tez/spark.There are some customization you can do from your end to achieve performance in hive:
Use of vectorization which executes in batches of 1024 rows at once
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Use of CBO
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Upvotes: 1
Reputation: 191953
Hive is not fast to begin with... Not sure what you're expecting, but it will not be on the order of milliseconds.
If you want performance improvements, use Tez or Spark rather than MapReduce execution, also use Hive 2 w/ LLAP, and land the data in ORC or Parquet format.
If you aren't able to do the above, at least place data into hourly partitions. Then actually query against the partition rather than scanning all the rows/columns because Hive does partition pruning.
Also, HDFS doesn't like files smaller than the hdfs block size (128 MB). Anything smaller means wasted time in map tasks
Upvotes: 1