wandermonk

Reputation: 7346

Query optimization for hive table

We have a 100 TB table, and multiple customers query the same table (i.e. every customer uses different WHERE conditions). The problem is that every time a customer queries the table, it gets scanned from top to bottom.

This slows down all the queries. We cannot even partition/bucket the table based on any business keys. Can someone provide a solution, or point to similar problem statements and their resolutions?

You can provide suggestions as well as alternative technologies, so that we can pick the most suitable one. Thanks.

Upvotes: 0

Views: 215

Answers (1)

Samson Scharfrichter

Reputation: 9067

My 2 cents: experiment with an ORC table with ZLIB compression (the default) and clever partitioning / ordering...

  • every SELECT that uses a partition key in its WHERE clause will do "partition pruning" and thus avoid scanning everything [OK, OK, you said you had no good candidate in your specific case, but in general it can be done, so I had to mention it first]
  • then, within each ORC file in scope, the min/max statistics will be checked for "stripe pruning", limiting the I/O further (see the sketch after this list)
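A minimal sketch of what that could look like, assuming a hypothetical events table partitioned by date (the table and column names are illustrative, not from the original question):

    -- Hypothetical example: an ORC table with ZLIB compression,
    -- partitioned by event date (names are illustrative only)
    CREATE TABLE events (
      customer_id BIGINT,
      event_type  STRING,
      payload     STRING
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
    TBLPROPERTIES ("orc.compress"="ZLIB");

    -- A query filtering on the partition key only reads matching partitions
    SELECT COUNT(*)
    FROM events
    WHERE event_date = '2016-01-15' AND event_type = 'click';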

With clever partitioning and clever ordering of the data at INSERT time, based on the most frequent filters, the pruning can be quite efficient.
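For instance, a sketch of loading data sorted on a frequently filtered column, so that each ORC stripe covers a narrow min/max range (the staging table and column names are again hypothetical):

    -- Sorting on customer_id at INSERT time keeps each stripe's
    -- min/max range narrow, so filters on customer_id can skip stripes
    INSERT OVERWRITE TABLE events PARTITION (event_date = '2016-01-15')
    SELECT customer_id, event_type, payload
    FROM staging_events
    WHERE event_date = '2016-01-15'
    DISTRIBUTE BY customer_id
    SORT BY customer_id;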

Then you can look into optimizations such as using a non-default ORC stripe size, a non-default "bytes-per-reducer" threshold, etc.
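As an illustration, those knobs might look like this (the values are placeholders to experiment with, not recommendations):

    -- Hypothetical tuning values; experiment rather than copying blindly
    SET hive.exec.reducers.bytes.per.reducer=268435456;  -- 256 MB per reducer

    -- Non-default stripe size; only affects files written after the change
    ALTER TABLE events SET TBLPROPERTIES ("orc.stripe.size"="134217728");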

Reference:

  1. http://fr.slideshare.net/oom65/orc-andvectorizationhadoopsummit
  2. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
  3. https://streever.atlassian.net/wiki/display/HADOOP/Optimizing+ORC+Files+for+Query+Performance
  4. http://thinkbig.teradata.com/hadoop-performance-tuning-orc-snappy-heres-youre-missing/

One last thing: with 15 nodes running queries and a replication factor of 3, each HDFS block is available "locally" on 3 of the nodes (20%) and only "remotely" on the rest (80%). A higher replication factor may reduce I/O and network bottlenecks -- at the cost of disk space, of course.
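If you want to try that, replication can be raised on the table's directory; a sketch using the dfs command available from the Hive CLI (the warehouse path is hypothetical):

    -- Raise replication to 5 for the table's files (path is illustrative);
    -- the -w flag waits for re-replication to complete
    dfs -setrep -w 5 /apps/hive/warehouse/mydb.db/events;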

Upvotes: 1
