Reputation: 7346
We have a table which is of size 100TB and we have multiple customers using the same table (i.e every customer uses different where conditions). Now the problem statement is every time a customer tries to query the table it gets scanned from top to bottom.
This creates lot of slowness for all the queries. We cannot even partition/bucket the table basing on any business keys. Can someone provide solution or point to similar problem statements and their resolution.
you can provide your suggestions as well as alternative technologies so that we can pick the best suitable one. Thanks.
Upvotes: 0
Views: 215
Reputation: 9067
My 2 cents: experiment with an ORC table with GZip compression (default) and clever partitioning / ordering...
With clever partitioning & clever ordering of the data at INSERT time, using the most-frequent filters, the pruning can be quite efficient.
Then you can look into optimizations such as using a non-default ORC stripe size, a non-default "bytes-per-reducer" threshold, etc.
Reference:
One last thing: with 15 nodes for running queries and a replication factor of 3, each HDFS block is available "locally" on 3 the nodes (20%) and "remotely" in the rest (80%). A higher replication factor may reduce I/O and network bottlenecks -- at the cost of disk space, of course.
Upvotes: 1