Reputation: 205
What is the best practice in hive for bucketing a star schema model?
lets say I have a fact with 3 dimensions
f_test:
customer_key,
vendor_key,
country_key,
measures
d_customer
d_vendor
d_country
How would you bucket the above use case? Bucket dimensions based on keys and fact a composite bucket (customer, vendor, country) ?
Please advice on best practice.
Upvotes: 0
Views: 110
Reputation: 9788
Bucketing is used to improve query performance so without knowing how your users are going to query your data it is impossible to recommend how to bucket it e.g. if most queries of the fact table are by customer attributes then bucketing on customer_key makes sense.
Unless you have very high volumes of data in your Dims it's probably not worth bucketing them e.g. I assume Country only has ca. 200 records.
Unfortunately it's one of the main limitations of using Hive/Impala/etc. as an analytical platform in that you have very limited scope for improving performance via table design i.e. you can only partition/bucket a table in one way and therefore only support one query pattern. In your example, if your fact table is queried by customer and vendor equally there is no way to improve performance of both types of query and you just have to rely on the "horsepower" of the platform to process the queries.
Compare this to a conventional DB where you can just add a new index to support a query if needed
Upvotes: 1