clairvoyant
clairvoyant

Reputation: 205

How we do bucketing in hive for star schema

What is the best practice in hive for bucketing a star schema model?

lets say I have a fact with 3 dimensions

f_test:

customer_key,
vendor_key,
country_key,
measures


d_customer
d_vendor
d_country

How would you bucket the above use case? Bucket dimensions based on keys and fact a composite bucket (customer, vendor, country) ?

Please advice on best practice.

Upvotes: 0

Views: 110

Answers (1)

NickW
NickW

Reputation: 9788

Bucketing is used to improve query performance so without knowing how your users are going to query your data it is impossible to recommend how to bucket it e.g. if most queries of the fact table are by customer attributes then bucketing on customer_key makes sense.

Unless you have very high volumes of data in your Dims it's probably not worth bucketing them e.g. I assume Country only has ca. 200 records.

Unfortunately it's one of the main limitations of using Hive/Impala/etc. as an analytical platform in that you have very limited scope for improving performance via table design i.e. you can only partition/bucket a table in one way and therefore only support one query pattern. In your example, if your fact table is queried by customer and vendor equally there is no way to improve performance of both types of query and you just have to rely on the "horsepower" of the platform to process the queries.

Compare this to a conventional DB where you can just add a new index to support a query if needed

Upvotes: 1

Related Questions