Suppose, hypothetically, that I have a star schema in a data warehouse setting. There is one VERY, VERY long fact table (think billions to trillions of rows) and several low-cardinality dimension tables (think 100 dimension tables). Each fact table foreign key that points to a dimension table primary key is bitmap indexed. Each dimension table primary key is also bitmap indexed. This is all for fast joins. All pretty standard.
Suppose that the data warehouse is starting to show performance degradation. The time it takes to return results from a bitmapped join is getting worse the longer the fact table gets. The business requirement is that the fact table keeps growing (we cannot move data older than a year off to archival storage).
I'm thinking of the following solutions:
Has anyone ever done this before?
Does anyone have any tips for solution #3?
Is the HBase solution optimal as far as scaling up with fast reads is concerned?
As far as writes go, I don't need fast writes, since they would be done off-hours as batch processes.
If anyone has chosen solution 1 or 2, have you used a consistent hashing algorithm (to avoid the full remapping that a plain old hash requires when partitions/hash keys are added dynamically)? Dynamic growth in the number of partitions without a full remap is probably not an option (I haven't seen it done in practice as far as partitioned tables are concerned), so it just seems to me that any partitioning solution would lead to scaling problems.
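For context, the remapping concern is exactly what a consistent-hash ring addresses: when a partition is added, only the keys that fall between its ring points and their predecessors move. Here is a minimal, hypothetical Java sketch of the idea (the partition names, virtual-node count, and MD5 hashing are my own illustrative choices, not tied to any particular database's partitioning feature):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring: adding a partition only remaps the keys that
// land between the new partition's points and their predecessors on the ring.
public class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addPartition(String partition) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(partition + "#" + i), partition);
        }
    }

    public void removePartition(String partition) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(partition + "#" + i));
        }
    }

    // Walk clockwise from the key's hash to the first partition point.
    public String partitionFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no partitions");
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL); // first 8 bytes as ring position
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        ring.addPartition("fact_part_01");   // hypothetical partition names
        ring.addPartition("fact_part_02");
        System.out.println(ring.partitionFor("fact_row_key_12345"));
        ring.addPartition("fact_part_03");   // only ~1/3 of keys should remap
        System.out.println(ring.partitionFor("fact_row_key_12345"));
    }
}
```

Whether a given RDBMS's partitioned tables can actually consume such a ring without a rebuild is a separate question, which is really what the doubt above comes down to.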
Any thoughts, advice, or experience with moving a giant fact table with many dimensions (a traditional DW star schema) to a giant dimensionless HBase table?
Related question:
How do aggregate data collections that traditionally reside in materialized views (or alternatively as separate fact tables linked to the same dimensions as the most granular fact table, e.g. hourly/daily/weekly/monthly where the base fact table is hourly) in a data warehouse map across to HBase?
My thinking is that, since there are no materialized views in HBase, the aggregate data collections would be stored as HBase tables that get updated/inserted into any time there is a change to the most granular, lowest-level fact table.
Any thoughts on aggregate tables in HBase? Has anyone used Hive scripts to essentially mimic the behavior of materialized views, i.e. updating aggregate column data in secondary HBase tables (e.g. daily_aggregates_fact_table, weekly_aggregates_fact_table, monthly_aggregates_fact_table) whenever the most granular fact table changes?
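Since HBase has no materialized views, one common pattern is to maintain the aggregates yourself, e.g. with atomic counters, as rows land in (or are batch-loaded into) the base table. A minimal sketch using the standard HBase Java client follows; the table names (hourly_fact, daily_aggregates_fact_table), the column family "m", and the key layout are assumptions of mine for illustration, not anything HBase prescribes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: on each write to the granular (hourly) fact table, roll the measure
// into a daily aggregate table with a server-side atomic Increment.
public class AggregateMaintainer {
    private static final byte[] CF = Bytes.toBytes("m");          // assumed column family
    private static final byte[] AMOUNT = Bytes.toBytes("amount"); // assumed qualifier

    public static void recordFact(Connection conn, String dimKey,
                                  String hourBucket, String dayBucket,
                                  long amount) throws Exception {
        // 1) Write the granular fact row (row key = dimension members + hour bucket).
        try (Table fact = conn.getTable(TableName.valueOf("hourly_fact"))) {
            Put put = new Put(Bytes.toBytes(dimKey + ":" + hourBucket));
            put.addColumn(CF, AMOUNT, Bytes.toBytes(amount));
            fact.put(put);
        }
        // 2) Bump the matching row in the daily aggregate table.
        try (Table daily = conn.getTable(TableName.valueOf("daily_aggregates_fact_table"))) {
            Increment inc = new Increment(Bytes.toBytes(dimKey + ":" + dayBucket));
            inc.addColumn(CF, AMOUNT, amount);   // atomic add on the region server
            daily.increment(inc);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            recordFact(conn, "storeA:prodX", "2012-06-01-13", "2012-06-01", 42L);
        }
    }
}
```

Doing the same roll-up from a periodic Hive (or MapReduce) job over the base table, instead of per-row at write time, is just as plausible; either way the "materialized view" becomes a second HBase table that you keep up to date explicitly.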
Upvotes: 4
Views: 6858
Reputation: 584
The dimensions are encoded into the row key in HBase. The value is your measure value. If your fact table is factless, the value in the HBase row can be null.
Based on the limited resources available on the Internet, I think the idea is:
| **RowKey** | **Value** |
| --- | --- |
| DimensionA | XX |
| DimensionA:DimensionB | XX |
| DimensionB:DimensionC | XX |
| DimensionA:DimensionB:DimensionC | XXX |
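If I read the layout above correctly, the row key is simply the concatenation of dimension members and the measure is the cell value. A hedged Java sketch of a write against that layout (the table name "fact", column family "v", qualifier "measure", and the ':' separator are placeholders of mine):

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: the row key is the concatenated dimension members,
// and the measure is stored as the cell value.
public class CompositeKeyWrite {
    // Join dimension members with ':' as in the table above.
    static byte[] rowKey(String... dims) {
        return Bytes.toBytes(String.join(":", dims));
    }

    public static void put(Connection conn, long measure, String... dims) throws Exception {
        try (Table fact = conn.getTable(TableName.valueOf("fact"))) {     // assumed table name
            Put put = new Put(rowKey(dims));
            put.addColumn(Bytes.toBytes("v"), Bytes.toBytes("measure"),   // assumed family/qualifier
                          Bytes.toBytes(measure));
            fact.put(put);
        }
    }

    // Usage, e.g.: put(conn, 42L, "DimensionA_value", "DimensionB_value", "DimensionC_value");
}
```

The order in which the dimensions are concatenated matters a great deal, because reads can only drill down efficiently along a leading prefix of the key.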
Is this suitable for your problem?
Upvotes: 1
Reputation: 25919
HBase is not a good choice for a general-purpose data warehouse (with real-time-ish query times). Any single table will only allow you to drill down along one dimension, or along one path through the dimensions (if you design the composite key right). It isn't undoable (e.g. eBay built their new search engine on HBase), but it doesn't come out of the box.
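To make the "one path through the dimensions" point concrete, here is a hedged sketch of the kind of prefix scan a composite row key buys you; the key layout (region:store:product:day), table name, and column names are assumptions of mine, not anything from HBase or the answer above:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: with a composite row key like "region:store:product:day", a prefix
// scan drills down efficiently only along that leading path of the key.
public class DrillDownScan {
    public static long sumForRegionAndStore(Connection conn, String region, String store)
            throws Exception {
        try (Table fact = conn.getTable(TableName.valueOf("fact"))) {   // assumed table name
            Scan scan = new Scan();
            // Efficient: region and store are the leading components of the row key.
            scan.setRowPrefixFilter(Bytes.toBytes(region + ":" + store + ":"));
            long total = 0;
            try (ResultScanner results = fact.getScanner(scan)) {
                for (Result r : results) {
                    byte[] v = r.getValue(Bytes.toBytes("v"), Bytes.toBytes("measure"));
                    if (v != null) total += Bytes.toLong(v);
                }
            }
            return total;
        }
    }
    // Slicing by "product" alone (a trailing key component) has no such shortcut:
    // it means a full table scan with filters, or a second table keyed product-first.
}
```

That second-table workaround is essentially why a single HBase table can't serve arbitrary slice-and-dice queries the way a star schema with bitmap joins can.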
There are several efforts to provide high-performance SQL over Hadoop (e.g. Hadapt or Rainstor), but they won't give you the performance of a good massively parallel database like Vertica, Greenplum, Asterdata, Netezza and the like.
Upvotes: 0