Reputation: 5785
I'm thinking of using HBase to store web log data. Each log would have about 20 different values (let's say columns), and I want to run queries that filter results based on those columns.
My initial idea was to save each log (cell) multiple times, once under each column, where the column is the value of a field in the log. This would increase the data size about 20x, but I think it would give a good increase in query performance. The row key would be the timestamp, prefixed with the source ID.
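To make the key layout concrete, here is a minimal sketch in plain Python (no HBase client involved). It assumes a 4-byte numeric source ID and a millisecond timestamp, neither of which is fixed by the design above. HBase sorts rows by the raw bytes of the row key, so big-endian encoding keeps each source's logs contiguous and in time order. The `index_key` helper is a hypothetical variation, not the layout described above: it puts `field=value` in front of the primary key, the way a separate index table might, so a filter on one field becomes a prefix scan.

```python
import struct
from bisect import bisect_left


def row_key(source_id: int, ts_millis: int) -> bytes:
    """Primary key: 4-byte source id + 8-byte timestamp, both big-endian.

    Big-endian encoding makes byte-wise order match numeric order
    (for non-negative values), which is how HBase orders rows.
    """
    return struct.pack(">IQ", source_id, ts_millis)


def index_key(field: str, value: str, source_id: int, ts_millis: int) -> bytes:
    """Hypothetical index key: 'field=value', a NUL separator, then the primary key.

    Assumes field names and values never contain a NUL byte.
    """
    return f"{field}={value}".encode("utf-8") + b"\x00" + row_key(source_id, ts_millis)


def prefix_scan(sorted_keys, prefix: bytes):
    """Emulate a prefix scan over a byte-sorted list of keys."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


# All logs of source 1, returned in time order.
keys = sorted(row_key(s, t) for s, t in [(2, 5), (1, 300), (1, 7)])
source_1_logs = prefix_scan(keys, struct.pack(">I", 1))

# All index entries with status == "200" (the NUL separator excludes "2000").
index = sorted(
    index_key("status", v, 1, t) for v, t in [("200", 1), ("2000", 2), ("404", 3)]
)
status_200 = prefix_scan(index, b"status=200\x00")
```

With 20 fields, each log line produces 20 such index entries in addition to the primary row, which is where the roughly 20x growth comes from.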
Each source will generate about 40-100M log lines (there might be tens of thousands of sources).
I also need low latency, possibly below 10 seconds, so solutions like Hive are currently not an option.
Do you think this is the right schema design? If not, what do you think would be the right one? Or should I use something else, and if so, what?
Thanks for all your answers.
Upvotes: 4
Views: 4136
Reputation: 6424
We're doing something similar with weblogs. Our case is slightly more complicated than the one you present, but I can see similarities in the issues that could be encountered.
We created tables in Hive to store the various data we are collecting, then have a job that runs queries and loads the pre-aggregated results into tables in HBase.
This helps reduce data growth and duplication, as the raw data is stored only once and only the aggregations you want are stored alongside it. Using Hive to store the raw data gives more flexibility to aggregate by different dimensions and to manipulate the data in various ways.
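As a rough illustration of the pre-aggregation step, here is a minimal sketch in plain Python rather than an actual Hive job. The record fields (`source`, `ts`, `status`) and the hourly bucket size are assumptions made up for the example. Each raw log is counted once per (source, hour, dimension value), and the resulting counts are what would be loaded into HBase for low-latency lookups.

```python
from collections import Counter


def hour_bucket(ts_seconds: int) -> int:
    """Round a Unix timestamp (seconds) down to the start of its hour."""
    return ts_seconds - ts_seconds % 3600


def pre_aggregate(logs, dimension: str) -> Counter:
    """Count raw logs per (source, hour, value of the chosen dimension)."""
    counts = Counter()
    for log in logs:
        counts[(log["source"], hour_bucket(log["ts"]), log[dimension])] += 1
    return counts


# Hypothetical raw records; in the setup described above these would live in Hive.
raw_logs = [
    {"source": "a", "ts": 10, "status": "200"},
    {"source": "a", "ts": 3599, "status": "200"},
    {"source": "a", "ts": 3600, "status": "404"},
]

by_status = pre_aggregate(raw_logs, "status")
```

Aggregating by a different dimension means rerunning the job over the raw data with another field name; the raw records themselves are never duplicated.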
Depending on what your specific goals are, HBase might be the only requirement for storage, but if the goal is to aggregate and analyze data, I think Hive and HBase would work together better.
If your results are not needed in real time, then just using Hive to store the raw data and generating reports from a query may also be an acceptable solution.
I am by no means a definitive resource on setups for the HStack. I wasn't even a key member in the design of our existing system. I have, however, encountered a situation where we couldn't store data in HBase and retrieve it while maintaining an optimal setup/organization for HBase. The way we would have needed to store the data in order to retrieve it would have caused a lot of headaches in other areas.
I hope my ramblings have provided some help in some fashion. :)
Upvotes: 4