Reputation: 1877
I have a huge dataset (1.5 TB compressed) already imported into HDFS. The data is divided into 3 collections: users, items and events. These collections are basically folders full of .bz2
files, each of which contains many records in TAB-separated format, one per row. The schemas are as follows:
users:
id: string
age: string
sex: string
items:
id: string
data: string (JSON format)
events:
user_id: string
event_id: string
time: bigint
item_id: string
pos: int
city: string
state: string
device: string
property: string
interaction: int
I have several different queries I want to run on this dataset, most of them involve joining events and users or events and items on the corresponding *_id
fields (which are foreign keys to said collections).
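For illustration, a representative query shape (the particular columns and filter here are just an example, not one of my actual queries) would be:

```sql
-- Example only: join events to users on the foreign key,
-- filter on a user attribute, and aggregate.
SELECT u.age, COUNT(*) AS num_events
FROM events e
JOIN users u ON (e.user_id = u.id)
WHERE u.sex = 'F'
GROUP BY u.age;
```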
I have already imported all this data into HDFS and created external tables. The queries work but are extremely expensive, particularly when they involve the events table, whose compressed .bz2
files add up to about 1 TB. I have very little experience with Hive, but I've read about partitions, clusters and indices. It seems that with a clever design I can speed up my queries, especially the joins.
My question is, what is your suggestion for an efficient design of the three tables? The queries I need frequently involve one of the following operations: selecting on age, sex, state, city, or property (not at the same time) and aggregating over time.
I understand that perhaps not all of this can be accomplished efficiently with a single design, so I'm willing to create copies if necessary. I have around 18 TB of total available space in my cluster (I'm not accounting here for Hadoop's replication, so usable space is actually less).
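To make the question concrete, the kind of redesign I have been considering looks something like the sketch below (the ORC format and the bucket count of 64 are my guesses, not requirements):

```sql
-- Sketch: re-materialize events bucketed on the join key so that
-- joins with users can use bucket map joins; ORC storage is assumed.
CREATE TABLE events_bucketed (
  user_id string, event_id string, time bigint, item_id string,
  pos int, city string, state string, device string,
  property string, interaction int
)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 64 BUCKETS
STORED AS ORC;

SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE events_bucketed
SELECT * FROM events;
```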
Data is static, i.e., it will not change in the future. For additional reference, this is the Yahoo Y10 News Feed dataset.
Bonus: Ideally, if the data is stored in a format that is later easily accessible by Pig, even better!
Upvotes: 0
Views: 749
Reputation: 3973
Performance tuning is very specific to each case; there is no general solution, so you have to try several approaches to get the best performance.
Here are some tips based on my experience, sorted starting with the most important:
One good idea that I haven't tried myself is to put your dimension tables in HBase and the fact table in Hive; you can read HBase data from Hive, see here.
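For example, a Hive table backed by an HBase users table could be declared roughly like this (the column family `d` and the exact column mapping are illustrative, adjust them to your HBase layout):

```sql
-- Sketch: expose an HBase table to Hive via the HBase storage handler.
-- The HBase row key holds the user id; 'd' is an assumed column family.
CREATE EXTERNAL TABLE users_hbase (id string, age string, sex string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:age,d:sex")
TBLPROPERTIES ("hbase.table.name" = "users");
```

Hive can then join this table against your fact table as if it were a normal Hive table, while HBase serves the point lookups.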
Upvotes: 2