Reputation: 1877
I have a huge dataset (1.5 TB compressed) already imported into HDFS. The data is divided into 3 collections: users, items and events. These collections are basically folders full of .bz2
files, each of which contains many records in TAB-separated format, one per row. The schemas are as follows:
users:
id: string
age: string
sex: string
items:
id: string
data: string (JSON format)
events:
user_id: string
event_id: string
time: bigint
item_id: string
pos: int
city: string
state: string
device: string
property: string
interaction: int
I have several different queries I want to run on this dataset, most of them involve joining events and users or events and items on the corresponding *_id
fields (which are foreign keys to said collections).
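For illustration, a representative query shape (the particular columns and filter here are just an example, not one of my actual queries) would be:

```sql
-- Example only: join events to users on the foreign key,
-- filter on a user attribute, and aggregate.
SELECT u.age, COUNT(*) AS num_events
FROM events e
JOIN users u ON (e.user_id = u.id)
WHERE u.sex = 'F'
GROUP BY u.age;
```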
I have already imported all this data into HDFS and created external tables. The queries work but are extremely expensive, particularly when they involve the events table, whose compressed .bz2
files add up to about 1 TB. I have very little experience with Hive, but I've read about partitions, clusters and indices. It seems that with a clever design I can speed up my queries, especially the joins.
My question is, what is your suggestion for an efficient design of the three tables? The queries I need frequently involve one of the following operations: selecting on age, sex, state, city, or property (not at the same time) and aggregating over time.
I understand that perhaps not all of this can be accomplished efficiently with a single design, so I'm willing to create copies if necessary. I have around 18 TB of total available space in my cluster (I'm not accounting here for Hadoop's replication, so usable space is actually less).
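To make the question concrete, the kind of redesign I have been considering looks something like the sketch below (the ORC format and the bucket count of 64 are my guesses, not requirements):

```sql
-- Sketch: re-materialize events bucketed on the join key so that
-- joins with users can use bucket map joins; ORC storage is assumed.
CREATE TABLE events_bucketed (
  user_id string, event_id string, time bigint, item_id string,
  pos int, city string, state string, device string,
  property string, interaction int
)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 64 BUCKETS
STORED AS ORC;

SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE events_bucketed
SELECT * FROM events;
```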
Data is static, i.e., it will not change in the future. For additional reference, this is the Yahoo Y10 News Feed dataset.
Bonus: Ideally, if the data is stored in a format that is later easily accessible by Pig, even better!
Upvotes: 0
Views: 749
Reputation: 3973
Performance tuning is very specific to each case; there is no general solution, so you have to try several approaches to get the best performance.
Here are some tips based on my experience, sorted starting with the most important:
One good idea that I haven't tried myself is to put your dimension tables in HBase and the fact table in Hive; you can read HBase data from Hive, see here.
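For example, a Hive table backed by an HBase users table could be declared roughly like this (the column family `d` and the exact column mapping are illustrative, adjust them to your HBase layout):

```sql
-- Sketch: expose an HBase table to Hive via the HBase storage handler.
-- The HBase row key holds the user id; 'd' is an assumed column family.
CREATE EXTERNAL TABLE users_hbase (id string, age string, sex string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:age,d:sex")
TBLPROPERTIES ("hbase.table.name" = "users");
```

Hive can then join this table against your fact table as if it were a normal Hive table, while HBase serves the point lookups.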
Upvotes: 2