Join performance on AWS Elastic MapReduce running Hive

Question

I am running a simple join query:

 select count(*) from t1 join t2 on t1.sno=t2.sno

Tables t1 and t2 each have 20 million records, and column sno is of string data type.

The table data is imported into HDFS from Amazon S3 in RCFile format. The query took 109 seconds on 15 Amazon large instances, but it takes only 42 seconds on SQL Server with 16 GB of RAM and 16 CPU cores.

Am I missing anything? I can't understand why I am getting slower performance on Amazon.

Matthew Rathbone · Accepted Answer

Some questions to help you tune Hadoop performance:

What does your IO utilization look like on those instances? Maybe large instances are not the right balance of CPU / Disk / Memory for the job.
How are your files stored? Is it a single file, or many small files? Hadoop isn't so hot with many small files, even if they're combinable.
How many reducers did you run? Ideally you want about 0.9 * totalReduceCapacity.
How skewed is your data? If there are many records with the same key they will all go to the same reducer, and you'll have O(n*n) upper bound in that reducer if you're not careful.
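
As a rough illustration of the last two questions, here is a minimal Python sketch. The function names and the figure of 2 reduce slots per node are assumptions made for the example, not values taken from the question. The first function applies the 0.9 rule of thumb; the second counts how many row pairs each join key produces, which is where the O(n*n) cost of a skewed key comes from.

```python
from collections import Counter


def suggested_reducers(nodes, reduce_slots_per_node, factor=0.9):
    """Apply the 0.9 * total reduce capacity rule of thumb."""
    total_capacity = nodes * reduce_slots_per_node
    return max(1, int(total_capacity * factor))


def join_pairs_per_key(left_keys, right_keys):
    """Count the row pairs an inner join emits for each key.

    A key seen m times on the left and n times on the right yields
    m * n pairs, all of them handled by a single reducer.
    """
    left = Counter(left_keys)
    right = Counter(right_keys)
    return {key: left[key] * right[key] for key in left if key in right}


# Hypothetical cluster: 15 nodes, 2 reduce slots each (the slot count is assumed).
reducers = suggested_reducers(nodes=15, reduce_slots_per_node=2)

# Toy data: key "a" is heavily repeated on both sides of the join.
t1_sno = ["a"] * 1000 + ["b", "c"]
t2_sno = ["a"] * 1000 + ["b", "d"]
pairs = join_pairs_per_key(t1_sno, t2_sno)
```

In this toy data, key "a" alone accounts for 1,000 * 1,000 pairs on one reducer, while key "b" accounts for one. On the real tables, a per-key count on sno would show whether anything similar is happening.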

SQL Server might be fine with 40 million records, but wait until you have 2 billion records and see how it does. It will probably just break. I'd see Hive more as a clever wrapper for MapReduce than as an alternative to a real database.

Also, from experience, I think 15 c1.medium instances might perform just as well as the large instances, if not better. The large machines honestly don't have the right balance of CPU and memory.
