user3150024

Reputation: 139

Which will give the best performance: Hive, Pig, or Python MapReduce, with a text file and an Oracle table as sources?

I have the requirements below and am confused about which to choose for high performance. I am not a Java developer; I am comfortable with Hive, Pig, and Python.

I am using HDP 2.1 with the Tez engine. The data sources are text files (80 GB) and an Oracle table (15 GB); both are structured data. I have heard that Hive suits structured data, but also that Python MapReduce via the streaming interface outperforms both Hive and Pig. Please clarify.

I am currently using Hive with a Python UDF. The complete execution time, from copying the data into HDFS to the final result, is 2.30 hrs on a 4-node cluster.

My questions are:

1) I have heard that Java MapReduce is always faster. Is that also true of Python MapReduce via the streaming interface?

2) Can I achieve all of the above in Python: joins, retrieving the input text file name, and a compressed data flow such as ORC, given the high data volume?

3) Would a Pig join be better than a Hive join? If so, can we get the input text file name in Pig to generate an output column?

Thanks in advance.

Upvotes: 2

Views: 2542

Answers (1)

prateek05

Reputation: 503

  1. Python MapReduce, or anything using the Hadoop Streaming interface, will most likely be slower. That is due to the overhead of passing data through stdin and stdout, plus the implementation of the streaming-API consumer (in your case, Python). Python UDFs in Hive and Pig do the same thing.

  2. You probably do not want to compress the data flow into ORC on the Python side. You would be limited to Python ORC libraries, and I am not sure any are available. It is easier to have Python return your serialized rows and let the Hadoop reduce step compress and store them as ORC (Python acting purely as a UDF for computation).

  3. Yes. Pig and Python have a somewhat nice programmatic interface wherein you can write Python scripts to dynamically generate Pig logic and submit jobs in parallel; look up Embedding Pig Latin in Python. It is robust enough to let you define Python UDFs while Pig handles the overall abstraction and job optimization. Pig evaluates lazily, so with multiple joins or multiple transformations it can show quite good performance by optimizing the complete pipeline.
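
To make the stdin/stdout overhead in point 1 concrete, here is a minimal word-count sketch of a Hadoop Streaming job in Python. The jar path and I/O directories in the comment are illustrative, not taken from your cluster. Every record is serialized to text, piped into the Python process, parsed, and written back out, and that round trip is where the extra cost comes from:

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming word count.  Illustrative invocation:
#   hadoop jar hadoop-streaming.jar \
#       -mapper "python wc.py map" -reducer "python wc.py reduce" \
#       -input /data/in -output /data/out
import sys

def map_line(line):
    """Emit (word, 1) pairs for one input line."""
    return [(word, 1) for word in line.split()]

def reduce_pairs(pairs):
    """Sum counts per word; Hadoop delivers the pairs sorted by key."""
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return sorted(counts.items())

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    if mode == "map":
        # Hadoop pipes each input split through stdin, one record per line.
        for line in sys.stdin:
            for word, n in map_line(line):
                sys.stdout.write("%s\t%d\n" % (word, n))
    else:
        rows = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for word, n in reduce_pairs([(w, int(c)) for w, c in rows]):
            sys.stdout.write("%s\t%d\n" % (word, n))
```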
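For point 2, the division of labour looks like this in practice: the Python script is a plain stdin/stdout filter that Hive invokes via `TRANSFORM`, and Hive itself writes the result into an ORC-backed table. The table, column names, and the transformation itself are hypothetical, just to show the shape:

```python
#!/usr/bin/env python
# Python UDF for Hive's TRANSFORM clause: Hive streams rows in as
# tab-separated text on stdin and reads tab-separated rows back on stdout.
# The Hive side handles the ORC part (names below are hypothetical):
#
#   CREATE TABLE result (id STRING, amount DOUBLE) STORED AS ORC;
#   INSERT OVERWRITE TABLE result
#   SELECT TRANSFORM (id, amount)
#     USING 'python normalize.py' AS (id STRING, amount DOUBLE)
#   FROM staging;
import sys

def transform_row(fields):
    """Example computation: trim the key, round the amount to 2 decimals."""
    key, amount = fields[0], fields[1]
    return [key.strip(), "%.2f" % float(amount)]

if __name__ == "__main__":
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        sys.stdout.write("\t".join(transform_row(fields)) + "\n")
```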
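And for point 3, a minimal sketch of Embedding Pig Latin in Python: the script is run via `pig embedded_join.py` so it executes under Pig's bundled Jython. All paths are made up; note also that `PigStorage`'s `-tagFile` option prepends each record with its source file name, which is one way to get the input file name as an output column:

```python
# Sketch of "Embedding Pig Latin in Python"; runs under Pig's Jython.
def build_join_script(text_path, oracle_path, out_path):
    """Assemble a Pig Latin join pipeline as a plain string."""
    return "\n".join([
        # '-tagFile' adds the source file name as the first column.
        "A = LOAD '%s' USING PigStorage('\\t', '-tagFile');" % text_path,
        "B = LOAD '%s' USING PigStorage('\\t');" % oracle_path,
        "J = JOIN A BY $1, B BY $0;",
        "STORE J INTO '%s' USING PigStorage('\\t');" % out_path,
    ])

if __name__ == "__main__":
    try:
        from org.apache.pig.scripting import Pig  # only inside Pig's Jython
    except ImportError:
        Pig = None  # plain CPython: just build the script for inspection
    script = build_join_script("/data/text", "/data/oracle", "/data/joined")
    if Pig is not None:
        stats = Pig.compile(script).bind().runSingle()
        if not stats.isSuccessful():
            raise RuntimeError(stats.getErrorMessage())
    else:
        print(script)
```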

You mention HDP 2.1. Have you had a look at Spark? If performance is important to you, then given your dataset sizes, which do not look huge, you can expect the overall pipeline to run many times faster than on Hadoop's native MR engine.

Upvotes: 3
