user3150024

Reputation: 139

Which will give the best performance: Hive, Pig, or Python MapReduce, with a text file and an Oracle table as sources?

I have the requirements below and am confused about which to choose for high performance. I am not a Java developer; I am comfortable with Hive, Pig, and Python.

I am using HDP 2.1 with the Tez engine. The data sources are text files (80 GB) and an Oracle table (15 GB); both are structured data. I have heard that Hive suits structured data, but also that Python MapReduce via the streaming interface outperforms both Hive and Pig. Please clarify.

I am currently using Hive with a Python UDF. The complete execution time, from copying the data into HDFS to the final result, is 2.30 hrs on a 4-node cluster.

My questions are:

1) I have heard that Java MapReduce is always faster. Is that also true of Python MapReduce via the streaming interface?

2) Can I achieve all of the above in Python: joins, retrieving the input text file name, and a compressed data flow such as ORC, given the high data volume?

3) Would a Pig join be better than a Hive join? If so, can we get the input text file name in Pig to generate an output column?

Thanks in advance.

Upvotes: 2

Views: 2542

Answers (1)

prateek05

Reputation: 503

  1. Python MapReduce, or anything using the Hadoop Streaming interface, will most likely be slower. That is due to the overhead of passing data through stdin and stdout, plus the implementation of the streaming-API consumer (in your case, Python). Python UDFs in Hive and Pig do the same thing.

  2. You probably do not want to compress the data flow into ORC on the Python side. You would be limited to Python ORC libraries, and I am not sure any are available. It is easier to have Python return your serialized rows and let the Hadoop reduce step compress and store them as ORC (Python acting purely as a UDF for computation).

  3. Yes. Pig and Python have a somewhat nice programmatic interface wherein you can write Python scripts to dynamically generate Pig logic and submit jobs in parallel; look up Embedding Pig Latin in Python. It is robust enough to let you define Python UDFs while Pig handles the overall abstraction and job optimization. Pig evaluates lazily, so with multiple joins or multiple transformations it can show quite good performance by optimizing the complete pipeline.
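
To make the stdin/stdout overhead in point 1 concrete, here is a minimal word-count sketch of a Hadoop Streaming job in Python. The jar path and I/O directories in the comment are illustrative, not taken from your cluster. Every record is serialized to text, piped into the Python process, parsed, and written back out, and that round trip is where the extra cost comes from:

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming word count.  Illustrative invocation:
#   hadoop jar hadoop-streaming.jar \
#       -mapper "python wc.py map" -reducer "python wc.py reduce" \
#       -input /data/in -output /data/out
import sys

def map_line(line):
    """Emit (word, 1) pairs for one input line."""
    return [(word, 1) for word in line.split()]

def reduce_pairs(pairs):
    """Sum counts per word; Hadoop delivers the pairs sorted by key."""
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return sorted(counts.items())

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    if mode == "map":
        # Hadoop pipes each input split through stdin, one record per line.
        for line in sys.stdin:
            for word, n in map_line(line):
                sys.stdout.write("%s\t%d\n" % (word, n))
    else:
        rows = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for word, n in reduce_pairs([(w, int(c)) for w, c in rows]):
            sys.stdout.write("%s\t%d\n" % (word, n))
```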
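For point 2, the division of labour looks like this in practice: the Python script is a plain stdin/stdout filter that Hive invokes via `TRANSFORM`, and Hive itself writes the result into an ORC-backed table. The table, column names, and the transformation itself are hypothetical, just to show the shape:

```python
#!/usr/bin/env python
# Python UDF for Hive's TRANSFORM clause: Hive streams rows in as
# tab-separated text on stdin and reads tab-separated rows back on stdout.
# The Hive side handles the ORC part (names below are hypothetical):
#
#   CREATE TABLE result (id STRING, amount DOUBLE) STORED AS ORC;
#   INSERT OVERWRITE TABLE result
#   SELECT TRANSFORM (id, amount)
#     USING 'python normalize.py' AS (id STRING, amount DOUBLE)
#   FROM staging;
import sys

def transform_row(fields):
    """Example computation: trim the key, round the amount to 2 decimals."""
    key, amount = fields[0], fields[1]
    return [key.strip(), "%.2f" % float(amount)]

if __name__ == "__main__":
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        sys.stdout.write("\t".join(transform_row(fields)) + "\n")
```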
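And for point 3, a minimal sketch of Embedding Pig Latin in Python: the script is run via `pig embedded_join.py` so it executes under Pig's bundled Jython. All paths are made up; note also that `PigStorage`'s `-tagFile` option prepends each record with its source file name, which is one way to get the input file name as an output column:

```python
# Sketch of "Embedding Pig Latin in Python"; runs under Pig's Jython.
def build_join_script(text_path, oracle_path, out_path):
    """Assemble a Pig Latin join pipeline as a plain string."""
    return "\n".join([
        # '-tagFile' adds the source file name as the first column.
        "A = LOAD '%s' USING PigStorage('\\t', '-tagFile');" % text_path,
        "B = LOAD '%s' USING PigStorage('\\t');" % oracle_path,
        "J = JOIN A BY $1, B BY $0;",
        "STORE J INTO '%s' USING PigStorage('\\t');" % out_path,
    ])

if __name__ == "__main__":
    try:
        from org.apache.pig.scripting import Pig  # only inside Pig's Jython
    except ImportError:
        Pig = None  # plain CPython: just build the script for inspection
    script = build_join_script("/data/text", "/data/oracle", "/data/joined")
    if Pig is not None:
        stats = Pig.compile(script).bind().runSingle()
        if not stats.isSuccessful():
            raise RuntimeError(stats.getErrorMessage())
    else:
        print(script)
```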

You mention HDP 2.1. Have you had a look at Spark? If performance is important to you, then given your dataset sizes, which do not look huge, you can expect the overall pipeline to run many times faster than on Hadoop's native MR engine.

Upvotes: 3
