Reputation: 599
I want to know what performance gain or loss to expect if I use Pig in local mode (which internally calls MapReduce) versus using the pig-withouthadoop.jar file.
Does pig-withouthadoop.jar really not use Hadoop?
And if I only want to use Pig without a cluster, for example to design a data flow, which should I use: Pig in local mode or the pig-withouthadoop.jar file?
I have written my script using Pig local mode. While deploying it to a server and setting up Pig in local mode there, it looks like I also need to set HADOOP_HOME in the environment before setting PIG_HOME.
Kindly advise.
Thanks in advance. :)
Upvotes: 0
Views: 828
Reputation: 545
Let me answer your questions in order:
1) On performance: assume the file size and the Pig script are the same in local mode and Hadoop mode. Processing will then be faster in local mode, because all the tasks run in a single JVM. In Hadoop mode the input file has to be carried to the data nodes, and the Pig script and UDFs also have to be shipped to the cluster, which takes more time. In both cases the Pig script and UDFs are internally converted into map and reduce tasks, and the number of map and reduce stages constructed is the same. You can check this with the EXPLAIN command.
2) No. Pig internally contains a bundle of Hadoop jars. So if you haven't started Hadoop with the start-all.sh command, Pig will still work, because it uses its bundled Hadoop jars. The interesting part: if you have installed Hadoop and then run Pig without starting Hadoop, it will sometimes fail because of a Hadoop version mismatch. To be on the safe side, start Hadoop explicitly. So Pig always uses Hadoop. :)
3) Use local mode whenever the input files are small. As already explained, Pig comes with the Hadoop jars by default.
4) Yes, you need to set it if you are using Hadoop explicitly.
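As a rough illustration of point 1, here is a minimal sketch of checking the plan with EXPLAIN in local mode. The directory, file names and the word-count script are made up for the example; only `pig -x local` and the Pig Latin `EXPLAIN` statement come from Pig itself. The script is written out first and is only run if a `pig` launcher is on the PATH.

```shell
#!/bin/sh
# Hypothetical working directory and file names, chosen for this example only.
workdir="${TMPDIR:-/tmp}/pig_explain_demo"
mkdir -p "$workdir"

# Tiny sample input: two lines of space-separated words.
printf 'a b a\nc a b\n' > "$workdir/input.txt"

# A small word-count script that ends with EXPLAIN, so Pig prints the
# logical, physical and MapReduce plans for the alias `counts`.
cat > "$workdir/wordcount.pig" <<'EOF'
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
EXPLAIN counts;
EOF

# Run in local mode only if Pig is installed; otherwise just report
# where the script was written.
if command -v pig >/dev/null 2>&1; then
    (cd "$workdir" && pig -x local wordcount.pig)
else
    echo "pig not found on PATH; script written to $workdir/wordcount.pig"
fi
```

Running the same script with `-x mapreduce` against a cluster and comparing the printed plans is one way to see whether the map and reduce stages match.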
Upvotes: 3
Reputation: 230
Local mode runs Pig and the MapReduce engine (MR1, or YARN+MR2) in a single JVM, reading and writing the local file system rather than HDFS.
It's not really meaningful to compare performance between local and cluster modes. Local mode is generally used for testing, or for running small MR jobs that fit on one node.
With regard to pig-withouthadoop.jar, I can see how the jar's name could be construed to mean that Pig won't use Hadoop. But that is not the case.
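Pig packages two jars relevant to execution: pig.jar, which bundles the Hadoop classes, and pig-withouthadoop.jar, which leaves them out and expects a Hadoop installation on the classpath.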
Either way, you'll need to ensure the Hadoop configs (hdfs-site.xml, mapred-site.xml, etc.) are in the standard location (typically /etc/hadoop/conf/) for Pig to work.
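A minimal sketch of the environment setup this implies, assuming a typical package layout. The install paths under /usr/lib are placeholders, not something stated above, so substitute your own. PIG_CLASSPATH is pointed at the Hadoop conf directory so that Pig can pick up the site files.

```shell
#!/bin/sh
# All default paths below are assumptions; existing values are kept if set.
export HADOOP_HOME="${HADOOP_HOME:-/usr/lib/hadoop}"
export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"
export PIG_HOME="${PIG_HOME:-/usr/lib/pig}"

# Let Pig see the Hadoop configuration files.
export PIG_CLASSPATH="$HADOOP_CONF_DIR"
export PATH="$PIG_HOME/bin:$PATH"

# Report whether the config files mentioned above are actually present.
for f in hdfs-site.xml mapred-site.xml; do
    if [ -f "$HADOOP_CONF_DIR/$f" ]; then
        echo "found   $HADOOP_CONF_DIR/$f"
    else
        echo "missing $HADOOP_CONF_DIR/$f"
    fi
done
```

Sourcing this from a shell profile before calling `pig` keeps the settings in one place; the check loop only prints a report and does not stop on a missing file.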
Upvotes: 2