Argho Chatterjee

Reputation: 599

Difference between Pig in local mode vs pig-withouthadoop.jar

I wanted to know what the performance gain or loss is if I use Pig in local mode (which internally calls MapReduce) vs. using the pig-withouthadoop.jar file.

Does pig-withouthadoop.jar really not use Hadoop?

And if I only want to use Pig without a cluster, e.g. to design a data flow, what should I use: Pig in local mode, or the pig-withouthadoop.jar file?

Currently I have written my script using Pig local mode, and while trying to deploy it on a server and set up Pig in local mode, I think I also need HADOOP_HOME to be set in the environment variables before setting the PIG_HOME variable.
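
This is roughly the setup I have in mind on the server (the paths and script name below are only placeholders for my installation):

    # hypothetical install locations -- adjust to the actual directories
    export HADOOP_HOME=/opt/hadoop          # only needed if Hadoop is installed separately
    export PIG_HOME=/opt/pig
    export PATH=$PATH:$PIG_HOME/bin:$HADOOP_HOME/bin

    # run the existing script in local mode (single JVM, local file system)
    pig -x local myscript.pig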

Kindly advise.

Thanks in advance. :)

Upvotes: 0

Views: 828

Answers (2)

hello_abhishek
hello_abhishek

Reputation: 545

Let me answer your questions in sequence:

1) For performance, assume the file size and the Pig script are the same in local mode and in Hadoop mode. Processing will definitely be faster in local mode, because all the work is performed in a single JVM, whereas in Hadoop mode the input file has to be shipped to the data nodes and the Pig script or UDFs also have to be distributed to the cluster, which takes more time. In both cases, however, the Pig script and UDFs are internally converted into map and reduce tasks, and the number of map and reduce stages generated is the same in both modes. You can check this with the EXPLAIN command (see the sketch after this list).

2) No. Pig internally bundles the Hadoop jars. So even if you haven't started Hadoop with the start-all.sh command, Pig will still work, because it falls back on the bundled Hadoop jars. The interesting part: if you have installed Hadoop and then use Pig without starting Hadoop, it sometimes will not work because of a Hadoop version mismatch. So, to be on the safe side, start Hadoop explicitly. In short, Pig always uses Hadoop. :)

3) Always use local mode if the file size is small. As already explained, Pig by default comes with the Hadoop jars.

4) Yes, you need to set this if you are using Hadoop explicitly.
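
A minimal sketch of how you can compare the two modes yourself (the script, input and output names below are only placeholders):

    # write a trivial word-count script whose plan we can inspect
    cat > wordcount.pig <<'EOF'
    lines  = LOAD 'input.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
    EXPLAIN counts;              -- prints the logical, physical and MapReduce plans
    STORE counts INTO 'wc_out';
    EOF

    # local mode: everything runs in a single JVM against the local file system
    pig -x local wordcount.pig

    # MapReduce mode: the same script is submitted to the cluster (Hadoop must be running);
    # EXPLAIN should report the same number of map/reduce stages as the local run
    pig -x mapreduce wordcount.pig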

Upvotes: 3

user3730028

Reputation: 230

Local mode literally runs Pig and the MapReduce framework (MR1, or YARN+MR2) in a single JVM, working against the local file system rather than HDFS.

It's not really meaningful to compare performance between local and cluster modes. Local mode is generally used for testing, or for running small MR jobs that can fit on one node.

With regard to pig-withouthadoop.jar, I can see how the jar's name could be construed to mean that Pig won't be using Hadoop. But that is not the case.

Pig packages two jars relevant to execution:

  • pig.jar, which is an "uber jar" that also includes all the Hadoop and MapReduce jars. You can literally take that jar to a box that does not already have Hadoop installed and run Pig (after setting the right configs and environment).
  • But most clusters already have Hadoop installed and configured. In that case, you use pig-withouthadoop.jar. This jar is half the size of the uber jar, for obvious reasons.

Either way, you'll need to ensure the Hadoop configs (hdfs-site.xml, mapred-site.xml, etc.) are in the standard location (typically /etc/hadoop/conf/) for Pig to work.
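
A rough sketch of the two situations (paths and the script name are placeholders, and the exact behaviour of the launcher script varies slightly across Pig versions):

    # 1) box WITHOUT a Hadoop install: the self-contained uber jar is enough
    java -cp /opt/pig/pig.jar org.apache.pig.Main -x local myscript.pig

    # 2) box WITH Hadoop installed and configured: reuse the cluster's jars and configs
    export HADOOP_HOME=/usr/lib/hadoop            # wherever Hadoop lives on the box
    export HADOOP_CONF_DIR=/etc/hadoop/conf       # hdfs-site.xml, mapred-site.xml, ...
    export PIG_CLASSPATH=$HADOOP_CONF_DIR
    # the bin/pig launcher detects the local Hadoop and uses pig-withouthadoop.jar
    # together with the cluster's own Hadoop classpath
    pig -x mapreduce myscript.pig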

Upvotes: 2
