Reputation: 235
I am new to Linux and Apache Pig. I am following this tutorial to learn Pig: http://salsahpc.indiana.edu/ScienceCloud/pig_word_count_tutorial.htm
This is a basic word counting example. The data file 'input.txt' and the program file 'wordcount.pig' are in the Wordcount package, linked on the site.
I already have Pig 0.11.1 downloaded on my local machine, as well as Hadoop and Java 6.
When I downloaded the Wordcount package it took me to a "tar.gz" file. I am unfamiliar with this type, and wasn't sure how to extract it.
It contains the files 'input.txt', 'wordcount.pig' and a Readme file. I saved 'input.txt' to my Desktop. I wasn't sure where to save wordcount.pig, and decided to just type in the commands line by line in the shell.
I ran Pig in local mode as follows: pig -x local
Then I copy-pasted each line of the wordcount.pig script at the grunt> prompt, like this:
A = load '/home/me/Desktop/input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
dump D;
This generates the following errors: ...
Retrying connect to server: localhost/127.0.0.1:8021. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2043: Unexpected error during execution.
My questions:
1. Should I be saving 'input.txt' and the original 'wordcount.pig' script to some special folder inside the pig-0.11.1 directory? That is, should I create a folder called word inside pig-0.11.1, put 'wordcount.pig' and 'input.txt' there, and then type "wordcount.pig" at the grunt> prompt? In general, if I have data in, say, 'dat.txt' and a script, say, 'program.pig', where should I save them in order to run 'program.pig' from the grunt shell? I think they should both go in pig-0.11.1, so that I can do $ pig -x local wordcount.pig, but I am not sure.
2. Why am I not able to run the script line by line as I tried to? I have specified the location of the file 'input.txt' in the load statement. So why does it not just run the commands line by line and dump the contents of D to my screen?
3. When I try to run Pig in mapreduce mode using $pig, it gives this error:
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) 2013-06-03 23:57:06,956 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Failed to create DataStorage
Upvotes: 0
Views: 5359
Reputation: 1
Error 2043 occurs when Hadoop and Pig fail to communicate with each other.
Never right-click and choose "Extract here" when dealing with tar.gz files.
Always extract them in a terminal with tar -xzvf *.tar.gz instead.
I noticed that Pig doesn't get installed properly when you right-click the pig..tar.gz file and select "Extract here". It's better to run tar -xzvf pig..tar.gz from a terminal.
Make sure Hadoop is running before you execute commands such as pig -x local.
If you want to run a .pig file from the grunt> prompt, use: grunt> exec wordcount.pig
If you want to run a .pig file outside the grunt> prompt, use: $ pig -x local wordcount.pig
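To make the extraction step concrete, here is a minimal sketch. It builds its own throwaway archive so it can be tried anywhere; the names demo/ and wordcount.tar.gz are made up for illustration and are not necessarily what the tutorial package uses. The Pig invocation is left as a comment because it assumes Pig is installed and on your PATH.

```shell
#!/bin/sh
# Sketch only: demo/ and wordcount.tar.gz are hypothetical names.
set -e

# Work in a temporary directory so nothing on the Desktop is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small archive to stand in for the downloaded package.
mkdir demo
printf 'hello world hello\n' > demo/input.txt
printf "A = load 'input.txt';\n" > demo/wordcount.pig
tar -czf wordcount.tar.gz demo
rm -r demo

# -t lists the contents without extracting; -x extracts.
# -z means gzip-compressed, -v is verbose, -f names the archive file.
tar -tzf wordcount.tar.gz
tar -xzvf wordcount.tar.gz

# With Pig installed, the script could then be run from the extracted folder:
#   cd demo && pig -x local wordcount.pig
```

Running the script from the folder that holds input.txt means a relative path in the load statement resolves without editing it.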
Upvotes: 0
Reputation: 5811
This error indicates that Pig is unable to connect to Hadoop to run the job. You say you have downloaded Hadoop -- have you installed it? If you have installed it, have you started it up according to its docs -- have you run the bin/start-all.sh script? Using -x local tells Pig to use the local filesystem instead of HDFS, but it still needs a running Hadoop instance to perform the execution. Before trying to run Pig, follow the Hadoop docs to get your local "cluster" set up and make sure your NameNode, DataNodes, etc. are up and running.
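One way to check that the daemons are up is to look at the output of jps, the JDK tool that lists running Java processes. The sketch below is only an illustration: missing_daemons is a helper name invented here, and the daemon list assumes a Hadoop 1.x single-node setup (NameNode, DataNode, JobTracker, TaskTracker), so adjust it to whatever your installation actually starts.

```shell
#!/bin/sh
# Sketch only: missing_daemons is a hypothetical helper, and the daemon
# names below assume a Hadoop 1.x single-node setup.

# Reads `jps` output on stdin and prints each expected daemon not found in it.
missing_daemons() {
    jps_output=$(cat)
    for d in NameNode DataNode JobTracker TaskTracker; do
        # jps prints "<pid> <MainClass>"; compare column 2 as a whole line
        # so that SecondaryNameNode is not mistaken for NameNode.
        if ! printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qx "$d"; then
            echo "$d"
        fi
    done
}

# Only call jps if a JDK is installed; otherwise do nothing.
if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons
fi
```

If this prints nothing, all four daemons were found; any name it prints is a daemon that is not running, which would be consistent with the "Retrying connect to server" messages in the question.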
Upvotes: 3