George

Reputation: 63

How can I run the wordCount example in Hadoop?

I'm trying to run the following example in hadoop: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

However, I don't understand the commands being used, specifically how to create an input file, upload it to HDFS, and then run the word count example.

I'm trying the following command:

bin/hadoop fs -put inputFolder/inputFile inputHDFS/

however it says

put: File inputFolder/inputFile does not exist

I have this folder inside the hadoop folder (the parent of the "bin" directory), so why is this happening?

thanks :)

Upvotes: 1

Views: 5207

Answers (1)

sa125

Reputation: 28971

Hopefully this isn't overkill:

Assuming you've installed hadoop (in either local, distributed or pseudo-distributed mode), you have to make sure hadoop's bin directory and related environment variables are in your path. On linux/mac this is a simple matter of adding the following to one of your shell files (~/.bashrc, ~/.zshrc, ~/.bash_profile, etc. - depending on your setup and preferences):

export HADOOP_INSTALL_DIR=/path/to/hadoop # /opt/hadoop or /usr/local/hadoop, for example
export JAVA_HOME=/path/to/jvm
export PATH=$PATH:$HADOOP_INSTALL_DIR/bin
export PATH=$PATH:$HADOOP_INSTALL_DIR/sbin

Then run exec $SHELL or reload your terminal. To verify hadoop is installed correctly, type hadoop version and check that no errors are raised. Assuming you followed the instructions on how to set up a single node cluster and started the hadoop services with the start-all.sh command, you should be good to go.
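
A quick sanity check (just a sketch - the exact daemon names you'll see vary by hadoop version, e.g. NameNode/DataNode/JobTracker/TaskTracker on 1.x):

$ hadoop version   # prints the version banner; no errors means your PATH is set up correctly
$ jps              # part of the JDK - lists running JVMs, so you can see which hadoop daemons are up

With the daemons up, there are two ways to get your input where a job can read it: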

  • In local (standalone) mode, Hadoop works directly against your local file system, so you can reference any path just like you would with any other linux command, like cat or grep. This is useful for testing, and you don't have to copy anything around.

  • With an actual HDFS running (pseudo-distributed or fully distributed), I use the copyFromLocal command (I find it just works):

      $ hadoop fs -copyFromLocal ~/data/testfile.txt /user/hadoopuser/data/
    

Here I've assumed you're performing the copy on a machine that is part of the cluster. Note that if your hadoopuser is the same as your unix username, you can drop the /user/hadoopuser/ part - relative paths are implicitly resolved against your HDFS user dir. Also, if you're using a client machine to run commands against a cluster (you can do that too!), know that you'll need to pass the cluster's configuration using the -conf flag right after hadoop fs, like:

# assumes your username is the same as the one on HDFS, as explained earlier
$ hadoop fs -conf ~/conf/hadoop-cluster.xml -copyFromLocal ~/data/testfile.txt data/ 
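
Either way, a quick listing of the target directory (same paths as above - adjust them to your setup) confirms the file actually made it onto HDFS:

$ hadoop fs -ls /user/hadoopuser/data/
# should show testfile.txt along with its size, owner and modification time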

For the input file, you can use any file(s) that contain text. I used some random files from the Project Gutenberg site.
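
If you just want something quick to test with, a throwaway file is fine - for example (the paths below are only placeholders; reuse whatever local and HDFS directories you actually have):

$ mkdir -p ~/data
$ echo "hello hadoop hello world" > ~/data/testfile.txt
$ hadoop fs -mkdir /user/hadoopuser/data   # on newer hadoop versions add -p if the parent dirs don't exist
$ hadoop fs -put ~/data/testfile.txt /user/hadoopuser/data/

Note that -put (like -copyFromLocal) resolves its local argument against your current working directory, which is usually why it complains that a file like inputFolder/inputFile does not exist - make sure you run the command from the directory that actually contains inputFolder, or use an absolute path.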

Last, to run the wordcount example (it comes as a jar in the hadoop distro), just run the command:

$ hadoop jar /path/to/hadoop-*-examples.jar wordcount /user/hadoopuser/data/ /user/hadoopuser/output/wc

This will read everything in the data/ folder (it can have one or many files) and write everything to the output/wc folder - all on HDFS. If you run this in local (standalone) mode, there's no need to copy anything - just point it at the proper local input and output dirs. Make sure the wc dir doesn't exist before you run, or your job will crash (it cannot write over an existing dir). See this for a better wordcount breakdown.
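
Once the job finishes you can peek at the results directly on HDFS (the part-* file names depend on the hadoop version and number of reducers, so treat this as a sketch):

$ hadoop fs -cat /user/hadoopuser/output/wc/part-* | head
# or merge the whole output folder into a single local file:
$ hadoop fs -getmerge /user/hadoopuser/output/wc ~/wc-results.txt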

Again, all this assumes you've made it through the setup stages successfully (no small feat).

Hope this wasn't too confusing - good luck!

Upvotes: 2
